easyeda: a simple tool for data exploration

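Overview: simple summary statistics for a dataset (non-null rate; non-null, non-zero rate; rate of values that are non-null, non-zero and not -1; number of distinct values; mode; correlation; same-distribution score; value distribution)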
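To make the first few statistics concrete, here is a minimal pandas sketch of how such per-column rates could be computed. It is an illustration only, not easyeda's implementation: the function name `basic_rates`, the English column names, and the toy data are all assumptions, and the correlation, same-distribution and value-distribution columns are left out because the source does not say how they are defined.

```python
import pandas as pd


def basic_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column rates in the spirit of the statistics listed above (hypothetical helper)."""
    not_null = df.notna()                  # True where a value is present
    not_zero = not_null & df.ne(0)         # present and different from 0
    not_minus1 = not_zero & df.ne(-1)      # present, not 0 and not -1
    return pd.DataFrame({
        "not_null_rate": not_null.mean(),
        "not_null_not_zero_rate": not_zero.mean(),
        "not_null_not_zero_not_minus1_rate": not_minus1.mean(),
        "unique_values": df.nunique(),     # NaN is not counted as a value
        "mode": df.mode().iloc[0],         # first mode if there are ties
    })


# Toy data (made up for illustration)
toy = pd.DataFrame({"a": [1.0, 0.0, None, -1.0], "b": [2, 2, 3, 0]})
summary = basic_rates(toy)
print(summary)
```

Each rate is the mean of a boolean mask, so it is the fraction of rows in the column that satisfy the condition.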

Install

pip install easyeda

Worked example with the iris dataset

(1) Prepare the data

from easyeda import eda
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the iris dataset (download the CSV beforehand)
iris = pd.read_csv("plotly-datasets/iris.csv")
# Copy the class name into a "label" column
iris["label"] = iris.Name
iris.head()

   SepalLength  SepalWidth  PetalLength  PetalWidth         Name        label
0          5.1         3.5          1.4         0.2  Iris-setosa  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa  Iris-setosa

(2) Mark the class attribute as label and convert it to a numeric type

# List the distinct class names
iris.groupby('Name').count().index

# Map each class name to an integer code
name_to_label = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}

iris['label'] = iris['Name'].map(name_to_label)
del iris['Name']  # the numeric label replaces the original column

iris.head()

   SepalLength  SepalWidth  PetalLength  PetalWidth  label
0          5.1         3.5          1.4         0.2      0
1          4.9         3.0          1.4         0.2      0
2          4.7         3.2          1.3         0.2      0
3          4.6         3.1          1.5         0.2      0
4          5.0         3.6          1.4         0.2      0
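One thing to watch with the dictionary mapping in step (2): `Series.map` returns NaN for any value that is not a key of the dictionary, so a misspelled or unexpected class name would silently become a missing label. A small check along these lines can catch that before the original column is deleted. The toy frame and the misspelled name are made up for illustration.

```python
import pandas as pd

name_to_label = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# Toy frame with one deliberately misspelled class name (hypothetical data)
df = pd.DataFrame({"Name": ["Iris-setosa", "Iris-virginica", "Iris-versicolour"]})

labels = df["Name"].map(name_to_label)

# Names that had no entry in the dictionary end up as NaN in `labels`
unmapped = df.loc[labels.isna(), "Name"].unique().tolist()
if unmapped:
    print("Unmapped class names:", unmapped)
```

If `unmapped` is empty, every row received a numeric label and it is safe to drop the `Name` column.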

(3) Split the dataset and set the language

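# Split into training and test sets at a 7:3 ratio
dftrain, dftest = train_test_split(iris, test_size=0.3)
# Summarize both sets; language="Chinese" selects Chinese column names
dfeda = eda(dftrain, dftest, language="Chinese")

dfeda

             非空率  非空非零率   非空非零非负1率  取值类别数  众数   相关性      同分布性    取值分布
SepalLength  1.0    1.000000    1.000000         33          5.0    0.842995    0.888889    [(5.0, 9), (5.1, 7), (6.7, 7), (5.7, 7), (6.3,...
SepalWidth   1.0    1.000000    1.000000         22          3.0    0.615942    0.774603    [(3.0, 18), (3.2, 10), (3.1, 9), (3.4, 9), (2...
PetalLength  1.0    1.000000    1.000000         39          1.5    1.000000    0.917460    [(1.5, 9), (5.1, 7), (1.3, 7), (4.5, 6), (1.6,...
PetalWidth   1.0    1.000000    1.000000         22          0.2    1.000000    0.930159    [(0.2, 20), (1.5, 9), (1.8, 7), (1.4, 7), (1.3...
label        1.0    0.657143    0.657143         3           0.0    1.000000    0.968254    [(0.0, 36), (2.0, 35), (1.0, 34)]

The column headers are in Chinese because language="Chinese" was passed. From left to right they are: non-null rate; non-null, non-zero rate; rate of values that are non-null, non-zero and not -1; number of distinct values; mode; correlation; same-distribution score; value distribution. For example, the label row has a non-null, non-zero rate of 0.657143 because 36 of the 105 training rows have label 0, leaving 69/105.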
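The split in step (3) does not fix a random seed, so the numbers in the table above will change from run to run, and the class counts in the training set are only roughly balanced (36, 35, 34). If reproducible, class-balanced splits are wanted, scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments. The sketch below uses a made-up toy frame in place of the iris data; the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the iris frame: 3 classes, 10 rows each (hypothetical data)
df = pd.DataFrame({
    "x": range(30),
    "label": [0] * 10 + [1] * 10 + [2] * 10,
})

# random_state makes the split repeatable; stratify keeps class proportions equal
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["label"]
)

train_counts = train["label"].value_counts().sort_index().tolist()
test_counts = test["label"].value_counts().sort_index().tolist()
print(train_counts, test_counts)
```

With the real data, the same two arguments could be added to the call in step (3), passing `iris["label"]` as the `stratify` argument.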