2020-09-17

easyeda: a simple tool for data exploration

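Overview: easyeda computes simple per-column statistics: non-null rate (非空率); non-null, non-zero rate (非空非零率); non-null, non-zero, non-minus-one rate (非空非零非负1率); number of distinct values (取值类别数); mode (众数); correlation (相关性); same-distribution score (同分布性); and value distribution (取值分布).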
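To make the first few statistics concrete, here is a minimal pandas sketch of the three rate columns, the distinct-value count, and the mode. It is not easyeda's implementation: the function name `basic_stats`, the English column names, and the toy data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def basic_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary in the spirit of the overview list (hypothetical helper)."""
    rows = {}
    for col in df.columns:
        s = df[col]
        not_null = s.notna()                    # value is present
        not_zero = not_null & (s != 0)          # present and not 0
        not_minus1 = not_zero & (s != -1)       # present, not 0, not -1
        mode = s.mode(dropna=True)              # most frequent value(s), sorted
        rows[col] = {
            "not_null_rate": not_null.mean(),
            "not_null_nonzero_rate": not_zero.mean(),
            "not_null_nonzero_not_minus1_rate": not_minus1.mean(),
            "n_unique": s.nunique(dropna=True),
            "mode": mode.iloc[0] if len(mode) else np.nan,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data: column x holds one zero, one -1, and one missing value.
demo = pd.DataFrame({"x": [0.0, 1.0, -1.0, np.nan], "y": [2, 2, 3, 0]})
stats = basic_stats(demo)
print(stats)
```

Each rate is the mean of a boolean mask, so it is the fraction of rows that pass the corresponding filter.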

Install

pip install easyeda

Hands-on example with the iris dataset

(1) Prepare the data

from easyeda import eda
import pandas as pd
from sklearn.model_selection import train_test_split
iris = pd.read_csv("plotly-datasets/iris.csv")  # download the iris dataset beforehand
iris["label"] = iris.Name
iris.head()

SepalLength	SepalWidth	PetalLength	PetalWidth	Name	label
0	5.1	3.5	1.4	0.2	Iris-setosa	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa	Iris-setosa
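If the CSV file is not at hand, a frame with the same columns can be built from the copy of iris that ships with scikit-learn. This is a sketch under the assumption that scikit-learn 0.23 or later is installed (for `as_frame=True`); the renamed columns and the "Iris-" prefix are added here only to match the layout shown above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data as a pandas DataFrame.
bunch = load_iris(as_frame=True)

# Rename scikit-learn's column names to the ones used in the CSV above.
iris = bunch.frame.rename(columns={
    "sepal length (cm)": "SepalLength",
    "sepal width (cm)": "SepalWidth",
    "petal length (cm)": "PetalLength",
    "petal width (cm)": "PetalWidth",
})

# Rebuild the string class name from the integer target, then drop the target.
iris["Name"] = ["Iris-" + bunch.target_names[t] for t in iris["target"]]
iris = iris.drop(columns="target")

iris["label"] = iris["Name"]
print(iris.head())
```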

(2) Mark the class attribute of the dataset as label and convert it to a numeric type

iris.groupby('Name').count().index

name_to_label = {
    'Iris-setosa':0,
    'Iris-versicolor':1,
    'Iris-virginica':2
}

iris['label'] = iris['Name'].map(name_to_label)
del iris['Name']

iris.head()

SepalLength	SepalWidth	PetalLength	PetalWidth	label
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0
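The hand-written dictionary works well for three classes. For more classes, the same mapping can be derived from the data. A minimal sketch using `pd.factorize` on a small made-up series (the sample values are assumptions, not rows from the CSV):

```python
import pandas as pd

# A few class names in arbitrary order, standing in for the Name column.
names = pd.Series(["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"])

# sort=True assigns codes in sorted order of the class names.
codes, classes = pd.factorize(names, sort=True)
name_to_label = {name: code for code, name in enumerate(classes)}
print(name_to_label)

# Series.map leaves NaN for any name missing from the dictionary,
# so it is worth checking that nothing was left unmapped.
labels = names.map(name_to_label)
assert not labels.isna().any()
```

With sorted codes the derived dictionary matches the manual one: setosa 0, versicolor 1, virginica 2.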

(3) Split the dataset and set the language

dftrain, dftest = train_test_split(iris, test_size=0.3)  # train:test = 7:3
dfeda = eda(dftrain,dftest,language="Chinese")
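Without a fixed seed, `train_test_split` draws a different random split on every run, so the summary numbers will differ slightly between runs. A minimal sketch of a reproducible, class-balanced split follows; it uses a small made-up frame rather than iris, and `random_state=42` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 30 rows, three classes with 10 rows each.
df = pd.DataFrame({"feature": range(30), "label": [0, 1, 2] * 10})

train, test = train_test_split(
    df,
    test_size=0.3,          # 70% train, 30% test
    random_state=42,        # fixed seed so the split is repeatable
    stratify=df["label"],   # keep class proportions equal in both parts
)

print(train["label"].value_counts().sort_index())
print(test["label"].value_counts().sort_index())
```

Stratifying keeps the label distribution of the two parts aligned, which matters when the train and test sets are compared against each other.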

dfeda

非空率	非空非零率	非空非零非负1率	取值类别数	众数	相关性	同分布性	取值分布
SepalLength	1.0	1.000000	1.000000	33	5.0	0.842995	0.888889	[(5.0, 9), (5.1, 7), (6.7, 7), (5.7, 7), (6.3,…
SepalWidth	1.0	1.000000	1.000000	22	3.0	0.615942	0.774603	[(3.0, 18), (3.2, 10), (3.1, 9), (3.4, 9), (2….
PetalLength	1.0	1.000000	1.000000	39	1.5	1.000000	0.917460	[(1.5, 9), (5.1, 7), (1.3, 7), (4.5, 6), (1.6,…
PetalWidth	1.0	1.000000	1.000000	22	0.2	1.000000	0.930159	[(0.2, 20), (1.5, 9), (1.8, 7), (1.4, 7), (1.3…
label	1.0	0.657143	0.657143	3	0.0	1.000000	0.968254	[(0.0, 36), (2.0, 35), (1.0, 34)]
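The label row has a non-null non-zero rate below 1 even though no values are missing. That follows from the encoding: Iris-setosa was mapped to 0, so those rows count as zeros. The figure can be reproduced from the value distribution printed in the table:

```python
# (value, count) pairs copied from the label row of the table above.
value_counts = [(0.0, 36), (2.0, 35), (1.0, 34)]

total = sum(count for _, count in value_counts)
nonzero = sum(count for value, count in value_counts if value != 0)
nonzero_rate = nonzero / total

print(total)                   # 105 rows, i.e. 70% of the 150 iris rows
print(round(nonzero_rate, 6))  # 0.657143
```

The same reasoning applies to any column where 0 or -1 is a legitimate value rather than a placeholder for missing data.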
