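Overview: simple per-column data statistics (non-null rate; non-null, non-zero rate; rate of values that are neither null, zero, nor -1; number of distinct values; mode; correlation; same-distribution score; value distribution)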
Install
Worked example with the iris dataset
(1) Prepare the data
    from easyeda import eda
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the iris CSV and copy the class name into a "label" column
    iris = pd.read_csv("plotly-datasets/iris.csv")
    iris["label"] = iris.Name
    iris.head()

|   | SepalLength | SepalWidth | PetalLength | PetalWidth | Name | label |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | Iris-setosa |
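
If the CSV file used above is not at hand, a frame with the same column layout can be sketched from the copy of iris bundled with scikit-learn. This is only an assumption-laden stand-in: the column renaming and the "Iris-" prefix are added here to mimic the CSV, and a few values in scikit-learn's copy may differ slightly from other distributions of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled dataset as pandas objects (requires scikit-learn >= 0.23)
bunch = load_iris(as_frame=True)

# Rename the feature columns to match the CSV header used in this post
iris = bunch.data.copy()
iris.columns = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

# Rebuild the class-name column; the "Iris-" prefix is added to mimic the CSV
iris["Name"] = ["Iris-" + bunch.target_names[t] for t in bunch.target]
iris["label"] = iris["Name"]

print(iris.head())
```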
(2) Mark the class attribute as label and convert it to a numeric type
    # List the distinct class names
    iris.groupby('Name').count().index

    # Map each class name to an integer label
    name_to_label = {
        'Iris-setosa': 0,
        'Iris-versicolor': 1,
        'Iris-virginica': 2
    }
    iris['label'] = iris['Name'].map(name_to_label)
    del iris['Name']

    iris.head()

|   | SepalLength | SepalWidth | PetalLength | PetalWidth | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
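
The hand-written dictionary works well for three classes. For a column with many classes, the same name-to-integer mapping could be derived automatically. The sketch below uses `pandas.factorize` on a small illustrative series (the four sample names are made up for the example, not taken from the file); with `sort=True` the codes follow alphabetical order, which happens to match the dictionary above.

```python
import pandas as pd

# A tiny illustrative sample of class names (not the real dataset)
names = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="Name",
)

# factorize returns integer codes plus the unique values they refer to
codes, classes = pd.factorize(names, sort=True)
label = pd.Series(codes, index=names.index, name="label")

# Keep the reverse mapping so predictions can be turned back into names
label_to_name = dict(enumerate(classes))

print(label.tolist())
print(label_to_name)
```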
(3) Split the dataset and set the language
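    # Hold out 30% of the rows as a test set; no random_state is set,
    # so the split (and the numbers below) will vary between runs
    dftrain, dftest = train_test_split(iris, test_size=0.3)

    # Build the report with Chinese column headers
    dfeda = eda(dftrain, dftest, language="Chinese")
    dfeda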

|   | 非空率 (non-null rate) | 非空非零率 (non-null, non-zero rate) | 非空非零非负1率 (non-null, non-zero, non-(-1) rate) | 取值类别数 (distinct values) | 众数 (mode) | 相关性 (correlation) | 同分布性 (same-distribution score) | 取值分布 (value distribution) |
|---|---|---|---|---|---|---|---|---|
| SepalLength | 1.0 | 1.000000 | 1.000000 | 33 | 5.0 | 0.842995 | 0.888889 | [(5.0, 9), (5.1, 7), (6.7, 7), (5.7, 7), (6.3,… |
| SepalWidth | 1.0 | 1.000000 | 1.000000 | 22 | 3.0 | 0.615942 | 0.774603 | [(3.0, 18), (3.2, 10), (3.1, 9), (3.4, 9), (2…. |
| PetalLength | 1.0 | 1.000000 | 1.000000 | 39 | 1.5 | 1.000000 | 0.917460 | [(1.5, 9), (5.1, 7), (1.3, 7), (4.5, 6), (1.6,… |
| PetalWidth | 1.0 | 1.000000 | 1.000000 | 22 | 0.2 | 1.000000 | 0.930159 | [(0.2, 20), (1.5, 9), (1.8, 7), (1.4, 7), (1.3… |
| label | 1.0 | 0.657143 | 0.657143 | 3 | 0.0 | 1.000000 | 0.968254 | [(0.0, 36), (2.0, 35), (1.0, 34)] |
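
To make the first five columns of the report concrete, here is a minimal sketch of how such statistics could be computed with plain pandas. It is not easyeda's implementation: the function name `basic_stats`, the English column names, and the toy data are all hypothetical, and the definitions are an assumption based on the column names. That reading is at least consistent with the `label` row above, where 36 of the 105 training rows are zero and 69/105 ≈ 0.657143.

```python
import numpy as np
import pandas as pd


def basic_stats(df):
    """Per-column summary; an assumed reading of the report's first columns."""
    not_null = df.notna()
    not_zero = not_null & df.ne(0)    # neither null nor 0
    not_neg1 = not_zero & df.ne(-1)   # neither null, 0, nor -1
    return pd.DataFrame({
        "not_null_rate": not_null.mean(),
        "not_null_not_zero_rate": not_zero.mean(),
        "not_null_not_zero_not_neg1_rate": not_neg1.mean(),
        "n_unique": df.nunique(),     # NaN is not counted as a value
        "mode": df.mode().iloc[0],    # smallest value if there is a tie
    })


# Toy frame with one null, one zero and one -1 in column "x"
toy = pd.DataFrame({"x": [1.0, 0.0, -1.0, np.nan], "label": [0, 1, 1, 2]})
stats = basic_stats(toy)
print(stats)
```

The correlation, same-distribution, and value-distribution columns are left out because the post does not say how they are computed.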