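2.1 EDA Goals
- The main value of EDA is to get familiar with the dataset, understand it, and validate it, confirming that the data obtained can be used for the machine learning or deep learning steps that follow.
- Once the dataset is understood, the next step is to examine the relationships among the variables and between each variable and the target.
- EDA guides data science practitioners through data processing and feature engineering, so that the structure of the dataset and the feature set make the downstream prediction problem more reliable.
- Complete an exploratory analysis of the data, summarize it with charts or text, and post the result as a check-in.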
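2.2 Contents
- Load the data science and visualization libraries:
  - Data science libraries: pandas, numpy, scipy;
  - Visualization libraries: matplotlib, seaborn;
  - Others;
- Load the data:
  - Load the training and test sets;
  - Take a quick look at the data (head() + shape);
- Data overview:
  - Use describe() to get familiar with the summary statistics
  - Use info() to check the data types
- Check for missing values and anomalies
  - Check each column for NaN
  - Outlier detection
- Understand the distribution of the target
  - Overall distribution (unbounded Johnson distribution, etc.)
  - Check skewness and kurtosis
  - Check the frequency counts of the target
- Split the features into categorical and numeric features, and check the unique-value distribution of the categorical ones
- Numeric feature analysis
  - Correlation analysis
  - Skewness and kurtosis of several features
  - Distribution plot for each numeric feature
  - Pairwise relationships among numeric features
  - Pairwise regression plots
- Categorical feature analysis
  - Unique-value distribution
  - Box plots
  - Violin plots
  - Bar plots
  - Per-category frequency plots (count_plot)
- Generate a data report with pandas_profiling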
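Importing packages
# coding: utf-8
# Install missingno (used below for missing-value visualization)
!pip install missingno
# Import the warnings module and use its filter to suppress warning messages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # missing-value analysis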
Loading the data
## 1) Load the training and test sets
Train_data = pd.read_csv('/home/ach/下載/數據挖掘/used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('/home/ach/下載/數據挖掘/used_car_testA_20200313.csv', sep=' ')
All features have been anonymized (the field list below is for reference)
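- name - vehicle code
- regDate - vehicle registration date
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox
- power - engine power
- kilometer - distance driven
- notRepairedDamage - whether the vehicle has unrepaired damage
- regionCode - region code of the viewing location
- seller - seller
- offerType - offer type
- creatDate - date the listing was posted
- price - vehicle price
- 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' (embedding vectors derived from a large amount of information such as reviews and tags) [engineered anonymous features]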
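## 2) A quick look at the data (head() + shape)
# head() returns the first n rows and tail() returns the last n rows.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to stack the two pieces.
pd.concat([Train_data.head(), Train_data.tail()])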
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
149995 | 149995 | 163978 | 20000607 | 121.0 | 10 | 4.0 | 0.0 | 1.0 | 163 | 15.0 | ... | 0.280264 | 0.000310 | 0.048441 | 0.071158 | 0.019174 | 1.988114 | -2.983973 | 0.589167 | -1.304370 | -0.302592 |
149996 | 149996 | 184535 | 20091102 | 116.0 | 11 | 0.0 | 0.0 | 0.0 | 125 | 10.0 | ... | 0.253217 | 0.000777 | 0.084079 | 0.099681 | 0.079371 | 1.839166 | -2.774615 | 2.553994 | 0.924196 | -0.272160 |
149997 | 149997 | 147587 | 20101003 | 60.0 | 11 | 1.0 | 1.0 | 0.0 | 90 | 6.0 | ... | 0.233353 | 0.000705 | 0.118872 | 0.100118 | 0.097914 | 2.439812 | -1.630677 | 2.290197 | 1.891922 | 0.414931 |
149998 | 149998 | 45907 | 20060312 | 34.0 | 10 | 3.0 | 1.0 | 0.0 | 156 | 15.0 | ... | 0.256369 | 0.000252 | 0.081479 | 0.083558 | 0.081498 | 2.075380 | -2.633719 | 1.414937 | 0.431981 | -1.659014 |
149999 | 149999 | 177672 | 19990204 | 19.0 | 28 | 6.0 | 0.0 | 1.0 | 193 | 12.5 | ... | 0.284475 | 0.000000 | 0.040072 | 0.062543 | 0.025819 | 1.978453 | -3.179913 | 0.031724 | -1.483350 | -0.342674 |
10 rows × 31 columns
Train_data.shape
(150000, 31)
pd.concat([Test_data.head(), Test_data.tail()])
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 150000 | 66932 | 20111212 | 222.0 | 4 | 5.0 | 1.0 | 1.0 | 313 | 15.0 | ... | 0.264405 | 0.121800 | 0.070899 | 0.106558 | 0.078867 | -7.050969 | -0.854626 | 4.800151 | 0.620011 | -3.664654 |
1 | 150001 | 174960 | 19990211 | 19.0 | 21 | 0.0 | 0.0 | 0.0 | 75 | 12.5 | ... | 0.261745 | 0.000000 | 0.096733 | 0.013705 | 0.052383 | 3.679418 | -0.729039 | -3.796107 | -1.541230 | -0.757055 |
2 | 150002 | 5356 | 20090304 | 82.0 | 21 | 0.0 | 0.0 | 0.0 | 109 | 7.0 | ... | 0.260216 | 0.112081 | 0.078082 | 0.062078 | 0.050540 | -4.926690 | 1.001106 | 0.826562 | 0.138226 | 0.754033 |
3 | 150003 | 50688 | 20100405 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | 160 | 7.0 | ... | 0.260466 | 0.106727 | 0.081146 | 0.075971 | 0.048268 | -4.864637 | 0.505493 | 1.870379 | 0.366038 | 1.312775 |
4 | 150004 | 161428 | 19970703 | 26.0 | 14 | 2.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0.250999 | 0.000000 | 0.077806 | 0.028600 | 0.081709 | 3.616475 | -0.673236 | -3.197685 | -0.025678 | -0.101290 |
49995 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | ... | 0.284664 | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 |
49996 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0.268101 | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 |
49997 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | ... | 0.269432 | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 |
49998 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | ... | 0.261152 | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 |
49999 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | ... | 0.228730 | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 |
10 rows × 30 columns
Test_data.shape
(50000, 30)
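The two shapes differ by one column (31 vs. 30). A minimal sketch for confirming which column is absent from the test set is shown here. It uses two small made-up frames in place of `Train_data` and `Test_data`, and the helper name `column_diff` is hypothetical.

```python
import pandas as pd

def column_diff(train: pd.DataFrame, test: pd.DataFrame):
    """Return (columns only in train, columns only in test), preserving order."""
    only_train = [c for c in train.columns if c not in test.columns]
    only_test = [c for c in test.columns if c not in train.columns]
    return only_train, only_test

# Toy stand-ins for the real data (values are illustrative only)
toy_train = pd.DataFrame({"SaleID": [0, 1], "power": [60, 0], "price": [1850, 3600]})
toy_test = pd.DataFrame({"SaleID": [150000, 150001], "power": [313, 75]})

only_train, only_test = column_diff(toy_train, toy_test)
print("only in train:", only_train)
print("only in test:", only_test)
```

Called as `column_diff(Train_data, Test_data)`, this should report the target column `price` as the only difference, which matches the `info()` listings further down.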
Data overview
## 1) Use describe() to get familiar with the summary statistics
Train_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 1.500000e+05 | 149999.000000 | 150000.000000 | 145494.000000 | 141320.000000 | 144019.000000 | 150000.000000 | 150000.000000 | ... | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 |
mean | 74999.500000 | 68349.172873 | 2.003417e+07 | 47.129021 | 8.052733 | 1.792369 | 0.375842 | 0.224943 | 119.316547 | 12.597160 | ... | 0.248204 | 0.044923 | 0.124692 | 0.058144 | 0.061996 | -0.001000 | 0.009035 | 0.004813 | 0.000313 | -0.000688 |
std | 43301.414527 | 61103.875095 | 5.364988e+04 | 49.536040 | 7.864956 | 1.760640 | 0.548677 | 0.417546 | 177.168419 | 3.919576 | ... | 0.045804 | 0.051743 | 0.201410 | 0.029186 | 0.035692 | 3.772386 | 3.286071 | 2.517478 | 1.288988 | 1.038685 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.168192 | -5.558207 | -9.639552 | -4.153899 | -6.546556 |
25% | 37499.750000 | 11156.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243615 | 0.000038 | 0.062474 | 0.035334 | 0.033930 | -3.722303 | -1.951543 | -1.871846 | -1.057789 | -0.437034 |
50% | 74999.500000 | 51638.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 110.000000 | 15.000000 | ... | 0.257798 | 0.000812 | 0.095866 | 0.057014 | 0.058484 | 1.624076 | -0.358053 | -0.130753 | -0.036245 | 0.141246 |
75% | 112499.250000 | 118841.250000 | 2.007111e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265297 | 0.102009 | 0.125243 | 0.079382 | 0.087491 | 2.844357 | 1.255022 | 1.776933 | 0.942813 | 0.680378 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 19312.000000 | 15.000000 | ... | 0.291838 | 0.151420 | 1.404936 | 0.160791 | 0.222787 | 12.357011 | 18.819042 | 13.847792 | 11.147669 | 8.658418 |
8 rows × 30 columns
Test_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 5.000000e+04 | 50000.000000 | 50000.000000 | 48587.000000 | 47107.000000 | 48090.000000 | 50000.000000 | 50000.000000 | ... | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
mean | 174999.500000 | 68542.223280 | 2.003393e+07 | 46.844520 | 8.056240 | 1.782185 | 0.373405 | 0.224350 | 119.883620 | 12.595580 | ... | 0.248669 | 0.045021 | 0.122744 | 0.057997 | 0.062000 | -0.017855 | -0.013742 | -0.013554 | -0.003147 | 0.001516 |
std | 14433.901067 | 61052.808133 | 5.368870e+04 | 49.469548 | 7.819477 | 1.760736 | 0.546442 | 0.417158 | 185.097387 | 3.908979 | ... | 0.044601 | 0.051766 | 0.195972 | 0.029211 | 0.035653 | 3.747985 | 3.231258 | 2.515962 | 1.286597 | 1.027360 |
min | 150000.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.160049 | -5.411964 | -8.916949 | -4.123333 | -6.112667 |
25% | 162499.750000 | 11203.500000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243762 | 0.000044 | 0.062644 | 0.035084 | 0.033714 | -3.700121 | -1.971325 | -1.876703 | -1.060428 | -0.437920 |
50% | 174999.500000 | 52248.500000 | 2.003091e+07 | 29.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 109.000000 | 15.000000 | ... | 0.257877 | 0.000815 | 0.095828 | 0.057084 | 0.058764 | 1.613212 | -0.355843 | -0.142779 | -0.035956 | 0.138799 |
75% | 187499.250000 | 118856.500000 | 2.007110e+07 | 65.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265328 | 0.102025 | 0.125438 | 0.079077 | 0.087489 | 2.832708 | 1.262914 | 1.764335 | 0.941469 | 0.681163 |
max | 199999.000000 | 196805.000000 | 2.015121e+07 | 246.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 20000.000000 | 15.000000 | ... | 0.291618 | 0.153265 | 1.358813 | 0.156355 | 0.214775 | 12.338872 | 18.856218 | 12.950498 | 5.913273 | 2.624622 |
8 rows × 29 columns
## 2) Use info() to check the data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID 50000 non-null int64
name 50000 non-null int64
regDate 50000 non-null int64
model 50000 non-null float64
brand 50000 non-null int64
bodyType 48587 non-null float64
fuelType 47107 non-null float64
gearbox 48090 non-null float64
power 50000 non-null int64
kilometer 50000 non-null float64
notRepairedDamage 50000 non-null object
regionCode 50000 non-null int64
seller 50000 non-null int64
offerType 50000 non-null int64
creatDate 50000 non-null int64
v_0 50000 non-null float64
v_1 50000 non-null float64
v_2 50000 non-null float64
v_3 50000 non-null float64
v_4 50000 non-null float64
v_5 50000 non-null float64
v_6 50000 non-null float64
v_7 50000 non-null float64
v_8 50000 non-null float64
v_9 50000 non-null float64
v_10 50000 non-null float64
v_11 50000 non-null float64
v_12 50000 non-null float64
v_13 50000 non-null float64
v_14 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
Missing-value and outlier analysis
## 1) Check each column for NaN
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
Test_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 1413
fuelType 2893
gearbox 1910
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
# Visualize the NaN counts
missing = Train_data.isnull().sum()
missing = missing[missing > 0]  # keep only the columns with at least one NaN
missing.sort_values(inplace=True)  # sort in ascending order
missing.plot.bar()
[Figure: output_19_1.png, bar chart of NaN counts per column in the training set]
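- The two steps above show at a glance which columns contain NaN and how many. The main question is whether the number of NaN values is really large. If it is small, the usual choice is to fill them in; with tree models such as lgb, the gaps can be left as they are and the tree handles them on its own; if a column has too many NaN values, dropping it is worth considering.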
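Because the choice between filling, leaving, and dropping depends on the share of missing values, it can help to look at the fraction as well as the raw count. The following is a minimal sketch on a toy frame; the helper `missing_summary` is a hypothetical name and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of NaN per column, keeping only columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "frac_missing": n_missing / len(df),
    })
    # Most affected columns first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy data standing in for Train_data
toy = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, 0.0],
    "fuelType": [np.nan, np.nan, 0.0, 1.0],
    "power": [60, 0, 163, 193],
})
summary = missing_summary(toy)
print(summary)
```

What counts as "too many" is a judgment call that depends on the model and the feature, so the sketch deliberately reports the fraction without applying a cutoff.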
# Visualize the missing values
msno.matrix(Train_data.sample(250))
[Figure: output_21_1.png, msno.matrix of 250 sampled training rows]
msno.bar(Train_data.sample(1000))
[Figure: output_22_1.png, msno.bar of 1000 sampled training rows]
# Visualize the missing values
msno.matrix(Test_data.sample(250))
[Figure: output_23_1.png, msno.matrix of 250 sampled test rows]
msno.bar(Test_data.sample(1000))
[Figure: output_24_1.png, msno.bar of 1000 sampled test rows]
The missing-value pattern in the test set is similar to that of the training set. According to the isnull() counts above, the training set has NaN in four columns (model, bodyType, fuelType, gearbox) and the test set in three (bodyType, fuelType, gearbox), with fuelType missing most often. Once the '-' placeholder in notRepairedDamage is treated as missing (see the outlier step below), notRepairedDamage becomes the column with the most missing values.
Outlier detection
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Every column except notRepairedDamage is numeric; notRepairedDamage has dtype object. Listing its distinct values shows why.
Train_data['notRepairedDamage'].value_counts()
0.0 111361
- 24324
1.0 14315
Name: notRepairedDamage, dtype: int64
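- The '-' value is also a missing-value marker. Many models handle NaN directly, so nothing is imputed here; the marker is simply replaced with NaN for now.
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)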
Train_data['notRepairedDamage'].value_counts()
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 24324
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
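The replacement above is applied to `Train_data` only, while the test-set value counts later in this section still list '-' as a value. A minimal sketch of a reusable cleaning step that could be applied to either frame follows. The helper name `clean_not_repaired` is hypothetical, the toy values are made up, and converting the column to a numeric dtype is an extra step that the original notebook does not show.

```python
import numpy as np
import pandas as pd

def clean_not_repaired(df: pd.DataFrame, col: str = "notRepairedDamage") -> pd.DataFrame:
    """Return a copy in which the '-' placeholder in `col` is NaN and the rest is numeric."""
    out = df.copy()
    out[col] = pd.to_numeric(out[col].replace("-", np.nan), errors="coerce")
    return out

# Toy column with the same three raw values seen in the data
toy = pd.DataFrame({"notRepairedDamage": ["0.0", "-", "1.0", "0.0"]})
cleaned = clean_not_repaired(toy)
print(cleaned["notRepairedDamage"].isnull().sum())
print(cleaned["notRepairedDamage"].dtype)
```

Returning a copy leaves the input frame untouched, which makes it easy to run the same function on the training and test frames without side effects.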
Checking for heavily imbalanced features
Train_data["seller"].value_counts()
0 149999
1 1
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0 150000
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
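`seller` and `offerType` are dropped because almost every row holds the same value, so they carry essentially no information for prediction. A sketch for flagging such columns automatically is given here. The function name `near_constant_columns` and the 0.99 threshold are assumptions chosen for illustration.

```python
import pandas as pd

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most common value (NaN counted as a value)
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy frame mimicking the situation in the data
toy = pd.DataFrame({
    "seller": [0] * 999 + [1],   # a single row differs
    "offerType": [0] * 1000,     # fully constant
    "gearbox": [0, 1] * 500,     # balanced
})
flagged = near_constant_columns(toy)
print(flagged)
```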
Distribution of the target
Train_data['price']
0 1850
1 3600
2 6222
3 2400
4 5200
...
149995 5900
149996 9500
149997 7500
149998 4999
149999 4700
Name: price, Length: 150000, dtype: int64
Train_data['price'].value_counts()
500 2337
1500 2158
1200 1922
1000 1850
2500 1821
...
25321 1
8886 1
8801 1
37920 1
8188 1
Name: price, Length: 3763, dtype: int64
## 1) Overall distribution (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
[Figure: output_41_1.png, price histogram with Johnson SU fit]
[Figure: output_41_2.png, price histogram with normal fit]
[Figure: output_41_3.png, price histogram with log-normal fit]
- Price does not follow a normal distribution, so it should be transformed before running a regression. The log transform does well, but the best fit is the unbounded Johnson (Johnson SU) distribution.
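The comparison above is visual. One way to put a number on it is to fit each candidate by maximum likelihood and compare the log-likelihoods. The sketch below does this on synthetic right-skewed data, not on the competition prices. Fixing `floc=0` for the log-normal fit is an assumption made here for numerical stability.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices"; NOT the competition data
y = rng.lognormal(mean=8.0, sigma=1.0, size=2000)

# (distribution, extra keyword arguments for fit)
candidates = {
    "johnsonsu": (st.johnsonsu, {}),
    "norm": (st.norm, {}),
    "lognorm": (st.lognorm, {"floc": 0}),  # location fixed at 0 (assumption)
}

log_likelihood = {}
for name, (dist, fit_kwargs) in candidates.items():
    params = dist.fit(y, **fit_kwargs)  # maximum-likelihood estimates
    log_likelihood[name] = float(np.sum(dist.logpdf(y, *params)))

# Higher log-likelihood means a closer fit to this sample
for name, ll in sorted(log_likelihood.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} log-likelihood = {ll:.1f}")
```

Johnson SU has four parameters against two for the normal, so part of any advantage it shows comes from the extra flexibility; the numbers are a rough guide rather than a formal test.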
## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.346487
Kurtosis: 18.995183
[Figure: output_43_1.png, distribution of price]
Train_data.skew(), Train_data.kurt()
(SaleID 6.017846e-17
name 5.576058e-01
regDate 2.849508e-02
model 1.484388e+00
brand 1.150760e+00
bodyType 9.915299e-01
fuelType 1.595486e+00
gearbox 1.317514e+00
power 6.586318e+01
kilometer -1.525921e+00
notRepairedDamage 2.430640e+00
regionCode 6.888812e-01
creatDate -7.901331e+01
price 3.346487e+00
v_0 -1.316712e+00
v_1 3.594543e-01
v_2 4.842556e+00
v_3 1.062920e-01
v_4 3.679890e-01
v_5 -4.737094e+00
v_6 3.680730e-01
v_7 5.130233e+00
v_8 2.046133e-01
v_9 4.195007e-01
v_10 2.522046e-02
v_11 3.029146e+00
v_12 3.653576e-01
v_13 2.679152e-01
v_14 -1.186355e+00
dtype: float64, SaleID -1.200000
name -1.039945
regDate -0.697308
model 1.740483
brand 1.076201
bodyType 0.206937
fuelType 5.880049
gearbox -0.264161
power 5733.451054
kilometer 1.141934
notRepairedDamage 3.908072
regionCode -0.340832
creatDate 6881.080328
price 18.995183
v_0 3.993841
v_1 -1.753017
v_2 23.860591
v_3 -0.418006
v_4 -0.197295
v_5 22.934081
v_6 -1.742567
v_7 25.845489
v_8 -0.636225
v_9 -0.321491
v_10 -0.577935
v_11 12.568731
v_12 0.268937
v_13 -0.438274
v_14 2.393526
dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
[Figure: output_45_1.png, distribution of the per-column skewness values]
sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')
[Figure: output_46_1.png, distribution of the per-column kurtosis values]
For an explanation of skew and kurt, see https://www.cnblogs.com/wyy1480/p/10474046.html
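As a quick reference for reading these numbers: pandas' `skew()` is close to 0 for a symmetric distribution and positive for a long right tail, and `kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution gives a value near 0. The sketch below illustrates this on two synthetic samples.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
symmetric = pd.Series(rng.normal(size=100_000))          # bell-shaped
right_skewed = pd.Series(rng.exponential(size=100_000))  # long right tail

for name, s in [("normal", symmetric), ("exponential", right_skewed)]:
    print(f"{name:12s} skew = {s.skew():6.3f}   excess kurtosis = {s.kurt():6.3f}")
```

Against that baseline, the price skewness of about 3.35 and kurtosis of about 19 reported above point to a strongly right-skewed, heavy-tailed target.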
## 3) Check the frequency counts of the target
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
[Figure: output_48_0.png, histogram of price]
Looking at the frequencies, very few values exceed 20000. These could also be treated as special values (outliers) and either imputed or dropped; that should be done in an earlier step.
# After a log transform the distribution is much more even, so the log of the price can be used as the prediction target; this is a common trick in prediction problems
plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
[Figure: output_50_0.png, histogram of log(price)]
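When a model is trained on the log of the price, its predictions have to be mapped back to the original scale. The following is a minimal sketch of that round trip on a few toy prices. It uses `np.log1p` and `np.expm1` rather than the plain `np.log` from the cell above, a substitution made here so that a zero price would not produce `-inf`.

```python
import numpy as np
import pandas as pd

# Toy prices with a long right tail; stand-in for Train_data['price']
price = pd.Series([500, 1200, 1850, 3600, 6222, 9500, 37920, 99999], dtype=float)

log_price = np.log1p(price)     # log(1 + price): the model would be trained on this
restored = np.expm1(log_price)  # exp(x) - 1: apply this to the model's predictions

print("skew before:", round(price.skew(), 2), "after:", round(log_price.skew(), 2))
```

Note that an error metric computed on the log scale is not the same as one computed on the original scale, so the back-transformed predictions should be evaluated with whatever metric the task actually uses.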
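Feature type analysis
Columns
- name - vehicle code
- regDate - vehicle registration date
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox
- power - engine power
- kilometer - distance driven
- notRepairedDamage - whether the vehicle has unrepaired damage
- regionCode - region code of the viewing location
- seller - seller [deleted]
- offerType - offer type [deleted]
- creatDate - date the listing was posted
- price - vehicle price
- 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' (embedding vectors derived from a large amount of information such as reviews and tags) [engineered anonymous features]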
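# Separate the label, i.e. the target
Y_train = Train_data['price']
# The dtype-based split below works for data that has not been label-encoded.
# It does not apply here, so the split has to be made by hand from what each column means.
# Numeric features
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # Categorical features
# categorical_features = Train_data.select_dtypes(include=['object'])
# categorical_features.columns
# Numeric and categorical features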
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
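The reason the dtype-based split does not work here is that most categorical columns are already label-encoded as numbers. The sketch below shows the effect on a small made-up frame that borrows a few column names from the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": [6, 1, 15, 10, 5, 6],               # label-encoded category
    "gearbox": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],   # label-encoded category
    "power": [60, 0, 163, 193, 68, 75],          # genuine measurement
    "notRepairedDamage": ["0.0", "-", "0.0", "1.0", "0.0", "0.0"],
})

# dtype-based split: anything stored as a number is treated as numeric
by_dtype_numeric = toy.select_dtypes(include="number").columns.tolist()
by_dtype_categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(by_dtype_numeric)      # brand and gearbox land here although they are categories
print(by_dtype_categorical)
```

`brand` and `gearbox` end up on the numeric side, which is why the two lists above are written out by hand.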
# Unique-value distribution of each categorical feature (training set)
for cat_fea in categorical_features:
    print(cat_fea + " value distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
name value distribution:
name has 99662 distinct values
708 282
387 282
55 280
1541 263
203 233
...
5074 1
7123 1
11221 1
13270 1
174485 1
Name: name, Length: 99662, dtype: int64
model value distribution:
model has 248 distinct values
0.0 11762
19.0 9573
4.0 8445
1.0 6038
29.0 5186
...
245.0 2
209.0 2
240.0 2
242.0 2
247.0 1
Name: model, Length: 248, dtype: int64
brand value distribution:
brand has 40 distinct values
0 31480
4 16737
14 16089
10 14249
1 13794
6 10217
9 7306
5 4665
13 3817
11 2945
3 2461
7 2361
16 2223
8 2077
25 2064
27 2053
21 1547
15 1458
19 1388
20 1236
12 1109
22 1085
26 966
30 940
17 913
24 772
28 649
32 592
29 406
37 333
2 321
31 318
18 316
36 228
34 227
33 218
23 186
35 180
38 65
39 9
Name: brand, dtype: int64
bodyType value distribution:
bodyType has 8 distinct values
0.0 41420
1.0 35272
2.0 30324
3.0 13491
4.0 9609
5.0 7607
6.0 6482
7.0 1289
Name: bodyType, dtype: int64
fuelType value distribution:
fuelType has 7 distinct values
0.0 91656
1.0 46991
2.0 2212
3.0 262
4.0 118
5.0 45
6.0 36
Name: fuelType, dtype: int64
gearbox value distribution:
gearbox has 2 distinct values
0.0 111623
1.0 32396
Name: gearbox, dtype: int64
notRepairedDamage value distribution:
notRepairedDamage has 2 distinct values
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
regionCode value distribution:
regionCode has 7905 distinct values
419 369
764 258
125 137
176 136
462 134
...
6414 1
7063 1
4239 1
5931 1
7267 1
Name: regionCode, Length: 7905, dtype: int64
# Unique-value distribution of each categorical feature (test set)
for cat_fea in categorical_features:
    print(cat_fea + " value distribution:")
    print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
name value distribution:
name has 37453 distinct values
55 97
708 96
387 95
1541 88
713 74
..
22270 1
89855 1
42752 1
48899 1
11808 1
Name: name, Length: 37453, dtype: int64
model value distribution:
model has 247 distinct values
0.0 3896
19.0 3245
4.0 3007
1.0 1981
29.0 1742
...
242.0 1
240.0 1
244.0 1
243.0 1
246.0 1
Name: model, Length: 247, dtype: int64
brand value distribution:
brand has 40 distinct values
0 10348
4 5763
14 5314
10 4766
1 4532
6 3502
9 2423
5 1569
13 1245
11 919
7 795
3 773
16 771
8 704
25 695
27 650
21 544
15 511
20 450
19 450
12 389
22 363
30 324
17 317
26 303
24 268
28 225
32 193
29 117
31 115
18 106
2 104
37 92
34 77
33 76
36 67
23 62
35 53
38 23
39 2
Name: brand, dtype: int64
bodyType value distribution:
bodyType has 8 distinct values
0.0 13985
1.0 11882
2.0 9900
3.0 4433
4.0 3303
5.0 2537
6.0 2116
7.0 431
Name: bodyType, dtype: int64
fuelType value distribution:
fuelType has 7 distinct values
0.0 30656
1.0 15544
2.0 774
3.0 72
4.0 37
6.0 14
5.0 10
Name: fuelType, dtype: int64
gearbox value distribution:
gearbox has 2 distinct values
0.0 37301
1.0 10789
Name: gearbox, dtype: int64
notRepairedDamage value distribution:
notRepairedDamage has 3 distinct values
0.0 37249
- 8031
1.0 4720
Name: notRepairedDamage, dtype: int64
regionCode value distribution:
regionCode has 6971 distinct values
419 146
764 78
188 52
125 51
759 51
...
7753 1
7463 1
7230 1
826 1
112 1
Name: regionCode, Length: 6971, dtype: int64
Numeric feature analysis
numeric_features.append('price')
numeric_features
['power',
'kilometer',
'v_0',
'v_1',
'v_2',
'v_3',
'v_4',
'v_5',
'v_6',
'v_7',
'v_8',
'v_9',
'v_10',
'v_11',
'v_12',
'v_13',
'v_14',
'price']
Train_data.head()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
5 rows × 29 columns
## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price 1.000000
v_12 0.692823
v_8 0.685798
v_0 0.628397
power 0.219834
v_5 0.164317
v_2 0.085322
v_6 0.068970
v_1 0.060914
v_14 0.035911
v_13 -0.013993
v_7 -0.053024
v_4 -0.147085
v_9 -0.206205
v_10 -0.246175
v_11 -0.275320
kilometer -0.440519
v_3 -0.730946
Name: price, dtype: float64
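In the listing above, the strongest relationship is the negative one with `v_3` (about -0.73), which ends up at the bottom of a plain descending sort. Ranking by absolute value puts strong positive and strong negative correlations side by side. The sketch below uses synthetic columns (`x_pos`, `x_neg`, `noise` are made-up names) rather than the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
x_pos = rng.normal(size=n)
x_neg = rng.normal(size=n)
toy = pd.DataFrame({
    "x_pos": x_pos,
    "x_neg": x_neg,
    "noise": rng.normal(size=n),
    # Synthetic target: weak positive link to x_pos, strong negative link to x_neg
    "price": 1.0 * x_pos - 2.0 * x_neg + 0.5 * rng.normal(size=n),
})

corr_with_price = toy.corr()["price"].drop("price")
# Reorder by magnitude while keeping the sign of each correlation
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

Pearson correlation only measures linear association, so a feature with a low value here can still be useful to a non-linear model.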
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
[Figure: output_63_1.png, heatmap of correlations among the numeric features and price]
del price_numeric['price']
## 2) Skewness and kurtosis of several features
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) ,
          ' ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
          )
power Skewness: 65.86 Kurtosis: 5733.45
kilometer Skewness: -1.53 Kurtosis: 001.14
v_0 Skewness: -1.32 Kurtosis: 003.99
v_1 Skewness: 00.36 Kurtosis: -01.75
v_2 Skewness: 04.84 Kurtosis: 023.86
v_3 Skewness: 00.11 Kurtosis: -00.42
v_4 Skewness: 00.37 Kurtosis: -00.20
v_5 Skewness: -4.74 Kurtosis: 022.93
v_6 Skewness: 00.37 Kurtosis: -01.74
v_7 Skewness: 05.13 Kurtosis: 025.85
v_8 Skewness: 00.20 Kurtosis: -00.64
v_9 Skewness: 00.42 Kurtosis: -00.32
v_10 Skewness: 00.03 Kurtosis: -00.58
v_11 Skewness: 03.03 Kurtosis: 012.57
v_12 Skewness: 00.37 Kurtosis: 000.27
v_13 Skewness: 00.27 Kurtosis: -00.44
v_14 Skewness: -1.19 Kurtosis: 002.39
price Skewness: 03.35 Kurtosis: 019.00
## 3) Distribution plot for each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
[Figure: output_66_0.png, distribution of each numeric feature]
The anonymous features appear to be relatively evenly distributed.
## 4) Pairwise relationships among numeric features
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
# seaborn renamed the `size` argument to `height` in version 0.9
sns.pairplot(Train_data[columns], height = 2 ,kind ='scatter',diag_kind='kde')
plt.show()
[Figure: output_68_0.png, pair plot of price and the selected numeric features]
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
dtype='object')
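The pair plot above shows the relationships among multiple variables. For more on visualization, this article is a good reference: https://www.jianshu.com/p/6e18d21a4cad
## 5) Pairwise regression plots against price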
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)
v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)
v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)
power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)
v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)
v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)
v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)
v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)
v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)
v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
[Figure output_71_1.png: regression plots of price against the features above]
Categorical feature analysis
## 1) Distribution of unique values
for fea in categorical_features:
    print(Train_data[fea].nunique())
99662
248
40
8
7
2
2
7905
categorical_features
['name',
'model',
'brand',
'bodyType',
'fuelType',
'gearbox',
'notRepairedDamage',
'regionCode']
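The unique counts above are what motivate dropping the sparsest features before plotting. As a minimal sketch, assuming a made-up toy frame and a hypothetical cutoff `MAX_LEVELS` (neither comes from the original notebook, which picks its features by hand), the same selection could be expressed in code:

```python
import pandas as pd

# Toy stand-in for Train_data: column names mirror the article, values are invented.
toy = pd.DataFrame({
    "name": [101, 102, 103, 104, 105, 106],  # nearly one distinct value per row
    "gearbox": [0, 1, 0, 0, 1, 0],           # two levels
    "fuelType": [0, 1, 2, 0, 1, 0],          # three levels
})

MAX_LEVELS = 4  # hypothetical cutoff; choose it per dataset

# Number of distinct values per column, then keep only the low-cardinality columns.
cardinality = toy.nunique()
plot_features = cardinality[cardinality <= MAX_LEVELS].index.tolist()

print(cardinality.to_dict())
print(plot_features)
```

A fixed cutoff is only a rule of thumb. The article itself keeps `model` (248 levels) while dropping `name` and `regionCode`, so the threshold should follow what remains readable in a plot.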
## 2) Box plots of categorical features
# name and regionCode have too many sparse categories, so we plot only the remaining, less sparse features
categorical_features = ['model',
'brand',
'bodyType',
'fuelType',
'gearbox',
'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():  # True if the column has at least one missing value
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])  # register a new category
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# height replaces the size argument, which was deprecated in seaborn 0.9
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
# map() calls boxplot once per facet, passing that facet's "value" and "price" columns as x and y
g = g.map(boxplot, "value", "price")
[Figure output_75_0.png: box plots of price for each categorical feature]
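The `pd.melt` plus `FacetGrid.map` combination in the cell above is easier to follow on a tiny table. The sketch below uses an invented three-row frame (the values are assumptions, only the column names mirror the article) and replaces the plotting step with a plain `groupby` loop that mimics how the grid splits the data, so it runs without drawing anything:

```python
import pandas as pd

# Invented toy data; only the column names come from the article.
toy = pd.DataFrame({
    "price": [1000, 2500, 4000],
    "gearbox": ["manual", "auto", "manual"],
    "fuelType": ["petrol", "diesel", "petrol"],
})

# Wide -> long: one row per (original row, categorical column) pair.
# The result has the columns price, variable, value.
long_df = pd.melt(toy, id_vars=["price"], value_vars=["gearbox", "fuelType"])
print(long_df)

# FacetGrid(col="variable") creates one panel per distinct "variable", and
# g.map(func, "value", "price") calls func(x, y) once per panel with that
# panel's "value" and "price" columns. This loop imitates the split.
calls = {}
for variable, group in long_df.groupby("variable", sort=False):
    calls[variable] = list(zip(group["value"], group["price"]))
print(calls)
```

Each entry of `calls` corresponds to what one panel's plotting function would receive, which is why the custom `boxplot(x, y, **kwargs)` wrapper only needs to accept two positional arguments.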
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
dtype='object')
## 3) Violin plots of categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()
[Figures output_77_0.png to output_77_5.png: violin plots of price for each categorical feature]
categorical_features = ['model',
'brand',
'bodyType',
'fuelType',
'gearbox',
'notRepairedDamage']
## 4) Bar plots of categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# height replaces the size argument, which was deprecated in seaborn 0.9
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")
[Figure output_79_0.png: bar plots of price for each categorical feature]
## 5) Frequency of each category per categorical feature (count_plot)
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
# height replaces the size argument, which was deprecated in seaborn 0.9
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")
[Figure output_80_0.png: count plots of each categorical feature]
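The frequencies shown in a count plot can also be read as numbers. The following sketch, on an invented six-value series named after the article's `gearbox` column, repeats the `'MISSING'` category step used earlier and then counts each level with `value_counts`, so missing entries appear as their own bar-equivalent row:

```python
import numpy as np
import pandas as pd

# Invented values; NaN marks missing entries.
s = pd.Series([0.0, 1.0, np.nan, 0.0, np.nan, 0.0], name="gearbox")

# Same pattern as the article: convert to category, then give NaN its own level.
s = s.astype("category")
if s.isnull().any():
    s = s.cat.add_categories(["MISSING"]).fillna("MISSING")

# Frequency of every level, including the added MISSING level.
counts = s.value_counts()
count_by_level = counts.to_dict()
print(count_by_level)
```

Adding the category before calling `fillna` matters: a categorical column only accepts fill values that are already registered as categories.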
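Generating a data report with pandas_profiling
pandas_profiling produces a fairly comprehensive visual data report with little effort. Once it finishes, open the generated HTML file.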
!pip install pandas_profiling
import pandas_profiling
Successfully installed astropy-4.0 attr-0.3.1 confuse-1.0.0 htmlmin-0.1.12 jinja2-2.11.1 joblib-0.14.1 kaggle-1.5.6 pandas-0.25.3 pandas-profiling-2.5.0 phik-0.9.9 pytest-pylint-0.15.1 python-slugify-4.0.0 scipy-1.4.1 tangled-up-in-unicode-0.0.3 text-unidecode-1.3 tqdm-4.42.0 typed-ast-1.4.1 visions-0.2.2
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
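The EDA steps given here are the commonly used ones. In practice, whether in engineering work or in a competition, they are only the first and most basic step.
Next, you generally analyze how the data actually behaves in modeling by looking at model performance and feature engineering together, and you draw on your own understanding and the literature to form a judgment and a deeper understanding of the real problem.
Finally, keep iterating between EDA and data processing and mining to arrive at a better data structure and distribution, and at more strongly correlated features.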
---
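In machine learning, data exploration is usually called EDA (Exploratory Data Analysis):
> A data analysis approach that explores existing data (especially raw data from surveys or observation) under as few prior assumptions as possible, using plots, tables, equation fitting, and summary statistics to uncover the structure and patterns in the data.
Data exploration helps us discover properties of the data and associations within it, which is very helpful for later feature construction.
1. An initial analysis of the data (inspecting it directly, or using statistical functions such as .sum(), .mean(), and .describe()) can cover: the number of samples, the size of the training set, whether there are time features, whether it is a time-series problem, what each feature means (for non-anonymized features), feature types (string, int, float, time), missing values per feature (note how missing values are represented in the data: some are empty, some are markers such as "NAN"), and the mean and variance of each feature.
2. Analyze and record how to handle features whose values are missing in more than 30% of samples; this helps later model validation and tuning. Decide whether the feature should be filled (and how: mean, zero, mode, and so on), dropped, or whether the samples should first be split into groups that are predicted with different feature models.
3. Analyze outliers specifically: check whether samples with abnormal feature values also have abnormal labels (values far from the mean, or special symbols), whether outliers should be removed or replaced with normal values, and whether they come from recording errors or from faults in the machine itself.
4. Analyze the label specifically, for example its distribution.
5. Further analysis can plot the features and plot features jointly with the label (statistical charts, scatter plots) to get an intuitive view of feature distributions. This step can also reveal outliers in the data. Use box plots to examine how far feature values deviate, and plot feature against feature and feature against label to analyze how they are related.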
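Points 2 and 3 of the checklist can be sketched in a few lines of pandas. The frame below is invented for illustration (column names echo the article, values do not come from the dataset), and the 1.5 × IQR rule is one common outlier convention chosen here as an assumption, not a method prescribed by the original text:

```python
import numpy as np
import pandas as pd

# Invented toy data: bodyType is half missing, power contains one extreme value.
toy = pd.DataFrame({
    "power": [75.0, 90.0, 110.0, 60.0, 9999.0, 85.0, 100.0, 95.0],
    "bodyType": [1.0, np.nan, np.nan, 2.0, np.nan, 3.0, np.nan, 1.0],
    "price": [3000, 4500, 6000, 2500, 5000, 4000, 5500, 4800],
})

# Point 2: share of missing values per column; flag columns above 30%.
missing_ratio = toy.isnull().mean()
high_missing = missing_ratio[missing_ratio > 0.3].index.tolist()

def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Point 3: list the power values that the IQR rule marks as outliers.
power_outliers = toy.loc[iqr_outliers(toy["power"]), "power"].tolist()

print(high_missing)
print(power_outliers)
```

Flagging is only the first half of each point. Whether a flagged column is filled, dropped, or modeled separately, and whether a flagged value is a recording error or a real extreme, still has to be decided by looking at the data.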