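Preface
After extracting a large number of features, which of them are actually useful, and which can be dropped? This post looks at that question.
PCA check
Use any dataset you have at hand; it may need to be normalized first.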
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

dataPath = "datalab/36395/"
dataFile = "dataProcessByTwoTask.csv"
data = pd.read_csv(dataPath + dataFile, encoding='utf-8')
print(data.shape)
print(data.head(2))
# Indices 0..68; note that data.iloc[ab] selects the first 69 rows, not columns
# (data.iloc[:, ab] would select the first 69 columns instead)
ab = np.arange(69)
print(ab)
X_train = data.iloc[ab]
pca_sk = PCA(n_components=50)
# Use the training features to find 50 orthogonal directions, then transform them
pca_X_train = pca_sk.fit_transform(X_train)
# pca_X_test = pca_sk.transform(X_train)
print("Shape after dimensionality reduction: {}".format(pca_X_train.shape))
print(pca_X_train[0:2])
Output:
(4423, 70)
take_amount_in_later_12_month_highest trans_amount_increase_rate_lately \
0 0 0.90
1 2000 1.28
transd_mcc trans_days_interval_filter trans_days_interval \
0 17.0 27.0 26.0
1 19.0 30.0 14.0
repayment_capability number_of_trans_from_2011 historical_trans_day \
0 19890 30.0 151.0
1 16970 23.0 224.0
rank_trad_1_month trans_amount_3_month ... \
0 0.40 34030 ...
1 0.35 10590 ...
loans_avg_limit consfin_credit_limit consfin_credibility \
0 1688.0 1200.0 75.0
1 1758.0 15100.0 80.0
consfin_org_count_current consfin_product_count consfin_max_limit \
0 1.0 2.0 1200.0
1 5.0 6.0 22800.0
consfin_avg_limit latest_query_day loans_latest_day \
0 1200.0 12.0 18.0
1 9360.0 4.0 2.0
first_transaction_time_day
0 17
1 2
[2 rows x 70 columns]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68]
Shape after dimensionality reduction: (69, 50)
[[ -1.07522268e+05 -6.30441595e+03 2.82432053e+03 -1.93347881e+04
-1.65625183e+03 -8.86291024e+03 4.69082780e+02 -5.41692974e+02
9.59855188e+02 -1.17498610e+03 -8.27317923e+02 7.44677439e+02
-2.95320251e+02 1.00262713e+03 2.88593126e+02 -1.67027282e+02
1.00704755e+01 -5.51748767e+00 -2.64264010e+01 1.03976939e+01
2.41833376e+00 -8.67013872e+00 -1.39779453e+01 1.55178837e+00
-6.08964397e+00 -4.07650452e+00 -1.59907843e+01 4.33065850e+00
7.00473045e+00 -1.01812729e+01 -1.90941839e+00 -3.02104040e+00
-6.30292531e+00 7.82081815e+00 -4.50838859e-01 -2.44343401e-01
-2.33156766e+00 -1.55272399e+00 6.03854544e-01 -4.44479902e-01
-3.94035823e+00 -1.42114318e+00 -1.58559025e+00 3.19432485e+00
3.16414591e+00 -1.11029179e+00 1.89054341e-01 3.52892846e+00
-2.48207266e-01 -3.74139917e-02]
[ -1.07287429e+05 -1.15331483e+04 -1.47156042e+04 9.74568582e+03
-8.83707652e+03 -1.80063437e+03 -2.02682915e+03 2.20253946e+03
2.18952762e+03 5.81271155e+02 7.46027569e+02 9.23489176e+02
1.74562840e+02 -9.05207703e+01 7.89127100e+00 3.50719339e+01
8.87024342e+01 1.07562984e+02 -3.73552973e+01 -4.56951209e+00
1.41880195e+01 1.64441214e+01 -8.28721202e-01 -3.01259284e+00
-4.83479942e+00 2.43014026e+00 -1.37180875e+01 -6.89473596e+00
5.67154532e+00 1.29839303e+00 -6.05073773e+00 -1.66655939e+00
1.22784208e+01 -2.05540945e+00 1.63740245e+00 -1.60083170e-01
3.87686452e+00 3.48324574e+00 1.33216991e+00 5.51007857e-01
-8.96974992e-01 -5.30343056e+00 4.21107917e+00 -1.04052498e+00
-1.93495138e+00 -2.78269936e+00 -1.00751398e+00 1.55444165e+00
2.30216194e-01 -6.47515436e-01]]
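The first column of the transformed rows above is on the order of 1e5 while the last columns are below 1, which suggests that a few large-scale columns (amounts, limits) dominate the variance. The following is a minimal sketch, on synthetic data rather than the CSV used above, of how standardizing before PCA changes the picture and how `explained_variance_ratio_` can guide the choice of `n_components`. The feature scales and the 95% threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 independent features on very different scales
scales = np.array([1e5, 1e3, 10.0, 1.0, 0.1])
X = rng.normal(size=(200, 5)) * scales

# Without scaling, the largest-scale column dominates the first component
pca_raw = PCA(n_components=5).fit(X)
raw_ratio = pca_raw.explained_variance_ratio_

# After standardization every feature has unit variance, so the ratios even out
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=5).fit(X_std)
std_ratio = pca_std.explained_variance_ratio_

print("raw:", np.round(raw_ratio, 4))
print("standardized:", np.round(std_ratio, 4))

# Smallest number of components whose cumulative ratio reaches 95% (illustrative threshold)
n_keep = int(np.searchsorted(np.cumsum(std_ratio), 0.95) + 1)
print("components for 95% variance:", n_keep)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it keeps as many components as are needed to explain that fraction of the variance.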
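Strengths and weaknesses of PCA
PCA (principal component analysis) is not the same as feature selection. Rather than picking a subset of the original features, it builds new features that retain as much of the original variance as possible. That is why the reduced data has no recognizable feature names: PCA lowers the dimensionality of the dataset while keeping the directions that contribute most to its variance, and in doing so it changes the form of the original features.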
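Although the transformed columns have no names, each principal component is a linear combination of the original features, and the weights can be read from the fitted model's `components_` attribute. The sketch below uses made-up values; the three column names are borrowed from the printout earlier purely as labels, and the correlation structure is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in data: two strongly correlated columns and one independent column
base = rng.normal(size=300)
df = pd.DataFrame({
    "consfin_credit_limit": base + 0.1 * rng.normal(size=300),
    "consfin_max_limit": base + 0.1 * rng.normal(size=300),
    "latest_query_day": rng.normal(size=300),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))

# Rows are principal components, columns are the original features
loadings = pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2"])
print(loadings.round(3))

# Original features ranked by absolute weight in the first component
top_pc1 = loadings.loc["PC1"].abs().sort_values(ascending=False)
print(top_pc1)
```

Large absolute weights show which original features drive a component, which gives a rough hint about feature relevance, though it is still not a feature-selection method.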
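The main advantages of PCA are:
1) Information content is measured by variance alone, so the result is not affected by anything outside the dataset.
2) The principal components are mutually orthogonal, which removes the mutual influence between components of the original data.
3) The computation is simple and easy to implement; the main operation is an eigenvalue decomposition.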
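The points about orthogonality and eigenvalue decomposition can be sketched in a few lines of NumPy. This is a bare-bones illustration on synthetic data, not what scikit-learn does internally (its `PCA` is based on SVD); the mixing matrix `A` and the choice `k = 2` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 500 samples, 3 features
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.normal(size=(500, 3)) @ A

# 1) Center the data
Xc = X - X.mean(axis=0)
# 2) Covariance matrix of the features
cov = np.cov(Xc, rowvar=False)
# 3) Eigen decomposition (eigh handles symmetric matrices, eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4) Sort directions by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5) Project onto the top-k directions
k = 2
Z = Xc @ eigvecs[:, :k]

# The principal directions are orthonormal
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))
# The projected components are uncorrelated: off-diagonal covariance is ~0
print(np.round(np.cov(Z, rowvar=False), 6))
```

The diagonal of the projected covariance matrix equals the largest eigenvalues, which is the variance each component retains.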
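The main disadvantages of PCA are:
1) The meaning of each principal component dimension is somewhat vague, so the components are less interpretable than the original features.
2) Non-principal components with small variance may still carry important information about differences between samples, so discarding them during dimensionality reduction can affect later processing.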
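The second disadvantage can be made concrete with a toy example. In the sketch below, which uses entirely synthetic two-class data with assumed scales, the high-variance direction carries no class information while the low-variance direction separates the classes, so keeping only the first component discards the useful signal. The helper `threshold_accuracy` is a hypothetical name introduced just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical two-class data: feature 0 has large variance but no class signal,
# feature 1 has small variance but separates the classes
y = np.repeat([0, 1], n)
noise_big = rng.normal(scale=10.0, size=2 * n)
signal_small = np.where(y == 0, -1.0, 1.0) + rng.normal(scale=0.3, size=2 * n)
X = np.column_stack([noise_big, signal_small])

# PCA via eigen decomposition of the covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
Z = Xc @ eigvecs[:, order]  # column 0 is PC1, column 1 is PC2

def threshold_accuracy(z, labels):
    # Accuracy of the rule "z > 0", allowing for the arbitrary sign of a component
    acc = np.mean((z > 0) == (labels == 1))
    return max(acc, 1 - acc)

acc_pc1 = threshold_accuracy(Z[:, 0], y)
acc_pc2 = threshold_accuracy(Z[:, 1], y)
pc1_share = eigvals[order][0] / eigvals.sum()
print("variance share of PC1:", pc1_share)
print("accuracy using PC1 only:", acc_pc1)
print("accuracy using PC2 only:", acc_pc2)
```

When labels are available, it is therefore worth checking downstream performance with and without the discarded components instead of relying on explained variance alone.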