Car Insurance Customer Classification

Code: https://www.kaggle.com/manibhask/cleaning-visualizing-and-modeling-cold-call-data
Data: https://www.kaggle.com/kondla/carinsurance

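Let's look at the features in the dataset and what each attribute means. The table below gives a short description of each variable; the example values indicate whether it is continuous, categorical, or discrete.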

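Feature	Description	Example
Id	Unique identifier	“1” … “5000”
Age	Customer's age
Job	Customer's job	“admin.”, “blue-collar”, etc.
Marital	Customer's marital status	“divorced”, “married”, “single”
Education	Customer's education level	“primary”, “secondary”, etc.
Default	Whether the customer has credit in default	“yes” - 1, “no” - 0
Balance	Average yearly balance (USD)
HHInsurance	Whether the customer has household insurance	“yes” - 1, “no” - 0
CarLoan	Whether the customer has a car loan	“yes” - 1, “no” - 0
Communication	Contact communication type	“cellular”, “telephone”, “NA”
LastContactMonth	Month of the last contact	“jan”, “feb”, etc.
LastContactDay	Day of the month of the last contact
CallStart	Start time of the last call (HH:MM:SS)	12:43:15
CallEnd	End time of the last call (HH:MM:SS)	12:43:15
NoOfContacts	Number of contacts made to this customer during this campaign
DaysPassed	Days passed since the customer was last contacted; -1 means never contacted before
PrevAttempts	Number of contacts made to this customer before this campaign
Outcome	Outcome of the previous marketing campaign	“failure”, “other”, “success”, “NA”
CarInsurance	Whether the customer bought car insurance	“yes” - 1, “no” - 0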

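Data Wrangling

Data wrangling is the process of transforming data from one form into another so that it is easier to understand. Here the data comes as a CSV file, so we load it into a DataFrame with Python's data science libraries. It turned out to be simpler than I expected!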

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
%matplotlib inline
from sklearn.model_selection import train_test_split,cross_val_score,KFold,cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score,confusion_matrix,precision_recall_curve,roc_curve
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier,RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.neighbors  import KNeighborsClassifier
from sklearn import svm,tree

df = pd.read_csv('../data/carInsurance_train.csv',index_col = 'Id')

df.head()

	Age	Job	Marital	Education	Default	Balance	HHInsurance	CarLoan	Communication	LastContactDay	LastContactMonth	NoOfContacts	DaysPassed	PrevAttempts	Outcome	CallStart	CallEnd	CarInsurance
Id
1	32	management	single	tertiary	0	1218	1	0	telephone	28	jan	2	-1	0	NaN	13:45:20	13:46:30	0
2	32	blue-collar	married	primary	0	1156	1	0	NaN	26	may	5	-1	0	NaN	14:49:03	14:52:08	0
3	29	management	single	tertiary	0	637	1	0	cellular	3	jun	1	119	1	failure	16:30:24	16:36:04	1
4	25	student	single	primary	0	373	1	0	cellular	11	may	2	-1	0	NaN	12:06:43	12:20:22	1
5	30	management	married	tertiary	0	2694	0	0	cellular	3	jun	1	-1	0	NaN	14:35:44	14:38:56	0

df.shape

(4000, 18)

df.columns

Index(['Age', 'Job', 'Marital', 'Education', 'Default', 'Balance',
       'HHInsurance', 'CarLoan', 'Communication', 'LastContactDay',
       'LastContactMonth', 'NoOfContacts', 'DaysPassed', 'PrevAttempts',
       'Outcome', 'CallStart', 'CallEnd', 'CarInsurance'],
      dtype='object')

df.describe()

	Age	Default	Balance	HHInsurance	CarLoan	LastContactDay	NoOfContacts	DaysPassed	PrevAttempts	CarInsurance
count	4000.000000	4000.000000	4000.000000	4000.00000	4000.000000	4000.000000	4000.000000	4000.000000	4000.000000	4000.000000
mean	41.214750	0.014500	1532.937250	0.49275	0.133000	15.721250	2.607250	48.706500	0.717500	0.401000
std	11.550194	0.119555	3511.452489	0.50001	0.339617	8.425307	3.064204	106.685385	2.078647	0.490162
min	18.000000	0.000000	-3058.000000	0.00000	0.000000	1.000000	1.000000	-1.000000	0.000000	0.000000
25%	32.000000	0.000000	111.000000	0.00000	0.000000	8.000000	1.000000	-1.000000	0.000000	0.000000
50%	39.000000	0.000000	551.500000	0.00000	0.000000	16.000000	2.000000	-1.000000	0.000000	0.000000
75%	49.000000	0.000000	1619.000000	1.00000	0.000000	22.000000	3.000000	-1.000000	0.000000	1.000000
max	95.000000	1.000000	98417.000000	1.00000	1.000000	31.000000	43.000000	854.000000	58.000000	1.000000

df.dtypes

Age                  int64
Job                 object
Marital             object
Education           object
Default              int64
Balance              int64
HHInsurance          int64
CarLoan              int64
Communication       object
LastContactDay       int64
LastContactMonth    object
NoOfContacts         int64
DaysPassed           int64
PrevAttempts         int64
Outcome             object
CallStart           object
CallEnd             object
CarInsurance         int64
dtype: object

Summarize the non-numeric variables: the count, the number of distinct categories, the most frequent category, and its frequency.

df.describe(include=['O'])

	Job	Marital	Education	Communication	LastContactMonth	Outcome	CallStart	CallEnd
count	3981	4000	3831	3098	4000	958	4000	4000
unique	11	3	3	2	12	3	3777	3764
top	management	married	secondary	cellular	may	failure	15:27:56	10:22:30
freq	893	2304	1988	2831	1049	437	3	3

Outlier Analysis

https://blog.csdn.net/weixin_42056745/article/details/90516835

https://blog.csdn.net/yuxeaotao/article/details/79876377

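The box plot shows that Balance spans a wide range and has many outliers. Most of them form a continuous run, but the maximum lies far beyond every other value, so we remove that single row to reduce the risk of overfitting to it.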

sns.boxplot(x='Balance',data=df,palette='hls');

df.Balance.max()

df[df['Balance'] == 98417]

	Age	Job	Marital	Education	Default	Balance	HHInsurance	CarLoan	Communication	LastContactDay	LastContactMonth	NoOfContacts	DaysPassed	PrevAttempts	Outcome	CallStart	CallEnd	CarInsurance
Id
1743	59	management	married	tertiary	0	98417	0	0	telephone	20	nov	5	-1	0	NaN	10:51:42	10:54:07	0

df.index[1742]

# Drop the row holding the outlier (position 1742 in the index is Id 1743)
df_new = df.drop(df.index[1742]);
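Picking out the maximum by eye works for one extreme point. A more systematic option is the 1.5 × IQR rule, which is also what a box plot uses to place its whiskers. The sketch below applies it to a small made-up series; the name `balances` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical balances; only the last value is extreme.
balances = pd.Series([111, 551, 637, 1156, 1218, 1619, 2694, 98417], name="Balance")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = balances.quantile(0.25), balances.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = balances[(balances < lower) | (balances > upper)]
print(outliers)
```

On a strongly skewed column such as Balance this rule would probably flag far more than one row, which may be why only the single maximum is dropped here.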

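Handling Missing Values

Missing values are a major problem in data analysis, and handling them is a hurdle of its own. pandas represents missing data as NaN and skips it in most computations and plots, but a predictive model cannot be built without dealing with it. In our case the missing values occur mainly in the Outcome and Communication fields; Job and Education also have some.

Job and Education have relatively few missing values, so they can be imputed with pandas' backfill / forward-fill (pad) methods. Outcome and Communication have many, so their NaN values are filled with the explicit category 'none'.

fillna: https://blog.csdn.net/weixin_39549734/article/details/81221276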

df_new.isnull().sum()

Age                    0
Job                   19
Marital                0
Education            169
Default                0
Balance                0
HHInsurance            0
CarLoan                0
Communication        902
LastContactDay         0
LastContactMonth       0
NoOfContacts           0
DaysPassed             0
PrevAttempts           0
Outcome             3041
CallStart              0
CallEnd                0
CarInsurance           0
dtype: int64

# Forward fill: replace each missing value with the previous non-missing value
# (.ffill() is the current spelling of the deprecated fillna(method='pad'))

df_new['Job'] = df_new['Job'].ffill()
df_new['Education'] = df_new['Education'].ffill()

df_new['Communication'] = df_new['Communication'].fillna('none')
df_new['Outcome'] = df_new['Outcome'].fillna('none')

df_new['Outcome'].value_counts()

none       3041
failure     437
success     326
other       195
Name: Outcome, dtype: int64

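After filling the missing Outcome values with 'none', the count for 'none' is exactly 3041, which matches the number of missing values found earlier.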
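The two imputation strategies used above behave quite differently, which is easy to see on a toy frame. This is a minimal sketch with made-up rows (the frame `toy` is an assumption for illustration): forward fill copies the previous row's value, while filling with a constant turns "missing" into its own category.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in both columns.
toy = pd.DataFrame({
    "Job": ["management", np.nan, "student", np.nan],
    "Outcome": [np.nan, "failure", np.nan, "success"],
})

# Forward fill: each NaN takes the previous non-missing value in row order.
toy["Job"] = toy["Job"].ffill()

# Constant fill: treat "missing" as an explicit level of the category.
toy["Outcome"] = toy["Outcome"].fillna("none")

print(toy)
```

Forward fill depends on row order, and a NaN in the very first row would stay NaN because there is nothing before it to copy.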

df_new.isnull().sum()

Age                 0
Job                 0
Marital             0
Education           0
Default             0
Balance             0
HHInsurance         0
CarLoan             0
Communication       0
LastContactDay      0
LastContactMonth    0
NoOfContacts        0
DaysPassed          0
PrevAttempts        0
Outcome             0
CallStart           0
CallEnd             0
CarInsurance        0
dtype: int64

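Visualization

Visualization is an important part of data science; without it, results are hard to read off. The numbers in a table are definitive, but scanning the details to reach a conclusion is painful. Charts make this much easier for non-technical readers, and executives and managers like visual reports because they support complex decisions. Below is a pair plot, which plots the fields of interest against each other. Its variables were chosen from a heatmap as the ones that influence the outcome.

**Key takeaways from the pairplot**

* Customers aged 30-60 are more likely to buy car insurance [plot (1,1)].
* Customers who have a car loan or household insurance are somewhat less likely to buy [the bimodal plots at (3,3) and (4,4)].
* As the number of days passed (the time since they were last contacted) increases, customers tend to respond positively [plot (7,6)].
* When customers are contacted very often, their tendency to buy drops sharply after about 20 contacts [plots in the last row].
* The more contacts made to a customer during this campaign, the better the result, i.e. more car insurance purchases [plot (7,5)].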

df_sub = ['Age','Balance','HHInsurance', 'CarLoan','NoOfContacts','DaysPassed','PrevAttempts','CarInsurance']  # all numeric variables
sns.pairplot(df_new[df_sub],hue='CarInsurance',height=1.5);   # df_sub includes the target CarInsurance; `height` replaced `size` in seaborn 0.9

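A PairGrid lets us look at how CarInsurance and Balance relate to the categorical variables Education, Marital, and Job. Students and retirees buy car insurance most often [plot (1,3)], and single customers and those with tertiary education are also more inclined to buy [plots (1,1) and (1,2)]. The second row shows which groups have the higher average yearly balance. Each bar is a group mean: CarInsurance takes the values 0 and 1, so its mean is the share of buyers, while the Balance bars show the average balance.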

g = sns.PairGrid(df_new,
                 x_vars=["Education","Marital", "Job"],
                 y_vars=["CarInsurance", "Balance"],
                 aspect=.75, height=6)
plt.xticks(rotation=90)
g.map(sns.barplot, palette="pastel");

In the violin plot, the bulge near y = 1 indicates that March, September, October, and December are the best months for selling car insurance.

sns.violinplot(x="LastContactMonth",y='CarInsurance',data=df_new);

sns.countplot(x="Outcome",hue='CarInsurance',data=df_new);
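Because CarInsurance is binary, a violin plot can only show how much mass sits near 0 and near 1. The per-month purchase rate can be read more directly from a grouped mean, reported together with the group size. The sketch below uses a made-up frame (`toy` and its rows are assumptions); in the notebook the same call would run on `df_new`.

```python
import pandas as pd

# Hypothetical contacts: two in March, four in May.
toy = pd.DataFrame({
    "LastContactMonth": ["mar", "mar", "may", "may", "may", "may"],
    "CarInsurance":     [1,     1,     0,     0,     1,     0],
})

# Mean of a 0/1 column is the purchase rate; count is the group size.
rate = (toy.groupby("LastContactMonth")["CarInsurance"]
           .agg(rate="mean", n="count")
           .sort_values("rate", ascending=False))
print(rate)
```

Keeping `n` next to the rate matters: a month with very few contacts can show a high rate by chance.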

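Feature Engineering

Feature engineering is an essential part of a machine learning problem. Our data has continuous variables, such as Age and Balance, that need to be binned. We use the quantile-based cut function to split the continuous Age and Balance variables into 5 groups each.

pd.qcut: https://blog.csdn.net/starter_____/article/details/79327997

# qcut splits each attribute into 5 equal-frequency bins, coded 0, 1, 2, 3, 4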
df_new['AgeBinned'] = pd.qcut(df_new['Age'], 5 , labels = False)
df_new['BalanceBinned'] = pd.qcut(df_new['Balance'], 5,labels = False)
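To see what `pd.qcut(..., 5, labels=False)` produces, here is a small sketch on ten made-up ages (the series `ages` is an assumption for illustration). Each of the five bins receives the same number of rows, and `retbins=True` exposes the bin edges that were chosen.

```python
import pandas as pd

# Hypothetical ages, already sorted for readability.
ages = pd.Series([18, 25, 29, 32, 35, 39, 44, 49, 60, 95])

# labels=False returns integer bin codes 0..4 instead of interval labels.
codes = pd.qcut(ages, 5, labels=False)

# retbins=True also returns the six edges that delimit the five bins.
_, edges = pd.qcut(ages, 5, retbins=True)

print(codes.tolist())
print(edges)
```

If a column has many tied values, qcut can fail because two quantile edges coincide; passing `duplicates="drop"` merges such bins at the cost of ending up with fewer than five.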

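CallStart and CallEnd pose a problem of their own: they are stored as object (string) columns. Converting them to datetime and subtracting one from the other gives the actual call duration, CallTime, which can then be binned as above.

# Convert CallStart and CallEnd to datetime (an explicit format avoids format guessing)
df_new['CallStart'] = pd.to_datetime(df_new['CallStart'], format='%H:%M:%S')
df_new['CallEnd'] = pd.to_datetime(df_new['CallEnd'], format='%H:%M:%S')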

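# End time minus start time gives the call duration in seconds
df_new['CallTime'] = (df_new['CallEnd'] - df_new['CallStart']).dt.total_seconds()

# Bin the call duration
df_new['CallTimeBinned'] = pd.qcut(df_new['CallTime'], 5,labels = False)

# Drop the original columns that the binned features replace, to keep the frame tidy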
df_new.drop(['Age','Balance','CallStart','CallEnd','CallTime'],axis = 1,inplace = True)
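Since the two columns hold times of day without a date, the duration can also be computed with `pd.to_timedelta`, which parses "HH:MM:SS" strings directly. The sketch below is an alternative under stated assumptions: the frame `calls` is made up, and the midnight guard is a hypothetical safeguard that the original notebook does not include.

```python
import pandas as pd

# Hypothetical calls; the last one crosses midnight.
calls = pd.DataFrame({
    "CallStart": ["13:45:20", "14:49:03", "23:59:30"],
    "CallEnd":   ["13:46:30", "14:52:08", "00:01:00"],
})

# Parse the clock times as offsets from midnight and subtract.
start = pd.to_timedelta(calls["CallStart"])
end = pd.to_timedelta(calls["CallEnd"])
seconds = (end - start).dt.total_seconds()

# Hypothetical guard: read a negative duration as a call that crossed midnight.
seconds = seconds.where(seconds >= 0, seconds + 24 * 3600)
print(seconds.tolist())
```

Whether any call in the real data crosses midnight is not something the source states; checking `(df_new['CallTime'] < 0).any()` before binning would settle it.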

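Categorical variables can also take part in model building, provided they are first converted to dummy (one-hot) variables. This adds more columns to the DataFrame.

get_dummies usage: https://blog.csdn.net/maymay_/article/details/80198468

# Use get_dummies to create one binary column per value of each categorical column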
Job = pd.get_dummies(data = df_new['Job'],prefix = "Job")
Marital= pd.get_dummies(data = df_new['Marital'],prefix = "Marital")
Education= pd.get_dummies(data = df_new['Education'],prefix="Education")
Communication = pd.get_dummies(data = df_new['Communication'],prefix = "Communication")
LastContactMonth = pd.get_dummies(data = df_new['LastContactMonth'],prefix= "LastContactMonth")
Outcome = pd.get_dummies(data = df_new['Outcome'],prefix = "Outcome")

# Drop the categorical columns that now have dummy variables
df_new.drop(['Job','Marital','Education','Communication','LastContactMonth','Outcome'],axis=1,inplace=True)

# Concatenate all the columns we need
df = pd.concat([df_new,Job,Marital,Education,Communication,LastContactMonth,Outcome],axis=1)

df.columns

Index(['Default', 'HHInsurance', 'CarLoan', 'LastContactDay', 'NoOfContacts',
       'DaysPassed', 'PrevAttempts', 'CarInsurance', 'AgeBinned',
       'BalanceBinned', 'CallTimeBinned', 'Job_admin.', 'Job_blue-collar',
       'Job_entrepreneur', 'Job_housemaid', 'Job_management', 'Job_retired',
       'Job_self-employed', 'Job_services', 'Job_student', 'Job_technician',
       'Job_unemployed', 'Marital_divorced', 'Marital_married',
       'Marital_single', 'Education_primary', 'Education_secondary',
       'Education_tertiary', 'Communication_cellular', 'Communication_none',
       'Communication_telephone', 'LastContactMonth_apr',
       'LastContactMonth_aug', 'LastContactMonth_dec', 'LastContactMonth_feb',
       'LastContactMonth_jan', 'LastContactMonth_jul', 'LastContactMonth_jun',
       'LastContactMonth_mar', 'LastContactMonth_may', 'LastContactMonth_nov',
       'LastContactMonth_oct', 'LastContactMonth_sep', 'Outcome_failure',
       'Outcome_none', 'Outcome_other', 'Outcome_success'],
      dtype='object')
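The encode, drop, and concatenate steps can also be done in a single call by passing `columns=` to `pd.get_dummies`. The sketch below shows this on a made-up frame (`toy` is an assumption for illustration); it is a shorter equivalent, not the notebook's own code.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Marital": ["single", "married", "divorced", "married"],
    "CarLoan": [0, 1, 0, 0],
})

# columns= encodes the listed columns and leaves the others untouched;
# dtype=int yields 0/1 regardless of the pandas version's default dtype.
encoded = pd.get_dummies(toy, columns=["Marital"], prefix="Marital", dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

For linear models such as logistic regression, `drop_first=True` is sometimes added so that one level per variable serves as the reference and the dummy columns are not perfectly collinear.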

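A model is normally trained on data with known outputs (labeled data) so that it can learn from them, and is then tested on held-out data it has not seen, which gives an estimate of its predictive accuracy. That is the purpose of the train/test split.

train_test_split: https://blog.csdn.net/Lynn_mg/article/details/83062630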

X= df.drop(['CarInsurance'],axis=1).values
y=df['CarInsurance'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=42, stratify = y) # stratify=y keeps the class proportions of y in both splits
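The effect of `stratify` is easiest to check on labels with a known ratio. In this sketch the arrays `X_toy` and `y_toy` are made up, with a 60/40 class split chosen to resemble the CarInsurance balance; both resulting splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy)

# Share of positives in each split.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of positives in a small test set can drift from the overall rate purely by chance, which makes test metrics harder to compare.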

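Building and Validating Predictive Models

**Predictive models** sklearn bundles many classification algorithms, and we use most of those relevant to this problem. Our classifiers are: 1. kNN 2. Logistic Regression 3. SVM 4. Decision Tree 5. Random Forest 6. AdaBoost 7. XGBoost (implemented below with scikit-learn's GradientBoostingClassifier, not the xgboost library)

**Cross-validation**

Cross-validation repeatedly splits the data into a training part and a validation part to evaluate a model. In KFold, K is the number of folds the data is divided into; in each round one fold is held out for validation and the remaining K-1 folds are used for training. In our case K = 10, so each model is trained on 9 folds and scored on the tenth, and its cross-validation score is the mean over the 10 rounds.

The best models are **Random Forest and XGBoost**, both of which reach good accuracy.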
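The fold sizes are easy to verify by iterating over a splitter directly. This is a minimal sketch on a made-up array of 20 rows (`X_toy` is an assumption for illustration): with 10 folds, every round trains on 18 rows and validates on 2.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 20 rows, one feature.
X_toy = np.arange(20).reshape(-1, 1)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# For each round, record (number of training rows, number of validation rows).
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X_toy)]
print(sizes)
```

Note that `cross_val_score(model, X, y, cv=10)` with a classifier uses stratified folds by default, so each fold also keeps roughly the overall class ratio.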

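# The confusion-matrix plotting code below is taken from the sklearn documentation
# Define the confusion-matrix plotting function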
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    
   
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    


    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
        
        
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

class_names = ['Failure','Success']  # confusion_matrix orders labels as 0 (no purchase), 1 (purchase)

knn = KNeighborsClassifier(n_neighbors = 6)
knn.fit(X_train,y_train)
print ("kNN Accuracy is %2.2f" % accuracy_score(y_test, knn.predict(X_test)))

# 10-fold cross-validation
score_knn = cross_val_score(knn, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_knn)
y_pred= knn.predict(X_test)
print(classification_report(y_test, y_pred))


cm = confusion_matrix(y_test,y_pred)
# Plot the confusion matrix
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

kNN Accuracy is 0.76
Cross Validation Score = 0.75
              precision    recall  f1-score   support

           0       0.75      0.90      0.82       479
           1       0.78      0.55      0.65       321

    accuracy                           0.76       800
   macro avg       0.76      0.72      0.73       800
weighted avg       0.76      0.76      0.75       800

#Logistic Regression Classifier
LR = LogisticRegression()
LR.fit(X_train,y_train)
print ("Logistic Accuracy is %2.2f" % accuracy_score(y_test, LR.predict(X_test)))
score_LR = cross_val_score(LR, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_LR)
y_pred = LR.predict(X_test)
print(classification_report(y_test, y_pred))
# Confusion matrix for LR
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

Logistic Accuracy is 0.83
Cross Validation Score = 0.81
              precision    recall  f1-score   support

           0       0.85      0.87      0.86       479
           1       0.80      0.78      0.79       321

    accuracy                           0.83       800
   macro avg       0.82      0.82      0.82       800
weighted avg       0.83      0.83      0.83       800

SVM = svm.SVC()
SVM.fit(X_train, y_train)
print ("SVM Accuracy is %2.2f" % accuracy_score(y_test, SVM.predict(X_test)))
score_svm = cross_val_score(SVM, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_svm)
y_pred = SVM.predict(X_test)
print(classification_report(y_test,y_pred))
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

SVM Accuracy is 0.67
Cross Validation Score = 0.66
              precision    recall  f1-score   support

           0       0.66      0.91      0.77       479
           1       0.70      0.31      0.43       321

    accuracy                           0.67       800
   macro avg       0.68      0.61      0.60       800
weighted avg       0.68      0.67      0.63       800

# Decision Tree Classifier
DT = tree.DecisionTreeClassifier(random_state = 0,class_weight="balanced",
    min_weight_fraction_leaf=0.01)
DT = DT.fit(X_train,y_train)
print ("Decision Tree Accuracy is %2.2f" % accuracy_score(y_test, DT.predict(X_test)))
score_DT = cross_val_score(DT, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_DT)
y_pred = DT.predict(X_test)
print(classification_report(y_test, y_pred))
# Confusion Matrix for Decision Tree
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

Decision Tree Accuracy is 0.82
Cross Validation Score = 0.81
              precision    recall  f1-score   support

           0       0.88      0.81      0.84       479
           1       0.74      0.83      0.79       321

    accuracy                           0.82       800
   macro avg       0.81      0.82      0.81       800
weighted avg       0.82      0.82      0.82       800

#Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=10,class_weight="balanced")
rfc.fit(X_train, y_train)
print ("Random Forest Accuracy is %2.2f" % accuracy_score(y_test, rfc.predict(X_test)))
score_rfc = cross_val_score(rfc, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_rfc)
y_pred = rfc.predict(X_test)
print(classification_report(y_test,y_pred ))
#Confusion Matrix for Random Forest
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

Random Forest Accuracy is 0.86
Cross Validation Score = 0.84
              precision    recall  f1-score   support

           0       0.90      0.86      0.88       479
           1       0.80      0.86      0.83       321

    accuracy                           0.86       800
   macro avg       0.85      0.86      0.85       800
weighted avg       0.86      0.86      0.86       800

#AdaBoost Classifier
ada = AdaBoostClassifier(n_estimators=400, learning_rate=0.1)
ada.fit(X_train,y_train)
print ("AdaBoost Accuracy= %2.2f" % accuracy_score(y_test,ada.predict(X_test)))
score_ada = cross_val_score(ada, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_ada)
y_pred = ada.predict(X_test)
print(classification_report(y_test,y_pred ))
#Confusion Marix for AdaBoost
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

AdaBoost Accuracy= 0.83
Cross Validation Score = 0.82
              precision    recall  f1-score   support

           0       0.83      0.90      0.86       479
           1       0.82      0.73      0.77       321

    accuracy                           0.83       800
   macro avg       0.83      0.81      0.82       800
weighted avg       0.83      0.83      0.83       800

#Gradient Boosting Classifier (scikit-learn's GradientBoostingClassifier; the text calls it XGBoost)
xgb = GradientBoostingClassifier(n_estimators=1000,learning_rate=0.01)
xgb.fit(X_train,y_train)
print ("GradientBoost Accuracy= %2.2f" % accuracy_score(y_test,xgb.predict(X_test)))
score_xgb = cross_val_score(xgb, X, y, cv=10).mean()
# Print score_xgb; the recorded run printed score_ada here, so the CV score shown below is AdaBoost's
print("Cross Validation Score = %2.2f" % score_xgb)
y_pred = xgb.predict(X_test)
print(classification_report(y_test,y_pred))
#Confusion Matrix for the Gradient Boosting Classifier
cm_xg = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm_xg, classes=class_names, title='Confusion matrix')

GradientBoost Accuracy= 0.85
Cross Validation Score = 0.82
              precision    recall  f1-score   support

           0       0.87      0.89      0.88       479
           1       0.82      0.79      0.81       321

    accuracy                           0.85       800
   macro avg       0.84      0.84      0.84       800
weighted avg       0.85      0.85      0.85       800

ROC Curve

The ROC graph plots every model. The curves for Gradient Boosting (XGBoost) and Random Forest lie closest to the top-left corner, which indicates that these are the best predictors.

ROC: https://blog.csdn.net/kMD8d5R/article/details/98552574?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-4.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-4.nonecase

#Obtaining False Positive Rate, True Positive Rate and Threshold for all classifiers
fpr, tpr, thresholds = roc_curve(y_test, knn.predict_proba(X_test)[:,1])
LR_fpr, LR_tpr, thresholds = roc_curve(y_test, LR.predict_proba(X_test)[:,1])
#SVM_fpr, SVM_tpr, thresholds = roc_curve(y_test, SVM.predict_proba(X_test)[:,1])
DT_fpr, DT_tpr, thresholds = roc_curve(y_test, DT.predict_proba(X_test)[:,1])
rfc_fpr, rfc_tpr, thresholds = roc_curve(y_test, rfc.predict_proba(X_test)[:,1])
ada_fpr, ada_tpr, thresholds = roc_curve(y_test, ada.predict_proba(X_test)[:,1])
xgb_fpr, xgb_tpr, thresholds = roc_curve(y_test, xgb.predict_proba(X_test)[:,1])
#PLotting ROC Curves for all classifiers
plt.plot(fpr, tpr, label='KNN' )
plt.plot(LR_fpr, LR_tpr, label='Logistic Regression')
#plt.plot(SVM_fpr, SVM_tpr, label='SVM')
plt.plot(DT_fpr, DT_tpr, label='Decision Tree')
plt.plot(rfc_fpr, rfc_tpr, label='Random Forest')
plt.plot(ada_fpr, ada_tpr, label='AdaBoost')
plt.plot(xgb_fpr, xgb_tpr, label='GradientBoosting')
# Plot Base Rate ROC
plt.plot([0,1],[0,1],label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
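Comparing curves by eye gets hard when they cross, so it is common to summarize each one by its area under the curve (AUC). The sketch below computes it with `roc_auc_score` on four made-up predictions (`y_true` and `y_score` are assumptions for illustration, not results from the models above).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# The points of the ROC curve, and the area under it.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)
```

In the notebook the equivalent call would be `roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])`, repeated for each fitted model; an AUC of 0.5 corresponds to the diagonal base-rate line.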

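Feature Importance

Important features are identified with models such as Logistic Regression (through recursive feature elimination) and tree-based models. Both give a clear picture. The chart below shows the most important variables according to ExtraTreesClassifier; the top 10 are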

CallTime
LastContactDay
Balance
NoofContacts
Outcome_success
Age
HHInsurance
Communication_none
Dayspassed
Outcome_none

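# Recursive feature elimination (RFE) wrapped around a Logistic Regression model
modell = LogisticRegression()
rfe = RFE(modell, n_features_to_select=5)  # must be passed by keyword in current scikit-learn
rfe = rfe.fit(X_train,y_train)
# Show the feature ranking (1 = selected)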
rfe.ranking_

array([10,  9, 16, 41, 34, 42, 32, 36, 33,  3, 24, 15, 17, 27, 28, 20, 22,
       29,  2, 40, 26, 38, 21, 37, 31, 30, 25, 19,  1, 18, 39,  7, 11, 35,
        4,  5, 23,  1,  6,  8,  1,  1, 13, 12, 14,  1])
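The raw `ranking_` array is hard to read without the column names next to it. A minimal sketch of pairing the two follows; the names and ranks are placeholders (assumptions for illustration), and in the notebook they would be `df.drop(['CarInsurance'], axis=1).columns` and `rfe.ranking_`.

```python
import numpy as np
import pandas as pd

# Placeholder feature names and RFE ranks (1 means the feature was kept).
feat_names = np.array(["feature_a", "feature_b", "feature_c", "feature_d"])
ranking = np.array([3, 2, 1, 1])

# Attach names to ranks and sort so the selected features come first.
ranked = pd.Series(ranking, index=feat_names).sort_values(kind="stable")
selected = ranked[ranked == 1].index.tolist()

print(ranked)
print(selected)
```

The fitted selector also exposes a boolean mask, `rfe.support_`, which marks the same rank-1 features.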

# Fit an ExtraTreesClassifier model
model = ExtraTreesClassifier()
model.fit(X_train, y_train)


print(model.feature_importances_)
importances = model.feature_importances_
feat_names = df.drop(['CarInsurance'],axis=1).columns

# Sort the features by importance in descending order and plot them
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12,6))
plt.title("Feature importances")
plt.bar(range(len(indices)), importances[indices], color='lightblue',  align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()

[0.00268756 0.0313893  0.01658237 0.06520325 0.04730283 0.01636741
 0.01347456 0.04461972 0.04937946 0.25765298 0.01216092 0.01304826
 0.00579582 0.00493788 0.01324779 0.0097079  0.00640912 0.0091661
 0.00601809 0.01457932 0.00655431 0.01135298 0.01605904 0.01331409
 0.00989652 0.01628276 0.01403667 0.01703035 0.02387476 0.00606056
 0.01792573 0.01598123 0.00329203 0.0090054  0.00783392 0.01389945
 0.01580459 0.01121934 0.01654914 0.00965812 0.01079382 0.00951623
 0.00932224 0.02123694 0.00591377 0.04785536]

rfc = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=10,class_weight="balanced")
y_proba = cross_val_predict(rfc, X, y, cv=10, n_jobs=-1, method='predict_proba')
results = pd.DataFrame({'y': y, 'y_proba': y_proba[:,1]})
results = results.sort_values(by='y_proba', ascending=False).reset_index(drop=True)
results.index = results.index + 1
results.index = results.index / len(results.index) * 100

sns.set_style('darkgrid')
pred = results
pred['Lift Curve'] = pred.y.cumsum() / pred.y.sum() * 100
pred['Baseline'] = pred.index
base_rate = y.sum() / len(y) * 100
pred[['Lift Curve', 'Baseline']].plot(style=['-', '--', '--'])
pd.Series(data=[0, 100, 100], index=[0, base_rate, 100]).plot(style='--')
plt.title('Cumulative Gains')
plt.xlabel('% of Customers Contacted')
plt.ylabel("% of Positive Results")
plt.legend(['Lift Curve', 'Baseline', 'Ideal']);
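A cumulative gains chart answers one question: if only the top x% of customers by predicted probability are called, what share of all buyers is reached? The sketch below computes a single point of that curve. The helper `gain_at` and the toy arrays are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np

def gain_at(y_true, y_proba, fraction):
    """Share of all positives found in the top `fraction` of rows by score."""
    # Sort rows from highest to lowest predicted probability.
    order = np.argsort(-np.asarray(y_proba), kind="stable")
    y_sorted = np.asarray(y_true)[order]
    # Number of rows that make up the requested top fraction.
    k = int(np.ceil(fraction * len(y_sorted)))
    return y_sorted[:k].sum() / y_sorted.sum()

# Hypothetical outcomes and scores for ten customers (three buyers).
y_toy = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(gain_at(y_toy, p_toy, 0.5))
```

Here calling the top half of the list reaches two of the three buyers; random calling would be expected to reach about half of them, which is what the baseline diagonal in the chart represents.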

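Recommendations

**1. Train call-center staff in interpersonal skills so that they are friendlier and more engaging on calls**

**2. Keep a tracker that prompts follow-ups, so that representatives can talk to the person again and try to persuade them to buy car insurance**

**3. Target people with good credit scores and account balances, so that the time spent on them pays off**

**4. Focus on older people, aged over 40, who according to past data are easier to bring round to a new plan**

**5. Follow up with contacts who responded in the previous campaign, since they are more likely to buy insurance**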
