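Code: https://www.kaggle.com/manibhask/cleaning-visualizing-and-modeling-cold-call-data
Data: https://www.kaggle.com/kondla/carinsurance
Let's look at the features of the dataset and what each attribute means. The table below gives a brief description of each feature, and the example values hint at whether a variable is continuous, categorical, or discrete.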
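Feature | Description | Example |
---|---|---|
Id | Unique identifier | “1” … “5000” |
Age | Age of the client | |
Job | Job of the client | “admin.”, “blue-collar”, etc. |
Marital | Marital status of the client | “divorced”, “married”, “single” |
Education | Education level of the client | “primary”, “secondary”, etc. |
Default | Whether the client has credit in default | “yes” - 1, “no” - 0 |
Balance | Average yearly balance (USD) | |
HHInsurance | Whether the client has household insurance | “yes” - 1, “no” - 0 |
CarLoan | Whether the client has a car loan | “yes” - 1, “no” - 0 |
Communication | Contact communication type | “cellular”, “telephone”, “NA” |
LastContactMonth | Month of the last contact | “jan”, “feb”, etc. |
LastContactDay | Day of the last contact | |
CallStart | Start time of the last call (HH:MM:SS) | 12:43:15 |
CallEnd | End time of the last call (HH:MM:SS) | 12:43:15 |
NoOfContacts | Number of contacts made to this client during this campaign | |
DaysPassed | Number of days since the client was last contacted; -1 means the client has not been contacted before | |
PrevAttempts | Number of contacts made to this client before this campaign | |
Outcome | Outcome of the previous marketing campaign | “failure”, “other”, “success”, “NA” |
CarInsurance | Whether the client bought car insurance | “yes” - 1, “no” - 0 |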
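Data Wrangling
Data wrangling is the process of converting data from one form into another so that it is easier to understand. Here the data comes as a CSV file, which we load into a data frame using Python's data science libraries. It turns out to be simpler than I expected!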
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
%matplotlib inline
from sklearn.model_selection import train_test_split,cross_val_score,KFold,cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score,confusion_matrix,precision_recall_curve,roc_curve
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier,RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm,tree
df = pd.read_csv('../data/carInsurance_train.csv',index_col = 'Id')
df.head()
Id | Age | Job | Marital | Education | Default | Balance | HHInsurance | CarLoan | Communication | LastContactDay | LastContactMonth | NoOfContacts | DaysPassed | PrevAttempts | Outcome | CallStart | CallEnd | CarInsurance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 32 | management | single | tertiary | 0 | 1218 | 1 | 0 | telephone | 28 | jan | 2 | -1 | 0 | NaN | 13:45:20 | 13:46:30 | 0 |
2 | 32 | blue-collar | married | primary | 0 | 1156 | 1 | 0 | NaN | 26 | may | 5 | -1 | 0 | NaN | 14:49:03 | 14:52:08 | 0 |
3 | 29 | management | single | tertiary | 0 | 637 | 1 | 0 | cellular | 3 | jun | 1 | 119 | 1 | failure | 16:30:24 | 16:36:04 | 1 |
4 | 25 | student | single | primary | 0 | 373 | 1 | 0 | cellular | 11 | may | 2 | -1 | 0 | NaN | 12:06:43 | 12:20:22 | 1 |
5 | 30 | management | married | tertiary | 0 | 2694 | 0 | 0 | cellular | 3 | jun | 1 | -1 | 0 | NaN | 14:35:44 | 14:38:56 | 0 |
df.shape
(4000, 18)
df.columns
Index(['Age', 'Job', 'Marital', 'Education', 'Default', 'Balance',
'HHInsurance', 'CarLoan', 'Communication', 'LastContactDay',
'LastContactMonth', 'NoOfContacts', 'DaysPassed', 'PrevAttempts',
'Outcome', 'CallStart', 'CallEnd', 'CarInsurance'],
dtype='object')
df.describe()
Statistic | Age | Default | Balance | HHInsurance | CarLoan | LastContactDay | NoOfContacts | DaysPassed | PrevAttempts | CarInsurance |
---|---|---|---|---|---|---|---|---|---|---|
count | 4000.000000 | 4000.000000 | 4000.000000 | 4000.00000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 |
mean | 41.214750 | 0.014500 | 1532.937250 | 0.49275 | 0.133000 | 15.721250 | 2.607250 | 48.706500 | 0.717500 | 0.401000 |
std | 11.550194 | 0.119555 | 3511.452489 | 0.50001 | 0.339617 | 8.425307 | 3.064204 | 106.685385 | 2.078647 | 0.490162 |
min | 18.000000 | 0.000000 | -3058.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 | -1.000000 | 0.000000 | 0.000000 |
25% | 32.000000 | 0.000000 | 111.000000 | 0.00000 | 0.000000 | 8.000000 | 1.000000 | -1.000000 | 0.000000 | 0.000000 |
50% | 39.000000 | 0.000000 | 551.500000 | 0.00000 | 0.000000 | 16.000000 | 2.000000 | -1.000000 | 0.000000 | 0.000000 |
75% | 49.000000 | 0.000000 | 1619.000000 | 1.00000 | 0.000000 | 22.000000 | 3.000000 | -1.000000 | 0.000000 | 1.000000 |
max | 95.000000 | 1.000000 | 98417.000000 | 1.00000 | 1.000000 | 31.000000 | 43.000000 | 854.000000 | 58.000000 | 1.000000 |
df.dtypes
Age int64
Job object
Marital object
Education object
Default int64
Balance int64
HHInsurance int64
CarLoan int64
Communication object
LastContactDay int64
LastContactMonth object
NoOfContacts int64
DaysPassed int64
PrevAttempts int64
Outcome object
CallStart object
CallEnd object
CarInsurance int64
dtype: object
Next, summarize the non-numeric variables: the count, the number of distinct categories, the most frequent category, and its frequency.
df.describe(include=['O'])
Statistic | Job | Marital | Education | Communication | LastContactMonth | Outcome | CallStart | CallEnd |
---|---|---|---|---|---|---|---|---|
count | 3981 | 4000 | 3831 | 3098 | 4000 | 958 | 4000 | 4000 |
unique | 11 | 3 | 3 | 2 | 12 | 3 | 3777 | 3764 |
top | management | married | secondary | cellular | may | failure | 15:27:56 | 10:22:30 |
freq | 893 | 2304 | 1988 | 2831 | 1049 | 437 | 3 | 3 |
Outlier Analysis
https://blog.csdn.net/weixin_42056745/article/details/90516835
https://blog.csdn.net/yuxeaotao/article/details/79876377
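The box plot shows that Balance spans a wide range and has many outliers. Most of them form a continuous run, but the maximum lies far beyond all other values, so that row is removed to reduce the risk of overfitting.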
sns.boxplot(x='Balance',data=df,palette='hls');
df.Balance.max()
98417
df[df['Balance'] == 98417]
Id | Age | Job | Marital | Education | Default | Balance | HHInsurance | CarLoan | Communication | LastContactDay | LastContactMonth | NoOfContacts | DaysPassed | PrevAttempts | Outcome | CallStart | CallEnd | CarInsurance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1743 | 59 | management | married | tertiary | 0 | 98417 | 0 | 0 | telephone | 20 | nov | 5 | -1 | 0 | NaN | 10:51:42 | 10:54:07 | 0 |
df.index[1742]
1743
# Drop the row holding the outlier (Id 1743 sits at position 1742 of the index)
df_new = df.drop(df.index[1742])
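As a rough numeric cross-check of the box plot, the sketch below applies the common 1.5 × IQR rule to a handful of Balance values: the five from `df.head()` above plus the maximum. The `iqr_outliers` helper, the sample list, and the 1.5 multiplier are illustrative assumptions, not part of the original notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Balance values from the first five rows shown above, plus the maximum (98417)
sample_balances = [1218, 1156, 637, 373, 2694, 98417]
outliers = iqr_outliers(sample_balances)
print(outliers)
```

With the quartiles reported by `df.describe()` (111 and 1619), the upper fence on the full column would be 1619 + 1.5 × 1508 = 3881, which is consistent with the many outliers visible in the box plot. Dropping only the maximum is therefore a judgment call rather than a strict application of this rule.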
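Handling Missing Values
Missing values are a major problem in data analysis, and dealing with them is another hurdle. pandas represents missing data as NaN and leaves it out of calculations and plots. A predictive model likewise cannot be built without handling missing values first. In our case, missing values occur mainly in the Outcome and Communication fields; Job and Education also have some.
Fields with very few missing values, such as Job and Education, can be imputed with the backfill / forward-fill (pad) methods in pandas. Outcome and Communication have many missing values, so their NaN values are filled with the string 'none'.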
fillna: https://blog.csdn.net/weixin_39549734/article/details/81221276
df_new.isnull().sum()
Age 0
Job 19
Marital 0
Education 169
Default 0
Balance 0
HHInsurance 0
CarLoan 0
Communication 902
LastContactDay 0
LastContactMonth 0
NoOfContacts 0
DaysPassed 0
PrevAttempts 0
Outcome 3041
CallStart 0
CallEnd 0
CarInsurance 0
dtype: int64
# Forward fill (pad): replace each missing value with the previous non-missing value
df_new['Job'] = df_new['Job'].ffill()
df_new['Education'] = df_new['Education'].ffill()
df_new['Communication'] = df_new['Communication'].fillna('none')
df_new['Outcome'] = df_new['Outcome'].fillna('none')
df_new['Outcome'].value_counts()
none 3041
failure 437
success 326
other 195
Name: Outcome, dtype: int64
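The missing values in Outcome were filled with 'none', and the count for 'none' is exactly 3041, matching the number of missing values reported earlier.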
df_new.isnull().sum()
Age 0
Job 0
Marital 0
Education 0
Default 0
Balance 0
HHInsurance 0
CarLoan 0
Communication 0
LastContactDay 0
LastContactMonth 0
NoOfContacts 0
DaysPassed 0
PrevAttempts 0
Outcome 0
CallStart 0
CallEnd 0
CarInsurance 0
dtype: int64
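To make the two imputation strategies concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data is hypothetical and not taken from the dataset). It assumes a pandas version that provides `Series.ffill()`. It also shows one caveat worth keeping in mind: forward fill cannot fill a value that has no predecessor.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
toy = pd.DataFrame({
    "Job": ["management", np.nan, "student", np.nan],
    "Outcome": [np.nan, "failure", np.nan, "success"],
})

# Few gaps: carry the previous observed value forward
toy["Job"] = toy["Job"].ffill()
# Many gaps: treat "missing" as its own category
toy["Outcome"] = toy["Outcome"].fillna("none")

jobs = toy["Job"].tolist()
outcomes = toy["Outcome"].tolist()

# A missing value in the first row has nothing before it, so it stays NaN
leading = pd.Series([np.nan, "a"]).ffill()
print(jobs, outcomes, leading.isna().tolist())
```

In the notebook the final `isnull().sum()` is all zeros, so the first rows of Job and Education evidently were not missing.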
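Correlation
Correlation measures the relationship between two variables/fields. It ranges from -1 to 1: a value of 1 means a perfect positive correlation, 0 means no linear correlation, and -1 means a perfect negative correlation. Let's use a heatmap to see how the attributes relate to one another. There does not seem to be much correlation between the variables, although DaysPassed and PrevAttempts are positively correlated.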
sns.set(style="white")
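# Correlate only the numeric columns (recent pandas versions raise on object columns otherwise)
corr = df_new.corr(numeric_only=True)
# Boolean mask for the upper triangle (np.bool was removed from NumPy; use the builtin bool)
mask = np.zeros_like(corr, dtype=bool)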
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr,annot=True, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
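Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The sketch below computes it by hand for two lists; the `pearson` helper and the toy inputs are illustrative assumptions, written only to show where the -1, 0, and 1 extremes come from.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # y rises with x
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # y falls as x rises
r_none = pearson([1, 2, 3, 4], [1, -1, -1, 1])      # no linear trend
print(r_positive, r_negative, r_none)
```

The third case has a clear U-shaped pattern yet a coefficient of zero, which is why a near-zero cell in the heatmap rules out only a linear relationship.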
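Visualization
Visualization is an important part of data science; without it, results are hard to grasp. Although the results in a table are definitive, going through the details to reach a conclusion is a pain point. Charts and graphs make these tasks much easier for non-technical people. Executives and managers like to see reports in visual form so that they can make complex decisions more easily. Below is a pairplot, which pairs the fields of interest and plots them against each other. The variables for the pairplot were chosen from the heatmap as those that influence the outcome.
**Key takeaways from the pairplot**
* People aged 30-60 are more likely to buy car insurance [plot (1,1)].
* People who have a car loan or household insurance are somewhat less likely to buy [the bimodal plots at (3,3) and (4,4)].
* As DaysPassed (the time since they were last contacted) increases, people respond more positively [plot (7,6)].
* When people are contacted often, their tendency to buy drops sharply after more than 20 contacts [bottom row of plots].
* The more contacts made to a client during this campaign, the better the result, that is, the more car insurance is bought [plot (7,5)].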
df_sub = ['Age','Balance','HHInsurance', 'CarLoan','NoOfContacts','DaysPassed','PrevAttempts','CarInsurance'] # all of these are numeric variables
sns.pairplot(df_new[df_sub],hue='CarInsurance',height=1.5); # df_sub includes the target variable CarInsurance; newer seaborn uses height instead of size
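PairGrid lets us look at how CarInsurance and Balance relate to categorical variables such as Education, Marital, and Job. Students and retirees buy car insurance most often [plot (1,3)], and single people and those with tertiary education are also more inclined to buy [plots (1,1), (1,2)]. The lower row of plots shows which groups have higher average yearly balances. CarInsurance takes values in [0, 1], so its bar height is the proportion of clients who bought, while the Balance bars show the group mean.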
g = sns.PairGrid(df_new,
x_vars=["Education","Marital", "Job"],
y_vars=["CarInsurance", "Balance"],
aspect=.75, height=6)
plt.xticks(rotation=90)
g.map(sns.barplot, palette="pastel");
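Where the violin plot bulges near 1 on the y-axis, purchases are common: March, September, October, and December stand out as the months in which people are most likely to buy car insurance.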
sns.violinplot(x="LastContactMonth",y='CarInsurance',data=df_new);
sns.countplot(x="Outcome",hue='CarInsurance',data=df_new);
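Feature Engineering
Feature engineering is an essential part of any machine learning problem. Our data contains continuous variables, such as Age and Balance, that need to be binned. The quantile-based cut function (pd.qcut) is used to split the continuous Age and Balance variables into 5 bins each.
pd.qcut: https://blog.csdn.net/starter_____/article/details/79327997
# qcut splits each column into 5 bins of roughly equal frequency, labeled 0, 1, 2, 3, 4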
df_new['AgeBinned'] = pd.qcut(df_new['Age'], 5 , labels = False)
df_new['BalanceBinned'] = pd.qcut(df_new['Balance'], 5,labels = False)
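To see what "equal frequency" means in practice, the sketch below bins ten made-up ages (hypothetical values, chosen to include one extreme) with `pd.qcut` and, for contrast, with the equal-width `pd.cut`. The comparison with `pd.cut` is an addition for illustration; the notebook itself only uses `pd.qcut`.

```python
import pandas as pd

# Hypothetical ages, including one extreme value
ages = pd.Series([18, 25, 29, 30, 32, 39, 41, 49, 59, 95])

# Quantile bins: edges are placed so each bin holds about the same number of rows
equal_freq = pd.qcut(ages, 5, labels=False).tolist()

# Equal-width bins: the range 18..95 is cut into 5 intervals of the same length
equal_width = pd.cut(ages, 5, labels=False).tolist()

print(equal_freq)
print(equal_width)
```

With quantile bins every bin receives two of the ten ages, whereas equal-width bins crowd the younger ages into the first interval and leave the extreme value alone in the last one. That robustness to skew is one plausible reason to prefer `pd.qcut` for a column like Balance.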
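CallStart and CallEnd pose a peculiar problem: they are stored as object (string) columns. Converting them to datetime and subtracting one from the other gives the actual call duration, CallTime, which can then be binned as above.
# Convert CallStart and CallEnd to datetime (the explicit format avoids ambiguous parsing)
df_new['CallStart'] = pd.to_datetime(df_new['CallStart'], format='%H:%M:%S')
df_new['CallEnd'] = pd.to_datetime(df_new['CallEnd'], format='%H:%M:%S')
# End time minus start time gives the actual call duration in seconds
df_new['CallTime'] = (df_new['CallEnd'] - df_new['CallStart']).dt.total_seconds()
# Bin the call duration into 5 quantile groups
df_new['CallTimeBinned'] = pd.qcut(df_new['CallTime'], 5,labels = False)
# Drop the original columns that have been replaced by binned versions, to keep the frame tidy
df_new.drop(['Age','Balance','CallStart','CallEnd','CallTime'],axis = 1,inplace = True)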
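The duration arithmetic can be checked on single rows with the standard library alone. The sketch below uses the CallStart and CallEnd values of Id 1 and Id 4 from the `df.head()` output; the `call_seconds` helper is a hypothetical name, and the sketch assumes a call never runs past midnight.

```python
from datetime import datetime

def call_seconds(start, end, fmt="%H:%M:%S"):
    """Duration in seconds between two clock times on the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

first_call = call_seconds("13:45:20", "13:46:30")   # Id 1 in df.head()
fourth_call = call_seconds("12:06:43", "12:20:22")  # Id 4 in df.head()
print(first_call, fourth_call)
```

In those five rows, the one long call (Id 4, over 13 minutes) ended in a purchase, while the roughly one-minute call (Id 1) did not, which fits with CallTime ranking as the most important feature later on.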
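Categorical variables can also take part in model building, provided they are first converted to dummy (indicator) variables. This step adds more columns to the data frame.
get_dummies usage: https://blog.csdn.net/maymay_/article/details/80198468
# Use get_dummies to create one binary column for each value of a categorical column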
Job = pd.get_dummies(data = df_new['Job'],prefix = "Job")
Marital= pd.get_dummies(data = df_new['Marital'],prefix = "Marital")
Education= pd.get_dummies(data = df_new['Education'],prefix="Education")
Communication = pd.get_dummies(data = df_new['Communication'],prefix = "Communication")
LastContactMonth = pd.get_dummies(data = df_new['LastContactMonth'],prefix= "LastContactMonth")
Outcome = pd.get_dummies(data = df_new['Outcome'],prefix = "Outcome")
# Drop the categorical columns that now have dummy variables
df_new.drop(['Job','Marital','Education','Communication','LastContactMonth','Outcome'],axis=1,inplace=True)
# Concatenate all the columns needed for modeling
df = pd.concat([df_new,Job,Marital,Education,Communication,LastContactMonth,Outcome],axis=1)
df.columns
Index(['Default', 'HHInsurance', 'CarLoan', 'LastContactDay', 'NoOfContacts',
'DaysPassed', 'PrevAttempts', 'CarInsurance', 'AgeBinned',
'BalanceBinned', 'CallTimeBinned', 'Job_admin.', 'Job_blue-collar',
'Job_entrepreneur', 'Job_housemaid', 'Job_management', 'Job_retired',
'Job_self-employed', 'Job_services', 'Job_student', 'Job_technician',
'Job_unemployed', 'Marital_divorced', 'Marital_married',
'Marital_single', 'Education_primary', 'Education_secondary',
'Education_tertiary', 'Communication_cellular', 'Communication_none',
'Communication_telephone', 'LastContactMonth_apr',
'LastContactMonth_aug', 'LastContactMonth_dec', 'LastContactMonth_feb',
'LastContactMonth_jan', 'LastContactMonth_jul', 'LastContactMonth_jun',
'LastContactMonth_mar', 'LastContactMonth_may', 'LastContactMonth_nov',
'LastContactMonth_oct', 'LastContactMonth_sep', 'Outcome_failure',
'Outcome_none', 'Outcome_other', 'Outcome_success'],
dtype='object')
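A small sketch of what `pd.get_dummies` produces, using four made-up Marital values (the `toy` frame is hypothetical). The `drop_first=True` variant at the end is not used in the notebook; it is shown only as a common option when a linear model should not receive perfectly redundant columns.

```python
import pandas as pd

# Hypothetical toy column with the three Marital categories
toy = pd.DataFrame({"Marital": ["single", "married", "divorced", "married"]})

# One indicator column per category, named <prefix>_<value>
dummies = pd.get_dummies(toy["Marital"], prefix="Marital")
columns = dummies.columns.tolist()
row_sums = dummies.sum(axis=1).tolist()   # exactly one indicator is set per row

# Optional variant: drop the first category and let it act as the baseline
reduced = pd.get_dummies(toy["Marital"], prefix="Marital", drop_first=True)
print(columns, row_sums, reduced.columns.tolist())
```

Because each row has exactly one indicator set, the full set of dummy columns for a variable always sums to 1. Tree-based models are indifferent to that redundancy, while it can matter for an unregularized linear model.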
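A model is usually evaluated by training it on data with known outputs (labeled data) so that it can learn from them, and then testing it on held-out data it has not seen, whose labels are used only to measure prediction accuracy. Hence the train/test split.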
train_test_split:https://blog.csdn.net/Lynn_mg/article/details/83062630
X= df.drop(['CarInsurance'],axis=1).values
y=df['CarInsurance'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=42, stratify = y) # stratify=y keeps the class proportions of y in both splits
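To illustrate what `stratify=y` buys, here is a hand-rolled stratified split in plain Python. It is a sketch of the idea only, not scikit-learn's implementation; the `stratified_split` helper and the 40/60 toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=42):
    """Split indices so each class contributes the same share to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Hypothetical labels with 40% positives, close to the dataset's 0.401 mean
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels, 0.20)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_pos_rate)
```

The same effect is visible in the notebook's classification reports: the 800-row test set contains 321 positives, about 40%, in line with the CarInsurance mean of 0.401 from `df.describe()`.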
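Building and Validating Predictive Models
**Predictive models** scikit-learn bundles many classification algorithms, and we use most of the ones relevant to this problem. Our classifiers are:
1. kNN
2. Logistic Regression
3. SVM
4. Decision Tree
5. Random Forest
6. AdaBoost
7. XGBoost (implemented in the code below with scikit-learn's GradientBoostingClassifier, not the xgboost library)

**Cross-validation** Cross-validation repeatedly splits the data into training and validation parts to evaluate a model's performance. In KFold, K is the number of folds the data is divided into; in each round one fold is held out for validation and the remaining K-1 folds (here 10-1 = 9) are used for training. Each model's cross-validation score is the mean over these 10 folds.
The best models are **Random Forest and XGBoost (gradient boosting)**, both of which achieve good accuracy scores.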
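The fold bookkeeping can be sketched in a few lines of plain Python. The `kfold_indices` generator below is a hypothetical helper that mimics plain, unshuffled K-fold splitting. Note that when `cross_val_score` is given an integer `cv` and a classifier, scikit-learn uses stratified folds, which this sketch does not reproduce.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k consecutive folds, without shuffling."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(kfold_indices(10, 5))
for train, val in folds:
    print("train:", train, "validate:", val)
```

Every sample lands in the validation fold exactly once, so averaging the per-fold scores uses all of the data for evaluation without ever scoring a model on rows it was trained on.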
# The confusion-matrix plotting code below is adapted from the scikit-learn documentation
# Define a helper that plots a confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # Print each count in its cell, switching the text colour for contrast
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# confusion_matrix orders the classes as 0, 1, so 'Failure' (no purchase) comes first
class_names = ['Failure','Success']
knn = KNeighborsClassifier(n_neighbors = 6)
knn.fit(X_train,y_train)
print ("kNN Accuracy is %2.2f" % accuracy_score(y_test, knn.predict(X_test)))
# 10-fold cross-validation
score_knn = cross_val_score(knn, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_knn)
y_pred= knn.predict(X_test)
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test,y_pred)
# Plot the confusion matrix
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
kNN Accuracy is 0.76
Cross Validation Score = 0.75
precision recall f1-score support
0 0.75 0.90 0.82 479
1 0.78 0.55 0.65 321
accuracy 0.76 800
macro avg 0.76 0.72 0.73 800
weighted avg 0.76 0.76 0.75 800
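The precision, recall, and F1 columns of the report follow directly from the four confusion-matrix counts. The sketch below uses made-up counts (not the notebook's actual matrix, which only appears as a plot) and a hypothetical `classification_metrics` helper.

```python
def classification_metrics(tn, fp, fn, tp):
    """Metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of the predicted buyers, how many bought
    recall = tp / (tp + fn)             # of the actual buyers, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for 100 test customers
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
print(metrics)
```

Read this way, the kNN report above says that the model's positive predictions are fairly reliable (precision 0.78 for class 1) but that it finds only about half of the actual buyers (recall 0.55), which matters if the goal is to avoid missing likely customers.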
#Logistic Regression Classifier
LR = LogisticRegression()
LR.fit(X_train,y_train)
print ("Logistic Accuracy is %2.2f" % accuracy_score(y_test, LR.predict(X_test)))
score_LR = cross_val_score(LR, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_LR)
y_pred = LR.predict(X_test)
print(classification_report(y_test, y_pred))
# Confusion matrix for LR
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
Logistic Accuracy is 0.83
Cross Validation Score = 0.81
precision recall f1-score support
0 0.85 0.87 0.86 479
1 0.80 0.78 0.79 321
accuracy 0.83 800
macro avg 0.82 0.82 0.82 800
weighted avg 0.83 0.83 0.83 800
SVM = svm.SVC()
SVM.fit(X_train, y_train)
print ("SVM Accuracy is %2.2f" % accuracy_score(y_test, SVM.predict(X_test)))
score_svm = cross_val_score(SVM, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_svm)
y_pred = SVM.predict(X_test)
print(classification_report(y_test,y_pred))
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
SVM Accuracy is 0.67
Cross Validation Score = 0.66
precision recall f1-score support
0 0.66 0.91 0.77 479
1 0.70 0.31 0.43 321
accuracy 0.67 800
macro avg 0.68 0.61 0.60 800
weighted avg 0.68 0.67 0.63 800
# Decision Tree Classifier
DT = tree.DecisionTreeClassifier(random_state = 0,class_weight="balanced",
min_weight_fraction_leaf=0.01)
DT = DT.fit(X_train,y_train)
print ("Decision Tree Accuracy is %2.2f" % accuracy_score(y_test, DT.predict(X_test)))
score_DT = cross_val_score(DT, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_DT)
y_pred = DT.predict(X_test)
print(classification_report(y_test, y_pred))
# Confusion Matrix for Decision Tree
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
Decision Tree Accuracy is 0.82
Cross Validation Score = 0.81
precision recall f1-score support
0 0.88 0.81 0.84 479
1 0.74 0.83 0.79 321
accuracy 0.82 800
macro avg 0.81 0.82 0.81 800
weighted avg 0.82 0.82 0.82 800
#Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=10,class_weight="balanced")
rfc.fit(X_train, y_train)
print ("Random Forest Accuracy is %2.2f" % accuracy_score(y_test, rfc.predict(X_test)))
score_rfc = cross_val_score(rfc, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_rfc)
y_pred = rfc.predict(X_test)
print(classification_report(y_test,y_pred ))
#Confusion Matrix for Random Forest
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
Random Forest Accuracy is 0.86
Cross Validation Score = 0.84
precision recall f1-score support
0 0.90 0.86 0.88 479
1 0.80 0.86 0.83 321
accuracy 0.86 800
macro avg 0.85 0.86 0.85 800
weighted avg 0.86 0.86 0.86 800
#AdaBoost Classifier
ada = AdaBoostClassifier(n_estimators=400, learning_rate=0.1)
ada.fit(X_train,y_train)
print ("AdaBoost Accuracy= %2.2f" % accuracy_score(y_test,ada.predict(X_test)))
score_ada = cross_val_score(ada, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_ada)
y_pred = ada.predict(X_test)
print(classification_report(y_test,y_pred ))
# Confusion matrix for AdaBoost
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
AdaBoost Accuracy= 0.83
Cross Validation Score = 0.82
precision recall f1-score support
0 0.83 0.90 0.86 479
1 0.82 0.73 0.77 321
accuracy 0.83 800
macro avg 0.83 0.81 0.82 800
weighted avg 0.83 0.83 0.83 800
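# Gradient boosting classifier (the source labels it XGBoost, but this is scikit-learn's GradientBoostingClassifier)
xgb = GradientBoostingClassifier(n_estimators=1000,learning_rate=0.01)
xgb.fit(X_train,y_train)
print ("GradientBoost Accuracy= %2.2f" % accuracy_score(y_test,xgb.predict(X_test)))
score_xgb = cross_val_score(xgb, X, y, cv=10).mean()
# Print this model's own score; the recorded output below came from a run that printed score_ada here
print("Cross Validation Score = %2.2f" % score_xgb)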
y_pred = xgb.predict(X_test)
print(classification_report(y_test,y_pred))
# Confusion matrix for the gradient boosting classifier
cm_xg = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm_xg, classes=class_names, title='Confusion matrix')
GradientBoost Accuracy= 0.85
Cross Validation Score = 0.82
precision recall f1-score support
0 0.87 0.89 0.88 479
1 0.82 0.79 0.81 321
accuracy 0.85 800
macro avg 0.84 0.84 0.84 800
weighted avg 0.85 0.85 0.85 800
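ROC Curve
ROC curves are plotted for all models except the SVM, whose lines are commented out in the code. The curves for Gradient Boosting (XGBoost) and Random Forest lie closest to the top-left corner, indicating that these are the best predictors.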
ROC: https://blog.csdn.net/kMD8d5R/article/details/98552574?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-4.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-4.nonecase
#Obtaining False Positive Rate, True Positive Rate and Threshold for all classifiers
fpr, tpr, thresholds = roc_curve(y_test, knn.predict_proba(X_test)[:,1])
LR_fpr, LR_tpr, thresholds = roc_curve(y_test, LR.predict_proba(X_test)[:,1])
#SVM_fpr, SVM_tpr, thresholds = roc_curve(y_test, SVM.predict_proba(X_test)[:,1])
DT_fpr, DT_tpr, thresholds = roc_curve(y_test, DT.predict_proba(X_test)[:,1])
rfc_fpr, rfc_tpr, thresholds = roc_curve(y_test, rfc.predict_proba(X_test)[:,1])
ada_fpr, ada_tpr, thresholds = roc_curve(y_test, ada.predict_proba(X_test)[:,1])
xgb_fpr, xgb_tpr, thresholds = roc_curve(y_test, xgb.predict_proba(X_test)[:,1])
# Plotting ROC curves for all classifiers
plt.plot(fpr, tpr, label='KNN' )
plt.plot(LR_fpr, LR_tpr, label='Logistic Regression')
#plt.plot(SVM_fpr, SVM_tpr, label='SVM')
plt.plot(DT_fpr, DT_tpr, label='Decision Tree')
plt.plot(rfc_fpr, rfc_tpr, label='Random Forest')
plt.plot(ada_fpr, ada_tpr, label='AdaBoost')
plt.plot(xgb_fpr, xgb_tpr, label='GradientBoosting')
# Plot Base Rate ROC
plt.plot([0,1],[0,1],label='Base Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
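What `roc_curve` computes can be sketched by sweeping a threshold down through the predicted probabilities and recording the false and true positive rates at each step. The `roc_points` and `auc` helpers and the six toy scores below are illustrative assumptions; the sketch assumes all scores are distinct and is not scikit-learn's implementation.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points obtained by lowering the threshold one score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1     # a buyer is now above the threshold
        else:
            fp += 1     # a non-buyer is now above the threshold
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities for six customers
y_toy = [1, 1, 0, 1, 0, 0]
p_toy = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_toy, p_toy)
area = auc(points)
print(points, area)
```

A curve that hugs the top-left corner has an area close to 1, while the diagonal "Base Rate" line has an area of 0.5. Reporting the area for each model would put a number on the visual ranking in the plot.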
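Feature Importance
Important features are identified using models such as logistic regression and decision trees. Both give a clear picture of which features matter. The chart below shows the most important variables as determined by ExtraTreesClassifier (in this notebook CallTime, Balance, and Age enter the model as their binned versions); the top 10 are:
- CallTime
- LastContactDay
- Balance
- NoOfContacts
- Outcome_success
- Age
- HHInsurance
- Communication_none
- DaysPassed
- Outcome_none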
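# Recursive feature elimination (RFE) wrapped around a logistic regression model, keeping 5 features
modell = LogisticRegression()
rfe = RFE(modell, n_features_to_select=5)
rfe = rfe.fit(X_train,y_train)
# Show the feature ranking (1 = selected)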
rfe.ranking_
array([10, 9, 16, 41, 34, 42, 32, 36, 33, 3, 24, 15, 17, 27, 28, 20, 22,
29, 2, 40, 26, 38, 21, 37, 31, 30, 25, 19, 1, 18, 39, 7, 11, 35,
4, 5, 23, 1, 6, 8, 1, 1, 13, 12, 14, 1])
# Fit an ExtraTreesClassifier to obtain feature importances
model = ExtraTreesClassifier()
model.fit(X_train, y_train)
print(model.feature_importances_)
importances = model.feature_importances_
feat_names = df.drop(['CarInsurance'],axis=1).columns
# Sort the features by importance, highest first, and show them as a chart
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12,6))
plt.title("Feature importances")
plt.bar(range(len(indices)), importances[indices], color='lightblue', align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()
[0.00268756 0.0313893 0.01658237 0.06520325 0.04730283 0.01636741
0.01347456 0.04461972 0.04937946 0.25765298 0.01216092 0.01304826
0.00579582 0.00493788 0.01324779 0.0097079 0.00640912 0.0091661
0.00601809 0.01457932 0.00655431 0.01135298 0.01605904 0.01331409
0.00989652 0.01628276 0.01403667 0.01703035 0.02387476 0.00606056
0.01792573 0.01598123 0.00329203 0.0090054 0.00783392 0.01389945
0.01580459 0.01121934 0.01654914 0.00965812 0.01079382 0.00951623
0.00932224 0.02123694 0.00591377 0.04785536]
rfc = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=10,class_weight="balanced")
y_proba = cross_val_predict(rfc, X, y, cv=10, n_jobs=-1, method='predict_proba')
results = pd.DataFrame({'y': y, 'y_proba': y_proba[:,1]})
results = results.sort_values(by='y_proba', ascending=False).reset_index(drop=True)
results.index = results.index + 1
results.index = results.index / len(results.index) * 100
sns.set_style('darkgrid')
pred = results
pred['Lift Curve'] = pred.y.cumsum() / pred.y.sum() * 100
pred['Baseline'] = pred.index
base_rate = y.sum() / len(y) * 100
pred[['Lift Curve', 'Baseline']].plot(style=['-', '--', '--'])
pd.Series(data=[0, 100, 100], index=[0, base_rate, 100]).plot(style='--')
plt.title('Cumulative Gains')
plt.xlabel('% of Customers Contacted')
plt.ylabel("% of Positive Results")
plt.legend(['Lift Curve', 'Baseline', 'Ideal']);
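Recommendations
**1. Train call-center employees in soft skills so that they become friendlier and more engaging on calls**
**2. Maintain a tracker that reminds representatives of follow-ups, so that they can talk to the person again and try to persuade them to buy car insurance**
**3. Choose people with a good credit score and account balance, so that the time spent on them is worthwhile**
**4. Focus on older people, aged 40 and above, since according to the previous data they are more receptive to new schemes**
**5. Respond to contacts from the previous campaign, as they are more likely to buy the insurance**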