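Data Mining in Practice: Identifying Tax Evasion in the Car Sales Industry

This case study comes from the book 《Python數據分析與挖掘實戰》 (Python Data Analysis and Mining in Practice).

The dataset can be downloaded from Tianchi.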

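Background

Problem

Corporate tax evasion is widespread and undermines the country's economic foundation.
In the car sales industry it takes forms such as under-issuing invoices, under-reporting revenue, and delaying the recognition of warranty claim payments.

Goal

Build a model from the business indicators of taxpayers in the car sales industry to identify enterprises that evade taxes.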

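Data analysis

Available data

Processing workflow

Similar to: Data Mining in Practice: Automatic Identification of Electricity Theft and Leakage Users

Preparation

Dataset download: python_data_analysis_and_mining_action

Coding platform: Google Colab

Upload the data to Google Colab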

from google.colab import files
files.upload()
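
In Colab, `files.upload()` returns a dict that maps each uploaded file name to its contents as bytes, so the data can also be read straight from that dict instead of from disk. A minimal sketch, assuming that return shape; `read_uploaded_csv`, the file name `demo.csv`, and its contents are hypothetical stand-ins:

```python
import io

import pandas as pd


def read_uploaded_csv(uploaded, name):
    """Build a DataFrame from the bytes of one uploaded file."""
    return pd.read_csv(io.BytesIO(uploaded[name]))


# Stand-in for the dict returned by files.upload() (made-up file and content)
fake_upload = {"demo.csv": b"id,value\n1,0.5\n2,0.7\n"}
frame = read_uploaded_csv(fake_upload, "demo.csv")
print(frame.shape)
```

In a real session the dict would come from `uploaded = files.upload()`.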

Data preprocessing

Read the data

import pandas as pd
data = pd.read_csv("汽車銷售行業納稅人偷漏稅數據.csv")
data = data.drop(columns='納稅人編號')  # drop the taxpayer ID column: it is an identifier, not a feature

Data summary

data.info()
data.describe()

The output is shown below; there are no missing values:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124 entries, 0 to 123
Data columns (total 15 columns):
銷售類型             124 non-null object
銷售模式             124 non-null object
汽車銷售平均毛利         124 non-null float64
維修毛利             124 non-null float64
企業維修收入佔銷售收入比重    124 non-null float64
增值稅稅負            124 non-null float64
存貨週轉率            124 non-null float64
成本費用利潤率          124 non-null float64
整體理論稅負           124 non-null float64
整體稅負控制數          124 non-null float64
辦牌率              124 non-null float64
單臺辦牌手續費收入        124 non-null float64
代辦保險率            124 non-null float64
保費返還率            124 non-null float64
輸出               124 non-null object
dtypes: float64(12), object(3)
memory usage: 14.7+ KB
汽車銷售平均毛利	維修毛利	企業維修收入佔銷售收入比重	增值稅稅負	存貨週轉率	成本費用利潤率	整體理論稅負	整體稅負控制數	辦牌率	單臺辦牌手續費收入	代辦保險率	保費返還率
count	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000
mean	0.023709	0.154894	0.068717	0.008287	11.036540	0.174839	0.010435	0.006961	0.146077	0.016387	0.169976	0.039165
std	0.103790	0.414387	0.158254	0.013389	12.984948	1.121757	0.032753	0.008926	0.236064	0.032510	0.336220	0.065910
min	-1.064600	-3.125500	0.000000	0.000000	0.000000	-1.000000	-0.181000	-0.007000	0.000000	0.000000	0.000000	-0.014800
25%	0.003150	0.000000	0.000000	0.000475	2.459350	-0.004075	0.000725	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.025100	0.156700	0.025950	0.004800	8.421250	0.000500	0.009100	0.006000	0.000000	0.000000	0.000000	0.000000
75%	0.049425	0.398925	0.079550	0.008800	15.199725	0.009425	0.015925	0.011425	0.272325	0.020000	0.138500	0.081350
max	0.177400	1.000000	1.000000	0.077000	96.746100	9.827200	0.159300	0.057000	0.877500	0.200000	1.529700	0.270000
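
The absence of missing values can also be checked directly rather than read off the `info()` output. A minimal sketch on a small made-up DataFrame; the helper `missing_report` and the columns `a` and `b` are illustrative and not part of the dataset:

```python
import pandas as pd


def missing_report(df):
    """Return the count of missing values per column, keeping only columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", "z"]})
report = missing_report(toy)
print(report)
```

An empty result from `missing_report(data)` would confirm what `info()` shows.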

Convert categorical attributes to numeric

data[u'輸出'] = data[u'輸出'].map({u'正常': 0, u'異常': 1})
data[u'銷售類型'] = data[u'銷售類型'].map({u'國產轎車': 1, u'進口轎車': 2, u'大客車': 3,
                                           u'卡車及輕卡': 4, u'微型麪包車': 5, u'商用貨車': 6,
                                           u'工程車': 7, u'其它': 8})
data[u'銷售模式'] = data[u'銷售模式'].map({u'4S店': 1, u'一級代理商': 2, u'二級及二級以下代理商': 3,
                                           u'多品牌經營店': 4, u'其它': 5})
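
`Series.map` with a dict turns any value that is missing from the dict into NaN without raising, so a typo or an unexpected category can go unnoticed. A hedged sketch of a stricter variant; `map_strict` is a hypothetical helper and the toy labels are placeholders:

```python
import pandas as pd


def map_strict(series, mapping):
    """Map categories to codes, raising if any value has no entry in the mapping."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped categories: {sorted(unknown)}")
    return series.map(mapping)


labels = pd.Series(["normal", "abnormal", "normal"])
codes = map_strict(labels, {"normal": 0, "abnormal": 1})
print(codes.tolist())
```

The same helper could wrap each of the three mappings above.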

Model training

Split into training and test sets

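import numpy as np

data = data.to_numpy()  # as_matrix() was removed in pandas 1.0
# Shuffle the rows in place. random.shuffle swaps through array views and can
# duplicate rows of a 2-D NumPy array, so NumPy's own shuffle is used instead.
np.random.shuffle(data)
# Split into training and test data at a ratio of 8:2
p = 0.8
train = data[:int(len(data) * p), :]
test = data[int(len(data) * p):, :]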
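
The split above produces a different partition on every run. If reproducible results are wanted, one option is to index the rows with a seeded permutation. A sketch under that assumption; `split_rows` and the seed value are illustrative choices:

```python
import numpy as np


def split_rows(array, train_fraction=0.8, seed=0):
    """Shuffle row indices with a seeded generator and split into train/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(array))
    cut = int(len(array) * train_fraction)
    return array[order[:cut]], array[order[cut:]]


# Stand-in for the 124 x 15 data matrix
demo = np.arange(124 * 15, dtype=float).reshape(124, 15)
train_demo, test_demo = split_rows(demo)
print(train_demo.shape, test_demo.shape)
```

Indexing with a permutation copies the rows, so every row ends up in exactly one of the two parts.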

Confusion matrix visualization function

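import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve


def cm_plot(y, yp):
    """Plot the confusion matrix of true labels y against predicted labels yp."""
    cm = confusion_matrix(y, yp)

    plt.matshow(cm, cmap=plt.cm.Greens)
    plt.colorbar()

    # cm[i, j] counts samples with true label i and predicted label j.
    # matshow draws rows along the vertical axis, so that cell sits at xy=(j, i).
    for i in range(len(cm)):
        for j in range(len(cm)):
            plt.annotate(
                cm[i, j],
                xy=(j, i),
                horizontalalignment='center',
                verticalalignment='center')

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt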
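
The confusion matrix can also be summarized numerically. A small plain-Python sketch, assuming labels are 0 (normal) and 1 (abnormal) as in the mapping earlier; `binary_metrics` is a hypothetical helper and the label lists are made up:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

For tax evasion screening, recall on the abnormal class is usually the number to watch, since a missed evader is a false negative.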

Build the LM neural network

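from keras.layers import Activation, Dense
from keras.models import Sequential

# Build the neural network model (called "LM" here, although it is trained with the Adam optimizer)
netfile = 'net.model'  # recent Keras versions require a weights file name ending in .weights.h5

net = Sequential()  # create the network
# Connect the input layer (14 nodes) to the hidden layer (10 nodes)
net.add(Dense(10, input_shape=(14, )))
net.add(Activation('relu'))  # the hidden layer uses the ReLU activation
# Connect the hidden layer (10 nodes) to the output layer (1 node)
net.add(Dense(1))
net.add(Activation('sigmoid'))  # the output layer uses the sigmoid activation
net.compile(loss='binary_crossentropy', optimizer='adam')  # compile the model, optimizing with Adam
net.fit(train[:, :14], train[:, 14], epochs=100, batch_size=1)
net.save_weights(netfile)

# Note: Keras predict returns probabilities, not class labels, and its output is an
# n x 1 array rather than the usual 1 x n. Older versions offered predict_classes for
# labels; it has since been removed, so the probabilities are thresholded at 0.5 here.
predict_result = (net.predict(train[:, :14]) > 0.5).astype(int).reshape(len(train))
cm_plot(train[:, 14], predict_result).show()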

predict_result = net.predict(test[:, :14]).reshape(len(test))
fpr, tpr, thresholds = roc_curve(test[:, 14], predict_result, pos_label=1)
plt.plot(fpr, tpr, linewidth=2, label='ROC of LM')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()
print(thresholds)
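
The ROC curve is often summarized by the area under it (AUC). scikit-learn provides `sklearn.metrics.auc(fpr, tpr)` for this. The sketch below shows the underlying trapezoidal rule in plain Python, using made-up points in place of the real `fpr` and `tpr` arrays:

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given by points sorted by increasing fpr (trapezoidal rule)."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical ROC points, not results from the model above
demo_fpr = [0.0, 0.0, 0.5, 1.0]
demo_tpr = [0.0, 0.5, 1.0, 1.0]
auc_value = trapezoid_auc(demo_fpr, demo_tpr)
print(auc_value)
```

A value of 0.5 corresponds to the diagonal (random guessing) and 1.0 to a perfect ranking.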

List the name of each layer, then use a printed layer name to fetch that layer's weights and biases:

for layer in net.layers:
    print(layer.name)

# Pick a layer name from the output above, e.g. dense_1
w, b = net.get_layer('dense_1').get_weights()

The first layer has 14 inputs and 10 outputs. Its weight matrix:

array([[-0.33323944, -0.17473654,  0.87461054, -0.8462864 ,  0.25585997,
         0.04722638,  0.65266305, -0.08507098, -0.29267487, -0.25785267],
       [ 0.5202124 ,  0.03159818,  0.4351549 ,  0.81486964,  0.44043756,
        -0.26803496, -0.7636168 ,  0.21452232, -0.0635082 , -0.09977171],
       [ 0.4392051 , -0.42368022, -0.7336786 , -0.46694276, -0.6174017 ,
         0.16300084,  0.9625222 ,  0.0446047 ,  0.52611935,  0.19423229],
       [ 1.8234462 ,  0.06254685, -1.4471666 ,  0.26359963, -1.7079837 ,
         2.2589228 ,  1.7308791 , -0.16358732,  1.3756496 ,  1.4106685 ],
       [ 1.0737519 ,  0.18168104, -1.619209  ,  0.39016438, -1.6890421 ,
         1.3992369 ,  0.9619709 ,  1.0533664 ,  1.6621295 ,  1.8717662 ],
       [-0.92529964, -0.3892988 ,  0.4119622 ,  1.6253519 ,  0.5639266 ,
        -0.40880913, -0.7206734 ,  0.67802995, -0.12859246,  0.16787821],
       [ 0.22189063, -0.1612761 ,  0.18605754, -0.47543216,  0.26998442,
         0.0331102 , -0.02441032, -0.28508526,  0.35632288,  0.07961233],
       [-0.12109375, -0.18872818,  0.54819757,  0.94135314,  0.2317546 ,
        -0.753005  , -0.37716654,  0.51411617, -0.5405725 , -1.0390322 ],
       [ 2.3368533 , -0.04014749, -1.8256061 ,  0.95449895, -2.0859642 ,
         1.8795613 ,  0.61885285,  0.1487553 ,  1.8318983 ,  1.9700083 ],
       [ 1.92646   ,  0.06230724, -1.6802945 ,  0.40118444, -1.2818127 ,
         2.5196013 ,  1.8299726 , -0.08480016,  1.6552794 ,  1.3718019 ],
       [-0.44886026, -0.2739805 ,  0.0532513 , -0.4950459 ,  0.19649826,
         0.18537489, -0.74298495, -0.34665307, -0.28344223, -0.24201499],
       [-0.31021225, -0.11335158,  0.67558414, -0.24258912,  0.3580687 ,
         0.23020446,  0.11817209,  0.37445474,  0.21545577, -0.52446955],
       [ 0.5257165 , -0.02639258, -1.0446917 , -0.49409497, -1.0229766 ,
         1.2764192 , -0.3819442 , -0.37377504,  0.95819145,  0.41725355],
       [ 1.7603374 , -0.24498129, -1.9779907 ,  0.05246234, -1.8831204 ,
         2.1065228 ,  0.329442  , -0.21982898,  1.606011  ,  1.9765277 ]],
      dtype=float32)

Biases:

array([ 0.03276258, -0.0112296 , -0.05813374,  0.19073635, -0.03967857,
        0.49254265,  0.6294645 , -0.13973917,  0.13733548,  0.3180599 ],
      dtype=float32)

The second layer (10 inputs, 1 output) can be inspected the same way.
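
With the weights and biases of both layers in hand, the network's output can be reproduced by hand: a ReLU hidden layer followed by a sigmoid output, matching the activations used when the model was built. A sketch with random stand-in weights; `w1`, `b1`, `w2`, `b2` are placeholders for the arrays that `get_weights()` returns, not the trained values:

```python
import numpy as np


def forward(x, w1, b1, w2, b2):
    """Forward pass of a 14-10-1 network: ReLU hidden layer, sigmoid output."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid


rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(14, 10)), np.zeros(10)
w2, b2 = rng.normal(size=(10, 1)), np.zeros(1)
x = rng.normal(size=(5, 14))  # five made-up samples with 14 features each
probs = forward(x, w1, b1, w2, b2)
print(probs.shape)
```

Feeding the real weights and a few rows of `test[:, :14]` into `forward` should agree with `net.predict` up to floating-point precision.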

Build the CART decision tree

from sklearn.tree import DecisionTreeClassifier
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package

# Build the CART decision tree model
treefile = 'tree.pkl'
tree = DecisionTreeClassifier()
tree.fit(train[:, :14], train[:, 14])

joblib.dump(tree, treefile)

cm_plot(train[:, 14], tree.predict(train[:, :14])).show()  # show the confusion matrix plot
# Note that scikit-learn's predict method returns the predicted classes directly.
 
fpr, tpr, thresholds = roc_curve(test[:, 14], tree.predict_proba(test[:, :14])[:, 1], pos_label=1)
plt.plot(fpr, tpr, linewidth=2, label='ROC of CART', color='green')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# Set the axis limits
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()
print(thresholds)
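
CART picks splits that reduce node impurity, and scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default (`criterion='gini'`). A small plain-Python sketch of that quantity; the helper names and label lists are hypothetical:

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of the two children of a candidate split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


parent = [0, 0, 0, 1, 1, 1]
parent_impurity = gini(parent)
perfect_split = split_gini([0, 0, 0], [1, 1, 1])
mixed_split = split_gini([0, 0, 1], [0, 1, 1])
print(parent_impurity, perfect_split, mixed_split)
```

The tree prefers the candidate split with the lowest weighted impurity, so the perfect split wins over the mixed one here.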

Visualize the decision tree

import sklearn.tree
from IPython.display import Image  
import pydotplus 
dot_data = sklearn.tree.export_graphviz(tree, out_file=None, 
                         feature_names=None,  
                         class_names=None,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())