Customer Churn Prediction: Resource Roundup

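I once took part in a customer churn prediction competition held by my company. I was a machine learning novice at the time (and still am), and I collected and organized a lot of material for it. Here is a brief summary to share.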

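0. Problem overview

Problem: predict the probability that a customer renews an order within the next three months
Data sources: data across multiple dimensions, including Product, ProductFamily, Service, Service Family, Partner, Customer, Order, Order Line, Offer, Customer's Company, Usage, Revenue, and Bill
Evaluation metric: cross-entropy on the Score Set
Model choice: the final submission blended XGBoost and GBDT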
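
As a rough illustration of the metric and the blending step, the sketch below scores predicted probabilities with binary cross-entropy and blends two models by a weighted average. The post does not say how the two models were actually combined, so the averaging scheme, the weight, and all the numbers are assumptions made for illustration.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy (log loss); lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def blend(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two models' predicted probabilities."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(prob_a, prob_b)]

# Hypothetical renewal labels and predictions from two models.
y_true = [1, 0, 1, 1, 0]
xgb_prob = [0.9, 0.2, 0.6, 0.8, 0.4]
gbdt_prob = [0.7, 0.1, 0.8, 0.6, 0.2]

blended = blend(xgb_prob, gbdt_prob)
for name, probs in [("xgb", xgb_prob), ("gbdt", gbdt_prob), ("blend", blended)]:
    print(name, round(cross_entropy(y_true, probs), 4))
```

Because cross-entropy punishes confident wrong answers heavily, averaging two models tends to help when their errors are not strongly correlated.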

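1. Collecting and organizing resources

Because we had never done any machine learning work and were complete beginners, we started by collecting material to learn the approaches and methods used in similar competitions. From this we learned that a typical machine learning workflow looks like this:

Data analysis and preprocessing
Feature engineering (feature selection)
Model selection (classification or regression)
Training and tuning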

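We then studied each of these steps in more depth.

2. Learning materials

2.1 Approaches from similar competitions

Reading the blog posts below showed us the general workflow for this type of competition and gave us a first picture of how machine learning is used to solve a problem.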

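O2O Coupon Usage Prediction, second round, 3rd place: http://blog.csdn.net/bryan__/article/details/53907292
O2O Coupon Usage Prediction, second round, 1st place code: https://github.com/wepe/O2O-Coupon-Usage-Forecast
Summary of approaches across various competitions: http://blog.csdn.net/bryan__/article/details/51713596
[Tianchi Competition Series] Walkthrough of the Ali Mobile Recommendation Algorithm approach: http://blog.csdn.net/bryan__/article/details/47112993
Big data competition techniques: http://blog.csdn.net/bryan__/article/details/51745563
Predicting user churn with scikit-learn: http://blog.csdn.net/BaiHuaXiu123/article/details/62063415
Summary of techniques and tricks for Tianchi data mining competitions: https://blog.csdn.net/mr_tyting/article/details/73548245
Meituan churn prediction (a very useful reference): http://blog.csdn.net/shenxiaoming77/article/details/51543724
Key techniques in user churn analysis: http://blog.csdn.net/u013915133/article/details/78525133
Logistic regression in Python, with a data mining case study: http://blog.csdn.net/yawei_liu1688/article/details/78733428
A first look at a LogisticRegression user churn prediction model: http://blog.csdn.net/java1573/article/details/78830607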

2.2 Data processing and feature engineering

2.2.1 Data processing:

Data visualization: http://blog.csdn.net/mr_tyting/article/details/73196119
Data discretization: http://blog.csdn.net/mr_tyting/article/details/75212250
Data cleaning:
- Detecting anomalous samples and removing extreme values: http://blog.csdn.net/mr_tyting/article/details/77371157
Data preprocessing methods: http://blog.csdn.net/sinat_33761963/article/details/53433799
Data preprocessing: http://blog.csdn.net/bryan__/article/details/51228971
Machine learning --> data preprocessing with sklearn: http://blog.csdn.net/mr_tyting/article/details/73381661

2.2.2 Feature engineering:

A complete summary of feature engineering (Python source code): https://blog.csdn.net/javastart/article/details/77015603
Feature selection
- Overview:
  - Machine learning --> feature selection: http://blog.csdn.net/mr_tyting/article/details/73413979
  - http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
  - Feature selection with the Python machine learning library scikit-learn: https://blog.csdn.net/cheng9981/article/details/71023709
  - Feature selection methods: https://blog.csdn.net/bryan__/article/details/51607215
- Automatic feature selection:
  - Automatic feature selection by combining GBDT with LR: http://blog.csdn.net/lilyth_lilyth/article/details/48032119
  - Building combined features with GBDT: http://blog.csdn.net/sb19931201/article/details/65445514
  - Evaluating feature importance with random forests:
    - http://blog.csdn.net/HowardWood/article/details/79525326
    - http://blog.csdn.net/yawei_liu1688/article/details/78733428 -- worked example

2.3 Models and parameter tuning

General:
- Model selection: http://blog.csdn.net/mr_tyting/article/details/73440712
- Model ensembling: http://blog.csdn.net/mr_tyting/article/details/72957853
- Ways to improve a model: http://blog.csdn.net/roslei/article/details/53465283
- Simple usage of classification algorithms in Python sklearn: http://blog.csdn.net/bryan__/article/details/51288953
XGBoost + GBDT
- XGBoost API: http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit
- XGBoost source code: https://github.com/dmlc/xgboost
- Yuyin (餘音): an introduction to implementing GBDT, XGBoost, blending, and more: https://github.com/lytforgood/MachineLearningTrick
- GBDT principles and applications: http://blog.csdn.net/q383700092/article/details/53744277
- XGBoost parameter tuning:
  - https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python
  - Machine Learning Series (12): Complete Guide to Parameter Tuning in XGBoost (with Python code): https://blog.csdn.net/han_xiaoyang/article/details/52665396
  - Complete Guide to Parameter Tuning in XGBoost (with Python code): https://blog.csdn.net/u010657489/article/details/51952785
  - Parameters Tuning: bias-variance tradeoff:
    - Parameter documentation: https://xgboost.readthedocs.io/en/latest/parameter.html
  - Parameter tuning and preprocessing example: https://github.com/aarshayj/Analytics_Vidhya/tree/master/Articles/Parameter_Tuning_XGBoost_with_Example

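3. Notes

3.1 "An introduction to several common feature selection methods with scikit-learn"

How to select features using the coefficients of a regression model: the more important a feature is, the larger (in absolute value) its coefficient in the model, while features unrelated to the output variable have coefficients close to 0. This assumes the features are on comparable scales. On data with little noise, or where the number of samples far exceeds the number of features, even the simplest linear regression model can work very well if the features are relatively independent of one another.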
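
A minimal sketch of this idea on synthetic data, using NumPy's least-squares solver rather than any particular scikit-learn estimator. The data, the true coefficients (3, 1, 0), and the noise level are all assumptions chosen so that the ranking is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))  # three independent features on the same scale
# Assumed ground truth: feature 0 matters most, feature 2 is irrelevant.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(scale=0.1, size=n)

# Ordinary least squares with an intercept column appended.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
weights = coef[:3]

ranking = np.argsort(-np.abs(weights))  # most important feature first
print(weights.round(2), ranking)
```

With independent, equally scaled features and little noise, the fitted weights land close to the true ones, so ranking by absolute coefficient recovers the intended order.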

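In many real datasets, however, several features are correlated with one another. The model then becomes unstable: small changes in the data can cause large changes in the model (that is, in its coefficients, or parameters, which can be thought of as W). This makes the model hard to interpret and its predictions less reliable. The phenomenon is called multicollinearity.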
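
The instability can be reproduced with two nearly identical features. In this sketch (synthetic data; the noise scales are assumptions) the same design matrix is fitted against two slightly different noisy targets. The individual coefficients can swing widely between fits, while their sum stays close to the true value of 2.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.001, size=n)  # almost an exact copy of x1
X = np.column_stack([x1, x2])

def fit(X, y):
    """Least-squares coefficients without an intercept."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Same features, two slightly different noise draws on the target.
signal = x1 + x2
w_a = fit(X, signal + rng.normal(scale=0.1, size=n))
w_b = fit(X, signal + rng.normal(scale=0.1, size=n))

print("fit A:", w_a.round(2), "sum:", w_a.sum().round(2))
print("fit B:", w_b.round(2), "sum:", w_b.sum().round(2))
print("condition number:", round(np.linalg.cond(X)))
```

A large condition number of the feature matrix is one common warning sign; dropping one of the correlated features or adding L2 regularization are typical remedies.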

3.2 "[Machine Learning in Practice] Predicting user churn with scikit-learn"

Workflow: data preprocessing (char --> bool, delete less useful columns, normalization) --> models: KNN, SVM, RF (with cross-validation)
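
The modelling half of that workflow might look roughly like the sketch below. It runs on a synthetic dataset from `make_classification` as a stand-in for the churn table (an assumption, since the article's data is not reproduced here), uses `MinMaxScaler` for the normalization step, and leaves out the char-to-bool conversion and column dropping.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for a preprocessed churn table (hypothetical data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

models = {
    # Distance- and margin-based models are scale sensitive, so they get a scaler.
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(MinMaxScaler(), SVC()),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation; the scaler is refit inside each training fold.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

Putting the scaler inside the pipeline keeps statistics from the validation fold out of training, which is easy to get wrong when scaling is done once up front.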

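3.3 "Summary of techniques and tricks for Tianchi data mining competitions"

Workflow: data visualization --> data preprocessing --> feature engineering --> model ensembling

Data visualization: verifies our hypotheses about the data distribution, gives us a clear understanding of it, and lets us design sensible hand-written rules on that basis.
- Reference: http://blog.csdn.net/mr_tyting/article/details/73196119
Data preprocessing:
- Data cleaning:
  - Detecting anomalous samples and removing extreme values: http://blog.csdn.net/mr_tyting/article/details/77371157
  - Handling missing fields (four cases: many missing values; a moderate number missing in a non-continuous feature; a moderate number missing in a continuous feature; few missing values)
- Data sampling: imbalance between positive and negative samples
Feature engineering:
- Feature processing: http://blog.csdn.net/mr_tyting/article/details/73381661
- Discretizing continuous features: http://blog.csdn.net/mr_tyting/article/details/75212250
- Feature selection: http://blog.csdn.net/mr_tyting/article/details/73413979
- Model selection: http://blog.csdn.net/mr_tyting/article/details/73440712
Model ensembling: http://blog.csdn.net/mr_tyting/article/details/72957853
- Bagging (Random Forest)
- Stacking
- Boosting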

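3.4 "A first look at a LogisticRegression user churn prediction model" [recommended]

Logistic regression:
- Logistic regression assumes a binomially distributed response and is commonly used for binary classification
- Main uses of logistic regression:
  - Finding risk factors: for example, the risk factors of a disease;
  - Prediction: given the model, predict the probability that a disease or event occurs under different values of the independent variables;
  - Discrimination: similar to prediction; use the model to judge how likely it is that a person has a given disease or falls into a given situation.
Problem addressed:

The article takes a product from the author's own work as its subject and predicts user churn and retention.

Churn = the user consumed last month but not this month (strictly speaking, this is consumption churn).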
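
That definition translates directly into a labeling rule. The sketch below assumes per-user monthly spend is available as two dictionaries; the function name and the data layout are hypothetical.

```python
def label_churn(spend_last_month, spend_this_month):
    """Return {user: 1 if churned else 0} for users who spent last month."""
    labels = {}
    for user, amount in spend_last_month.items():
        if amount > 0:  # only users who were active last month can churn
            labels[user] = 0 if spend_this_month.get(user, 0) > 0 else 1
    return labels

last = {"u1": 30.0, "u2": 12.5, "u3": 0.0}
this = {"u1": 8.0, "u3": 5.0}
print(label_churn(last, this))  # {'u1': 0, 'u2': 1}
```

User u3 receives no label at all: with no spend last month there is nothing to lose, so the user falls outside the churn population.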

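The analysis uses a data period of one to two months. Under what circumstances does a user stop consuming?

A number of indicator features were chosen for the analysis, such as the number of purchases last month, the time of the most recent purchase (which can be quantified), and the purchase amount. These follow the RFM (Recency, Frequency, Monetary) model, which gives the analysis a principled basis. Other features include total viewing time, number of active days, session duration, and number of app launches.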
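
A small sketch of how the three RFM features could be derived from raw order records. The `(user, order_date, amount)` record layout and the `as_of` reference date are assumptions made for illustration.

```python
from datetime import date

def rfm(orders, as_of):
    """orders: list of (user, order_date, amount).

    Returns {user: (recency_days, frequency, monetary)}.
    """
    stats = {}
    for user, day, amount in orders:
        last, freq, total = stats.get(user, (None, 0, 0.0))
        last = day if last is None or day > last else last  # most recent order
        stats[user] = (last, freq + 1, total + amount)
    return {u: ((as_of - last).days, f, m) for u, (last, f, m) in stats.items()}

orders = [
    ("u1", date(2018, 3, 2), 20.0),
    ("u1", date(2018, 3, 20), 35.0),
    ("u2", date(2018, 3, 5), 10.0),
]
print(rfm(orders, as_of=date(2018, 4, 1)))
# {'u1': (12, 2, 55.0), 'u2': (27, 1, 10.0)}
```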

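"Key techniques in user churn analysis":

Classification model: decision trees (ID3, C4.5, C5.0)

Workflow: preparation (identify the independent and dependent variables; choose the information measure (entropy, information gain); define the stopping condition (purity, number of records, number of iterations)) --> select a feature --> create branches --> check whether to stop --> generate the result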
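
The "select a feature" step can be sketched with entropy and information gain, the criterion ID3 uses (C4.5 uses the related gain ratio instead). The toy rows and feature names below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rows = [
    {"plan": "basic", "region": "north"},
    {"plan": "basic", "region": "south"},
    {"plan": "pro", "region": "north"},
    {"plan": "pro", "region": "south"},
]
labels = ["churn", "churn", "stay", "stay"]

# Pick the feature whose split reduces entropy the most.
best = max(rows[0], key=lambda f: information_gain(rows, labels, f))
print(best, information_gain(rows, labels, best))
```

Here `plan` separates the classes perfectly (gain of 1 bit) while `region` tells us nothing (gain of 0), so the tree would branch on `plan` and stop, since both children are pure.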

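3.5 "Logistic regression in Python, with a data mining case study"

Step 1: Extract data from the database, combining 106 indicators
Step 2: Inspect and process the data
Step 3: Train the LR model
Step 4: Predict with the model and evaluate it
Step 5: Optimize the model: retrain with cross-validation + grid search

Note: decide whether recall or precision matters more for the task, and which metric is the most important one. In this competition, precision mattered more.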
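
To make the trade-off concrete, the sketch below computes precision and recall at two decision thresholds. The labels and predicted probabilities are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.8, 0.3]

for threshold in (0.5, 0.75):
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    print(threshold, precision_recall(y_true, y_pred))
# 0.5 (0.75, 0.75)
# 0.75 (1.0, 0.5)
```

Raising the threshold from 0.5 to 0.75 removes the one false positive (precision goes up to 1.0) but misses two real positives (recall drops to 0.5), which is the usual direction of the trade-off.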

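3.6 "Big data competition techniques"

Preprocessing: http://blog.csdn.net/bryan__/article/details/51228971
- Feature standardization: z-score, giving each feature zero mean and unit variance
- Min-max scaling
- Normalization: maps values with different ranges onto the same fixed range, most commonly [0, 1]; with an L1 or L2 norm, each sample is rescaled to unit norm
- Feature binarization
- Label binarization
- Categorical feature encoding
- Label encoding
- Features containing outliers
- Generating polynomial features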
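
Three of the transforms above are simple enough to write out by hand. This is a plain-Python sketch for a single feature column; the helper names are hypothetical, and in practice scikit-learn's `StandardScaler`, `MinMaxScaler`, and `Binarizer` cover the same ground.

```python
import statistics

def z_score(values):
    """Standardize to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def binarize(values, threshold):
    """Map values strictly above the threshold to 1, everything else to 0."""
    return [1 if v > threshold else 0 for v in values]

usage = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
print(z_score(usage))
print(min_max(usage))
print(binarize(usage, threshold=4.5))
```

Both scalers divide by a spread measure, so a constant column would raise a division-by-zero error; such columns carry no information and are usually dropped beforehand.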
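Feature engineering
- Build features from business logic
- Cross features
- Transformed features
- Sliding time-window features
- Avoid feature leakage (using information from the future)
- Consistent scales
- Discretizing continuous features
- Converting discrete features to numeric form (one-hot, sklearn LabelEncoder)
- Combine GBDT (Gradient Boosting Decision Tree) with LR to discover feature combinations automatically, avoiding manual construction
  - http://blog.csdn.net/lilyth_lilyth/article/details/48032119
- GBDT+FM
- Feature selection: http://blog.csdn.net/bryan__/article/details/51607215
Model design - classification: http://blog.csdn.net/bryan__/article/details/51288953
Model design - regression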
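
Two of the feature engineering items above, sliding time-window features and avoiding leakage, fit naturally into one sketch: aggregate only the events that fall inside a window ending strictly before the cutoff date. The event layout, cutoff, and window lengths are assumptions chosen for illustration.

```python
from datetime import date, timedelta

def window_features(events, cutoff, window_days):
    """Aggregate each user's usage in [cutoff - window_days, cutoff).

    events: list of (user, day, amount). Events on or after `cutoff` are
    ignored, so nothing from the label period leaks into the features.
    Returns {user: (event_count, total_amount)}.
    """
    start = cutoff - timedelta(days=window_days)
    feats = {}
    for user, day, amount in events:
        if start <= day < cutoff:
            count, total = feats.get(user, (0, 0.0))
            feats[user] = (count + 1, total + amount)
    return feats

events = [
    ("u1", date(2018, 1, 10), 5.0),
    ("u1", date(2018, 2, 20), 7.0),
    ("u1", date(2018, 3, 5), 9.0),  # in the label period, must not be used
    ("u2", date(2018, 2, 25), 3.0),
]
cutoff = date(2018, 3, 1)
for days in (30, 90):
    print(days, window_features(events, cutoff, days))
# 30 {'u1': (1, 7.0), 'u2': (1, 3.0)}
# 90 {'u1': (2, 12.0), 'u2': (1, 3.0)}
```

Computing the same aggregates over several window lengths gives the model both short-term and long-term views of a customer's behavior.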

3.7 XGBoost parameters

Parameter documentation: https://xgboost.readthedocs.io/en/latest/parameter.html

Control overfitting:

First way: directly control model complexity
- Includes max_depth, min_child_weight, gamma
Second way: add randomness to make training robust to noise
- Includes subsample, colsample_bytree
- Reduce the step size eta, but remember to increase num_round when you do so
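
Gathered into one parameter dictionary, the two approaches might look like the sketch below. The specific numbers are illustrative starting points chosen for this example, not tuned values from the competition.

```python
# Illustrative starting values only (assumptions, not tuned results).
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    # First way: directly limit model complexity.
    "max_depth": 4,
    "min_child_weight": 5,
    "gamma": 1.0,
    # Second way: add randomness so training is robust to noise.
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    # Smaller step size, compensated by more boosting rounds.
    "eta": 0.05,
}
num_boost_round = 1000

# With xgboost installed, these would typically be passed as:
# booster = xgboost.train(params, dtrain, num_boost_round=num_boost_round)
```

When eta is lowered, each tree contributes less, so the number of rounds usually has to grow; watching a validation metric with early stopping is a common way to find where to stop.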

Handle imbalanced dataset

If the ratio of sample counts between classes in the training set exceeds 4:1, the dataset can be considered imbalanced.

If you care only about the ranking order (AUC) of your prediction
- Balance the positive and negative weights via scale_pos_weight
- Use AUC for evaluation
If you care about predicting the right probability
- In such a case, you cannot re-balance the dataset
- Setting the parameter max_delta_step to a finite number (say 1) will help convergence
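
A small sketch of the two configurations. The negative-to-positive ratio used for `scale_pos_weight` follows the typical value suggested in the XGBoost documentation; the label counts and the helper function are hypothetical.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples in a binary label list."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return neg / pos

labels = [1] * 50 + [0] * 450  # hypothetical 9:1 imbalance
ratio = scale_pos_weight(labels)
print(ratio)  # 9.0

# Case 1: only the ranking matters, so reweight positives and track AUC.
ranking_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": ratio,
}

# Case 2: calibrated probabilities matter, so leave the class balance alone.
probability_params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "max_delta_step": 1,
}
```

Since this competition was scored by cross-entropy, which depends on the predicted probabilities themselves, the second configuration is the closer match to the setting described in this post.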

XGBoost parameter reference

Before running XGBoost, set three types of parameters: general, booster, and learning task parameters
- General parameters: relate to which booster is used for boosting, commonly a tree or linear model
- Booster parameters: depend on which booster you have chosen
- Learning task parameters: decide on the learning scenario; for example, regression tasks may use different parameters from ranking tasks
General parameters:
- booster [default=gbtree]
  - Which booster to use: gbtree, gblinear, or dart. gbtree and dart use tree-based models, while gblinear uses linear functions
- silent [default=0]
  - 0 prints running messages; 1 means silent mode (deprecated in later releases in favor of verbosity)
- nthread [defaults to the maximum number of threads available if not set]
- num_pbuffer [set automatically by XGBoost]
- num_feature [set automatically by XGBoost]
Parameters for tree booster
- eta [default=0.3, alias: learning_rate]
  - Step size shrinkage used in updates to prevent overfitting.
  - range: [0,1]
  - The smaller, the more conservative
- gamma [default=0, alias: min_split_loss]
  - Minimum loss reduction required to make a further partition on a leaf node of the tree.
  - The larger, the more conservative the algorithm will be.
  - range: [0,∞]
- max_depth [default=6]
  - Maximum depth of a tree; increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit; a limit is required for the depth-wise grow policy.
  - range: [0,∞]