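Data Processing (2.1): Click Data Processing - LightGBM Training in Practice

Original

2020-06-10 21:37

This article lists the lgb training function used in the previous article, which focused on explaining preprocessing and postprocessing in detail.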

import lightgbm as lgb
import numpy as np

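1. Input parameters

The main input parameters are:

x_train: feature columns of the training set

y_train: label column of the training set

x_test: feature columns of the validation set

y_test: label column of the validation set

cate_cols: names the categorical features

job: the task type, e.g. job='classification'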

def base_train(x_train, y_train, x_test, y_test, cate_cols=None, job='classification'):

2. Check whether cate_cols was given; if not, set it to 'auto'

    if not cate_cols:
        cate_cols = 'auto'
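The check above can be sketched as a standalone helper to show which inputs fall back to 'auto'. The helper name resolve_cate_cols and the column names are hypothetical and are not part of the original function. Note that `not cate_cols` is true for both None and an empty list.

```python
def resolve_cate_cols(cate_cols=None):
    """Return the value to pass as categorical_feature (hypothetical helper)."""
    # `not cate_cols` is True for None and for empty containers such as [],
    # so both cases fall back to 'auto'.
    if not cate_cols:
        return 'auto'
    return list(cate_cols)


print(resolve_cate_cols())                    # no columns given
print(resolve_cate_cols([]))                  # empty list behaves like None
print(resolve_cate_cols(['city', 'device']))  # hypothetical column names
```

With 'auto' and a pandas DataFrame as input, LightGBM treats columns of pandas categorical dtype as categorical features, so the fallback is only useful if those columns were cast to category beforehand.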

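3. Convert to Dataset objects and build the validation set

The validation Dataset must be given the training Dataset through reference=lgb_train, so that it is constructed with the same feature binning as the training data.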

    lgb_train = lgb.Dataset(x_train, y_train, categorical_feature=cate_cols)
    lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train, categorical_feature=cate_cols)

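4. Choose the training parameters according to job

Here we use the classification task.

Official documentation: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html

boosting_type

Selects the boosting algorithm. 'gbdt' is the traditional Gradient Boosting Decision Tree, 'dart' is Dropouts meet Multiple Additive Regression Trees (it is not an AdaBoost variant), 'goss' is Gradient-based One-Side Sampling, and 'rf' is random forest.

objective

Specifies the learning task and the corresponding learning objective, or a custom objective function. Defaults: 'regression' for LGBMRegressor, 'binary' or 'multiclass' for LGBMClassifier, and 'lambdarank' for LGBMRanker.

num_leaves

Maximum number of leaves in each base learner (tree).

learning_rate

Learning rate, i.e. the shrinkage applied to each tree's contribution.

feature_fraction

Fraction of features randomly selected for each tree; 0.9 means each tree is built from 90% of the features.

bagging_fraction

Fraction of the training rows randomly sampled, without resampling, for bagging. It only takes effect when bagging_freq is greater than 0.

bagging_freq

Bagging frequency: k means the rows are re-sampled every k iterations, and 0 disables bagging.

verbose

Logging verbosity: values below 0 print fatal errors only, 0 adds errors and warnings, 1 adds info, and values above 1 add debug output.

use_missing
boost_from_average

use_missing=False turns off LightGBM's special handling of missing values. boost_from_average controls whether the initial score is set to the mean of the labels for faster convergence; it is switched off here.

(These parameters are described on the Parameters page of the LightGBM documentation.)

n_jobs

Number of parallel threads.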

    if job == 'classification':
        # Binary classification, evaluated with log loss
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'binary_logloss',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 2,
            'use_missing': False,
            'boost_from_average': False,
            'n_jobs': -1
        }
    elif job == 'regression':
        # Regression, evaluated with both L2 and L1 error
        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': {'l2', 'l1'},
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 2,
            'n_jobs': -1
        }
    else:
        raise ValueError("job must be 'classification' or 'regression', got {!r}".format(job))
    print('Starting training...')
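The two dictionaries above share most of their entries. As one possible alternative layout, the shared settings can be kept in a single place and merged with the job-specific ones. This is only a sketch: the names COMMON_PARAMS, JOB_PARAMS and build_params are hypothetical and do not appear in the original function; the values are copied from the dictionaries above.

```python
import copy

# Settings shared by both jobs, copied from the dictionaries above.
COMMON_PARAMS = {
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 2,
    'n_jobs': -1,
}

# Settings that differ between the two jobs.
JOB_PARAMS = {
    'classification': {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'use_missing': False,
        'boost_from_average': False,
    },
    'regression': {
        'objective': 'regression',
        'metric': {'l2', 'l1'},
    },
}


def build_params(job='classification'):
    """Return a fresh parameter dict for the given job (hypothetical helper)."""
    if job not in JOB_PARAMS:
        raise ValueError("unknown job: {!r}".format(job))
    params = dict(COMMON_PARAMS)
    # Deep-copy so callers can modify the result without touching the templates.
    params.update(copy.deepcopy(JOB_PARAMS[job]))
    return params


print(build_params('classification')['objective'])
print(sorted(build_params('regression')['metric']))
```

Adding a third job then only requires a new entry in JOB_PARAMS, and a change to a shared setting such as learning_rate is made once.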

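5. Calling the training function

lgb_train

The training data.

num_boost_round=1000

Maximum number of boosting iterations (trees).

valid_sets

The validation dataset(s).

early_stopping_rounds

Training stops when the validation metric has not improved for this many consecutive rounds.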

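    # train
    # Note: the early_stopping_rounds argument was removed from lgb.train in
    # LightGBM 4.0; there, pass callbacks=[lgb.early_stopping(5)] instead.
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=1000,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)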
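To make the stopping rule concrete, here is a minimal pure-Python sketch of early stopping with a patience of 5 rounds. It is an illustration of the idea, not LightGBM's implementation: the function name early_stopping_round and the loss values are made up, and a lower validation loss is assumed to be better.

```python
def early_stopping_round(val_losses, patience=5):
    """Simulate early stopping over a sequence of validation losses.

    Returns (best_iteration, stopped_at), both 1-based.
    """
    best_loss = float('inf')
    best_iteration = 0
    for iteration, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best score: remember it and reset the patience window.
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_iteration, iteration
    # Ran out of rounds without triggering early stopping.
    return best_iteration, len(val_losses)


# Made-up validation losses: the best value is reached at round 3.
losses = [0.69, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.40]
best, stopped = early_stopping_round(losses, patience=5)
print("best iteration: {}, stopped at: {}".format(best, stopped))
```

In this run the loop stops at round 8 and never sees the better value at round 9, which shows the trade-off of a small patience such as 5: training is cheap, but a late improvement can be missed.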

6. Save the model

    print('Saving model...')
    gbm.save_model("./model.txt")

7. Predict on the test set with the model

num_iteration=gbm.best_iteration

Predicts with the best iteration found during training.

    y_pred_prob = gbm.predict(x_test, num_iteration=gbm.best_iteration)

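8. Model evaluation

Required imports:

from sklearn.metrics import precision_score, recall_score, roc_auc_score

Call the roc_auc_score function

Pass in the validation labels and the predicted probabilities for the validation set; comparing the two yields the AUC.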

    if job == 'classification':
        res_auc = roc_auc_score(y_test, y_pred_prob)
        print("AUC: {}".format(res_auc))
        # if res_auc < 0.75:
        #     logging.error("auc too low, maybe some error, please recheck it. Training aborted!")
        #     sys.exit(3)
        for i in np.arange(0.1, 1, 0.1):
            # {:.1f} avoids printing float noise such as 0.30000000000000004
            print("threshold is {:.1f}: ".format(i))
            evaluation(y_test, y_pred_prob, threshold=i)
    elif job == 'regression':
        pass
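As a rough illustration of what the AUC printed above measures: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs. The function name pairwise_auc is hypothetical, and this quadratic loop is only meant for small examples, not as a replacement for roc_auc_score.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC: {}".format(pairwise_auc(labels, scores)))  # 3 of 4 pairs are ordered correctly
```

Because only the ordering of the scores matters, the AUC does not depend on the threshold, which is why the loop above reports precision and recall separately for each threshold.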

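The evaluation function

It takes the validation labels and the predicted probabilities for the validation set, binarizes the probabilities at threshold, and compares the resulting predictions with the labels.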

def evaluation(y_true, y_pred_prob, threshold=0.5):
    # Turn the predicted probabilities into 0/1 labels at the given threshold
    y_pred = np.where(y_pred_prob > threshold, 1, 0)

    res = precision_score(y_true, y_pred)
    print("precision_score : {}".format(res))
    res = recall_score(y_true, y_pred)
    print("recall_score : {}".format(res))
    res = roc_auc_score(y_true, y_pred_prob)
    print("roc_auc_score : {}".format(res))

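precision_score = tp / (tp + fp)

recall_score = tp / (tp + fn)

tp: a positive sample predicted as positive (true positive)

fn: a positive sample predicted as negative (false negative)

fp: a negative sample predicted as positive (false positive)

tn: a negative sample predicted as negative (true negative)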
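The definitions above can be checked by hand with a small pure-Python sketch that counts tp, fp, fn and tn at a given threshold and derives precision and recall from them. The function names and the sample data are made up for illustration. Returning 0.0 when a denominator is zero is a choice made in this sketch.

```python
def confusion_counts(y_true, y_pred_prob, threshold=0.5):
    """Count (tp, fp, fn, tn) after binarizing probabilities at threshold."""
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, y_pred_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred_prob, threshold=0.5):
    """Return (precision, recall); 0.0 is used when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred_prob, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predicted probabilities.
labels = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]
for t in (0.5, 0.7):
    print(t, confusion_counts(labels, probs, t), precision_recall(labels, probs, t))
```

Raising the threshold from 0.5 to 0.7 turns one true positive into a false negative here, so recall drops from 2/3 to 1/3. This is the effect the threshold sweep in step 8 is meant to expose.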

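9. Feature importance

feature_importance

Computes the importance of each feature and prints the features in descending order of importance; features whose gain is (near) zero are collected and listed separately.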

def feature_importance(gbm):
    importance = gbm.feature_importance(importance_type='gain')
    names = gbm.feature_name()
    print("-" * 10 + 'feature_importance:')
    no_weight_cols = []
    for name, score in sorted(zip(names, importance), key=lambda x: x[1], reverse=True):
        if score <= 1e-8:
            no_weight_cols.append(name)
        else:
            print('{}: {}'.format(name, score))
    print("no weight columns: {}".format(no_weight_cols))
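The sorting and filtering logic of feature_importance can be tried without a trained model. The sketch below applies the same 1e-8 cutoff to plain lists; the function name split_by_importance, the feature names and the scores are all made up for illustration. With importance_type='gain' the score is the total gain of the splits that use the feature, while 'split' would count how many times the feature is used.

```python
def split_by_importance(names, importance, eps=1e-8):
    """Sort features by score and separate those with (near-)zero importance."""
    ranked = sorted(zip(names, importance), key=lambda x: x[1], reverse=True)
    weighted = [(name, score) for name, score in ranked if score > eps]
    no_weight = [name for name, score in ranked if score <= eps]
    return weighted, no_weight


# Hypothetical feature names and gain values.
names = ['hour', 'device', 'city', 'banner_pos']
gains = [120.5, 0.0, 33.2, 0.0]

weighted, no_weight = split_by_importance(names, gains)
for name, score in weighted:
    print('{}: {}'.format(name, score))
print("no weight columns: {}".format(no_weight))
```

Returning the two lists instead of only printing them makes it easy to drop the zero-importance columns before retraining.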

10. Return the gbm model

Training is finished.

    return gbm
