Hands-On Machine Learning Series - Part 5 (I) - XGBoost

Introduction

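In Part 4 we briefly introduced how to use GBDT. This article covers XGBoost, an advanced GBDT implementation and an excellent solution in practice. Official project site: https://xgboost.readthedocs.io/en/latest/.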

XGBoost's biggest strength is the large amount of engineering work behind it, which makes the algorithm usable across many ecosystems (Python, Spark, C++, and others).

Python + XGBoost

Dissecting the official demo

Below is a demo from the official XGBoost site:

import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
preds = bst.predict(dtest)

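The data set used here ships with the xgboost source tree, so you need to fetch it first. You can git clone the whole xgboost repository as the official site suggests (this sometimes fails on a slow network), or go to GitHub (https://github.com/dmlc/xgboost/tree/master/demo/data), find the files in that directory, and copy their contents into local files. Use whichever is quicker.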

First, let us look at the format of the data set.

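This is clearly clean, preprocessed data. The first column is the target variable. (Many data sets put the target in the last column instead, but that does not matter: what counts is how the data is parsed when it is read, and we will look at those implementation details shortly.) The feature variables are written as k:v pairs, a very common format for sparse vectors: only the indices and values of the nonzero entries are recorded. Its counterpart, the dense vector, stores a value for every position, zeros included.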
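
To make the sparse versus dense distinction concrete, here is a minimal sketch that expands one "label k:v k:v ..." line into a dense list. The function name parse_libsvm_line and the sample line are made up for illustration, and the sketch assumes the indices can be used directly as 0-based positions; it is not how xgboost itself parses the file.

```python
def parse_libsvm_line(line, num_features):
    """Parse one 'label index:value ...' line into (label, dense_vector)."""
    tokens = line.split()
    label = float(tokens[0])          # the first token is the target value
    dense = [0.0] * num_features      # every position not listed stays 0
    for token in tokens[1:]:
        index, value = token.split(":")
        dense[int(index)] = float(value)
    return label, dense

# Hypothetical sample line, not copied from agaricus.txt.train
label, dense = parse_libsvm_line("1 0:1 3:1 5:0.5", num_features=6)
print(label)  # 1.0
print(dense)  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.5]
```

The sparse line mentions three entries, while the dense form has to spell out all six positions. With 120-plus mostly zero columns the saving is much larger.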

So what are these 120-plus feature variables? The file xgboost/demo/data/featmap.txt on GitHub tells us.

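From the contents of this file we can infer that the features have been one-hot encoded. Many machine learning algorithms cannot work directly with categorical (enumerated) variables, so this kind of feature engineering is needed: each distinct value of a categorical variable becomes its own 0/1 column.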
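
As a rough illustration of one-hot encoding, the sketch below turns a single categorical column into 0/1 columns in plain Python. The helper one_hot_encode and the category names are hypothetical and chosen only for the example; they are not read from featmap.txt.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into 0/1 rows, one column per category."""
    categories = sorted(set(values))                      # fixed column order
    column_of = {c: i for i, c in enumerate(categories)}  # category -> column index
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[column_of[v]] = 1                             # exactly one 1 per row
        rows.append(row)
    return categories, rows

# Hypothetical categorical feature; the names are for illustration only
categories, rows = one_hot_encode(["bell", "flat", "bell", "convex"])
print(categories)  # ['bell', 'convex', 'flat']
print(rows)        # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Because every row contains a single 1 per original variable, the encoded data is mostly zeros, which is why a sparse k:v representation suits it well.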

That gives us a first understanding of the data. Now let us look at the code.

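The code above reads the data with DMatrix. Calling type(dtrain) returns xgboost.core.DMatrix, so this is a data format defined by xgboost itself. That raises a question. After
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
the features and the target variable are stored together in a single data set. This differs from the sklearn interface we have used so far. Recall that we normally split the data into X_train and y_train, a feature set and a target variable, and then call .fit(X_train, y_train) to fit the model. Here, however, the call is:
bst = xgb.train(param, dtrain, num_round)
We can treat train as the equivalent of the fit method of sklearn estimators, but only dtrain is passed in, with no indication of which column is the target. How does train know which column to fit against? Our guess is that train and the DMatrix data structure share an agreed format. To find out what really happens (yes, keep some curiosity; it makes coding more fun), let us dig into the xgboost source code on GitHub.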

After some searching, we find the definition of the DMatrix class in dmlc/xgboost/blob/master/python-package/xgboost/core.py.

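So the constructor does accept a target vector (label), but it may be left empty by default. Our guess is that when it is empty, the target vector is extracted from the data passed in.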

-------------- Skip the following walkthrough if you are not interested --------------
Reading on, we find this clue:

if label is not None:
    self.set_label(label)

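So when label is not None, set_label stores it directly. But what if label is not passed? Surprisingly, we cannot find any branch that handles "if label is None"! It must be handled implicitly somewhere, so we keep looking for clues: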

    def get_label(self):
        """Get the label of the DMatrix.
        Returns
        -------
        label : array
        """
        return self.get_float_info('label')

....

    def get_float_info(self, field):
        """Get float property from the DMatrix.
        Parameters
        ----------
        field: str
            The field name of the information
        Returns
        -------
        info : array
            a numpy array of float information of the data
        """
        length = c_bst_ulong()
        ret = ctypes.POINTER(ctypes.c_float)()
        _check_call(_LIB.XGDMatrixGetFloatInfo(self.handle,
                                               c_str(field),
                                               ctypes.byref(length),
                                               ctypes.byref(ret)))
        return ctypes2numpy(ret, length.value, np.float32)
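
The out-parameter pattern in get_float_info (a length and a pointer filled in by the native side, then copied into an array) can be tried out with ctypes alone. This is only a sketch: the native call is simulated in Python, native_buffer and copy_floats are hypothetical names, and ctypes.c_uint64 merely stands in for c_bst_ulong.

```python
import ctypes

# Stand-in for memory owned by a native library (hypothetical values)
native_buffer = (ctypes.c_float * 3)(0.0, 1.0, 1.0)

# Out-parameters, created empty just as in get_float_info
length = ctypes.c_uint64()
ret = ctypes.POINTER(ctypes.c_float)()   # a null float pointer

# Simulate what the native function would do through the byref() arguments:
# report how many floats there are and where they start
length.value = len(native_buffer)
ret = ctypes.cast(native_buffer, ctypes.POINTER(ctypes.c_float))

def copy_floats(ptr, n):
    """Copy n floats out of native memory into a Python list."""
    return [ptr[i] for i in range(n)]

values = copy_floats(ret, length.value)
print(values)  # [0.0, 1.0, 1.0]
```

The pointer alone carries no size information, which is why the length has to travel alongside it.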

At this point, _LIB.XGDMatrixGetFloatInfo turns out to be a function in the native C API; its definition is in xgboost/src/c_api/c_api.cc:

XGB_DLL int XGDMatrixGetFloatInfo(const DMatrixHandle handle,
                                  const char* field,
                                  xgboost::bst_ulong* out_len,
                                  const bst_float** out_dptr) {
  API_BEGIN();
  CHECK_HANDLE();
  const MetaInfo& info = static_cast<std::shared_ptr<DMatrix>*>(handle)->get()->Info();
  const std::vector<bst_float>* vec = nullptr;
  if (!std::strcmp(field, "label")) {
    vec = &info.labels_.HostVector();
...
}

When the label is requested, the function returns &info.labels_.HostVector(), and info comes from handle, so we still need to find where that object is created.

if isinstance(data, (STRING_TYPES, os_PathLike)):
    handle = ctypes.c_void_p()
    _check_call(_LIB.XGDMatrixCreateFromFile(c_str(os_fspath(data)),
                                             ctypes.c_int(silent),
                                             ctypes.byref(handle)))
    self.handle = handle

...

int XGDMatrixCreateFromFile(const char *fname,
                            int silent,
                            DMatrixHandle *out) {
  API_BEGIN();
  bool load_row_split = false;
  if (rabit::IsDistributed()) {
    LOG(CONSOLE) << "XGBoost distributed mode detected, "
                 << "will split data among workers";
    load_row_split = true;
  }
  *out = new std::shared_ptr<DMatrix>(DMatrix::Load(fname, silent != 0, load_row_split));
  API_END();
}

...

# Abridged below; only the clues we found are kept

# src/data/data.cc
std::unique_ptr<dmlc::Stream> fi(dmlc::Stream::Create(fname.c_str(), "r", true));
common::PeekableInStream is(fi.get());
DMatrix* dmat = new data::SimpleDMatrix(&is);



...

#src/data/simple_dmatrix.cc

SimpleDMatrix::SimpleDMatrix(dmlc::Stream* in_stream) {
  int tmagic;
  CHECK(in_stream->Read(&tmagic, sizeof(tmagic)) == sizeof(tmagic))
      << "invalid input file format";
  CHECK_EQ(tmagic, kMagic) << "invalid format, magic number mismatch";
  info.LoadBinary(in_stream);
  in_stream->Read(&sparse_page_.offset.HostVector());
  in_stream->Read(&sparse_page_.data.HostVector());
}





#src/data/data.cc

void MetaInfo::LoadBinary(dmlc::Stream *fi) {
  auto version = Version::Load(fi);
  auto major = std::get<0>(version);
  // MetaInfo is saved in `SparsePageSource'.  So the version in MetaInfo represents the
  // version of DMatrix.
  CHECK_EQ(major, 1) << "Binary DMatrix generated by XGBoost: "
                     << Version::String(version) << " is no longer supported. "
                     << "Please process and save your data in current version: "
                     << Version::String(Version::Self()) << " again.";

  const uint64_t expected_num_field = kNumField;
  uint64_t num_field { 0 };
  CHECK(fi->Read(&num_field)) << "MetaInfo: invalid format";
  CHECK_GE(num_field, expected_num_field)
    << "MetaInfo: insufficient number of fields (expected at least " << expected_num_field
    << " fields, but the binary file only contains " << num_field << "fields.)";
  if (num_field > expected_num_field) {
    LOG(WARNING) << "MetaInfo: the given binary file contains extra fields which will be ignored.";
  }

  LoadScalarField(fi, u8"num_row", DataType::kUInt64, &num_row_);
  LoadScalarField(fi, u8"num_col", DataType::kUInt64, &num_col_);
  LoadScalarField(fi, u8"num_nonzero", DataType::kUInt64, &num_nonzero_);
  LoadVectorField(fi, u8"labels", DataType::kFloat32, &labels_);
  LoadVectorField(fi, u8"group_ptr", DataType::kUInt32, &group_ptr_);
  LoadVectorField(fi, u8"weights", DataType::kFloat32, &weights_);
  LoadVectorField(fi, u8"base_margin", DataType::kFloat32, &base_margin_);
}

...
template <typename T>
void LoadScalarField(dmlc::Stream* strm, const std::string& expected_name,
                     xgboost::DataType expected_type, T* field) {
  const std::string invalid {"MetaInfo: Invalid format. "};
  std::string name;
  xgboost::DataType type;
  bool is_scalar;
  CHECK(strm->Read(&name)) << invalid;
  CHECK_EQ(name, expected_name)
      << invalid << " Expected field: " << expected_name << ", got: " << name;
  CHECK(strm->Read(&type)) << invalid;
  CHECK(type == expected_type)
      << invalid << "Expected field of type: " << static_cast<int>(expected_type) << ", "
      << "got field type: " << static_cast<int>(type);
  CHECK(strm->Read(&is_scalar)) << invalid;
  CHECK(is_scalar)
    << invalid << "Expected field " << expected_name << " to be a scalar; got a vector";
  CHECK(strm->Read(field, sizeof(T))) << invalid;
}

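At this point we can see that the data is read from the file and written into attributes of the data object; labels_ is one of the fields populated this way.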
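
The idea behind LoadScalarField, reading a named field from a stream and failing loudly when the name is not the expected one, can be sketched in a few lines of Python. The byte layout below (name length, name, scalar flag, 64-bit value) is a simplified assumption made up for this example and is not XGBoost's actual binary format; write_scalar_field and load_scalar_field are hypothetical helpers.

```python
import io
import struct

def write_scalar_field(stream, name, value):
    """Write name length, name bytes, is_scalar flag, uint64 value (made-up layout)."""
    encoded = name.encode("utf-8")
    stream.write(struct.pack("<Q", len(encoded)))
    stream.write(encoded)
    stream.write(struct.pack("<?", True))
    stream.write(struct.pack("<Q", value))

def load_scalar_field(stream, expected_name):
    """Read one field and raise if it is not the field we expected."""
    (name_len,) = struct.unpack("<Q", stream.read(8))
    name = stream.read(name_len).decode("utf-8")
    if name != expected_name:
        raise ValueError("Expected field: %s, got: %s" % (expected_name, name))
    (is_scalar,) = struct.unpack("<?", stream.read(1))
    if not is_scalar:
        raise ValueError("Expected field %s to be a scalar" % expected_name)
    (value,) = struct.unpack("<Q", stream.read(8))
    return value

# Hypothetical values, written and then read back in the same order
buffer = io.BytesIO()
write_scalar_field(buffer, "num_row", 100)
write_scalar_field(buffer, "num_col", 5)
buffer.seek(0)
num_row = load_scalar_field(buffer, "num_row")
num_col = load_scalar_field(buffer, "num_col")
print(num_row, num_col)  # 100 5
```

Storing the field name next to each value lets the reader detect a file written in a different order or by an incompatible version instead of silently misreading it.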
------------------ End of the skippable section ------------------

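Further on, bst.predict outputs the predictions. Printing them shows that they are actually probabilities, so you can choose the classification threshold yourself.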
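
Turning probabilities into class labels is a one-line comparison against the chosen threshold. In the sketch below, the helper to_labels and the probability values are hypothetical stand-ins for the output of bst.predict(dtest).

```python
def to_labels(probabilities, threshold=0.5):
    """Map each predicted probability to class 1 if it reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical probabilities standing in for the output of bst.predict(dtest)
preds = [0.92, 0.08, 0.51, 0.30]
print(to_labels(preds))                 # [1, 0, 1, 0]
print(to_labels(preds, threshold=0.6))  # [1, 0, 0, 0]
```

Raising the threshold makes the model more conservative about predicting the positive class, as the third sample shows.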

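To sum up, xgboost's native data structure and calling interface differ from the sklearn estimators we are used to, and they are not especially convenient. However, xgboost also ships wrapper classes for use with sklearn. Let us take a quick look.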

Using the sklearn version

The classes wrapped for use with sklearn live in https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py.
Here is a complete code sample; you will recognize the familiar sklearn flavor:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Load the breast cancer data set as a feature matrix X and a target vector y
X, y = load_breast_cancer(return_X_y=True)

# test_size sets the fraction of samples held out for testing; random_state fixes
# the shuffle so the experiment is reproducible (without it, the next run may
# produce a different split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit through the sklearn-style interface, then predict class labels
xgb_model = xgb.XGBClassifier().fit(X_train, y_train)
y_predict = xgb_model.predict(X_test)
print("Accuracy: %f" % accuracy_score(y_test, y_predict))
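
Why a fixed random_state makes the split reproducible can be seen with a small stand-alone sketch. It uses Python's random module and a hypothetical helper split_indices; it illustrates the idea only and is not how train_test_split is implemented.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle 0..n_samples-1 with a fixed seed and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # same seed, same shuffle
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]    # (train, test)

train_a, test_a = split_indices(10, test_size=0.2, seed=0)
train_b, test_b = split_indices(10, test_size=0.2, seed=0)
print(test_a == test_b)            # True: the same seed reproduces the same split
print(len(train_a), len(test_a))   # 8 2
```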

Conclusion

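In this article we walked through how to use xgboost from a practical angle, expanding on a few points along the way. We covered very little of the algorithm's theory, though, so in the next article we will read the original XGBoost paper together. Stay tuned to this public account.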
