Kaggle Intermediate-ML

1 How to handle categorical variables?

Method 1: Drop the categorical columns (rarely used)
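
As a minimal sketch of the drop approach, the snippet below keeps only the numeric columns of a toy dataset. The column names and values are made up for illustration and are not from the course data.

```python
import pandas as pd

# Toy data; the column names are hypothetical, for illustration only
X_train = pd.DataFrame({"area": [50, 80, 120], "street": ["Pave", "Grvl", "Pave"]})
X_val = pd.DataFrame({"area": [65, 95], "street": ["Grvl", "Pave"]})

# Keep only numeric columns, which drops every categorical (text) column
drop_X_train = X_train.select_dtypes(include=["number"])
drop_X_val = X_val.select_dtypes(include=["number"])

print(list(drop_X_train.columns))
```

This discards whatever information the categorical columns carried, which is probably why the approach is rarely the best choice.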

Method 2: LabelEncoder

from sklearn.preprocessing import LabelEncoder

# Fit the mapping on the training column, then reuse it for the validation column
label_encoder = LabelEncoder()
X[col] = label_encoder.fit_transform(X[col])
X_val[col] = label_encoder.transform(X_val[col])
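
One caveat with the snippet above: LabelEncoder raises an error if the validation data contains a category it never saw during fitting. A possible alternative, sketched here on made-up data, is scikit-learn's OrdinalEncoder, which can map unseen categories to a sentinel value. The column name and the choice of -1 as the sentinel are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data; "excellent" never appears in the training column
X = pd.DataFrame({"quality": ["low", "high", "mid", "low"]})
X_val = pd.DataFrame({"quality": ["mid", "excellent"]})

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

X_enc = X.copy()
X_val_enc = X_val.copy()
X_enc["quality"] = encoder.fit_transform(X[["quality"]]).ravel()
X_val_enc["quality"] = encoder.transform(X_val[["quality"]]).ravel()

print(X_enc["quality"].tolist())
print(X_val_enc["quality"].tolist())
```

By default the categories are numbered in sorted order, so the integers do not necessarily reflect a meaningful ranking.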

Method 3: One-hot encoding

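Purpose: handles categorical features whose values have no natural order.
Note: avoid this method when a feature has more than 15 distinct categories.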
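
The cardinality rule above can be checked before encoding. The sketch below splits categorical columns by their number of distinct values. The data and column names are invented, and the threshold is lowered from 15 to 3 only so that the toy data shows both outcomes.

```python
import pandas as pd

# Toy data; column names are hypothetical
X_train = pd.DataFrame({
    "street": ["Pave", "Grvl", "Pave", "Grvl"],
    "owner_id": ["a1", "b2", "c3", "d4"],
    "area": [50, 80, 120, 95],
})

# The notes suggest 15; a smaller value is used here for the tiny example
max_cardinality = 3

# Treat every non-numeric column as categorical
object_cols = [c for c in X_train.columns
               if not pd.api.types.is_numeric_dtype(X_train[c])]

# Columns safe to one-hot encode versus columns with too many categories
low_cardinality_cols = [c for c in object_cols
                        if X_train[c].nunique() <= max_cardinality]
high_cardinality_cols = [c for c in object_cols
                         if c not in low_cardinality_cols]

print(low_cardinality_cols, high_cardinality_cols)
```

The high-cardinality columns could then be dropped or encoded another way.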

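import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense array (the argument is named `sparse` in scikit-learn < 1.2)
One_H_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(One_H_encoder.fit_transform(X_train[object_cols]))
OH_cols_val = pd.DataFrame(One_H_encoder.transform(X_val[object_cols]))

# One-hot encoding discards the index, so restore it
OH_cols_train.index = X_train.index
OH_cols_val.index = X_val.index

# Remove the original categorical columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_val = X_val.drop(object_cols, axis=1)

# Join the numerical columns with the one-hot encoded columns
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_val = pd.concat([num_X_val, OH_cols_val], axis=1)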

2 Pipeline

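Benefits of a pipeline:

  1. Makes the code more concise and easier to read
  2. Reduces the chance of bugs
  3. Lets all preprocessing and modeling steps run together as one batch

Code: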

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error

X_full = pd.read_csv('X_train.csv') 
X_test_full = pd.read_csv('X_test.csv')

# Drop rows with a missing target, then separate the target from the predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full = X_full.drop('SalePrice', axis=1)

X_train_full, X_val_full, y_train, y_val = train_test_split(X_full, y, test_size=0.3, random_state=0)


categorical_cols = [cols for cols in X_train_full.columns if X_train_full[cols].nunique() <10 and X_train_full[cols].dtype == 'object']

numerical_cols = [cols for cols in X_train_full.columns if X_train_full[cols].dtype in ['int64','float64']]

my_cols = categorical_cols + numerical_cols

X_train = X_train_full[my_cols].copy()
X_val = X_val_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# Step 1: define the preprocessing for numerical and categorical columns
numerical_transform = SimpleImputer()

# sparse_output is named `sparse` in scikit-learn < 1.2
categorical_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

processor = ColumnTransformer(transformers=[
    ('numerical', numerical_transform, numerical_cols),
    ('cat', categorical_transform, categorical_cols),
])

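# Step 2: define the model and bundle it with the preprocessing
model = RandomForestRegressor(n_estimators=100, random_state=0)

my_pipeline = Pipeline(steps=[('processor', processor), ('model', model)])

# Fit on the training data, predict on the validation data, and score the result
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_val)
mean_error = mean_absolute_error(y_val, preds)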

3 Cross Validation

Best suited for: small datasets

Code:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

def get_score(n_estimators, X, y):
    # Imputation lives inside the pipeline, so it is refit on each training fold
    my_pipeline = Pipeline(steps=[
        ('Imputer', SimpleImputer(strategy='median')),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    # scikit-learn reports negative MAE, so multiply by -1 to get a positive error
    score = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
    return score.mean()
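
A function like get_score can be used to compare several values of n_estimators. The sketch below repeats the function so that it runs on its own, and uses synthetic data from make_regression rather than the course dataset. The candidate values 10, 50 and 100 are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def get_score(n_estimators, X, y):
    """Return the mean 5-fold cross-validated MAE for a random forest."""
    my_pipeline = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("model", RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y, cv=5,
                                  scoring="neg_mean_absolute_error")
    return scores.mean()


# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Score a few candidate forest sizes and keep the one with the lowest MAE
results = {n: get_score(n, X, y) for n in (10, 50, 100)}
best_n = min(results, key=results.get)

print(results)
print("best n_estimators:", best_n)
```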
	

4 XGBoost

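Idea: start by initializing a weak learner, use it to make predictions, and compute the loss. Then train a new learner based on that loss, add it to the ensemble, and repeat these steps.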
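
The loop described above can be sketched from scratch for squared-error loss. This is plain gradient boosting with shallow scikit-learn decision trees, not XGBoost's actual implementation, which adds regularization and other refinements. The synthetic data, tree depth, learning rate and number of rounds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Initial weak model: predict the mean of the target everywhere
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the residuals are the direction that reduces the loss
    residuals = y - prediction
    # Train a new weak learner on the current errors
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    trees.append(tree)
    # Add the new learner to the ensemble, scaled by the learning rate
    prediction += learning_rate * tree.predict(X)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

Each round fits the errors left by the ensemble so far, which is why the training error keeps shrinking as rounds are added.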


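Key parameters:

  1. n_estimators: the number of learners, which can also be seen as the number of boosting rounds. It is typically set between 100 and 1000; too low causes underfitting, too high causes overfitting.
  2. early_stopping_rounds: stops training early when the validation loss has not improved for this many consecutive rounds. A common value is 5. Pairing a high n_estimators with early_stopping_rounds is a good choice.
  3. eval_set: used together with early_stopping_rounds; it supplies the data on which the validation score is computed.
  4. n_jobs: worth setting when the dataset is large; it runs training in parallel across CPU cores.
  5. learning_rate: scales each base learner's contribution by a weight instead of simply summing the predictions. The default is given here as 0.1; it depends on the XGBoost version, so check your installed release.

Code: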

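from xgboost import XGBRegressor

# Recent XGBoost releases take early_stopping_rounds in the constructor;
# older releases accepted it as an argument to fit() instead.
model = XGBRegressor(n_estimators=500, learning_rate=0.01, random_state=0, early_stopping_rounds=5)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)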
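
To see early stopping in action without installing XGBoost, the sketch below uses scikit-learn's GradientBoostingRegressor as a stand-in. It is a different library with different parameter names: n_iter_no_change plays roughly the role of early_stopping_rounds, and validation_fraction holds out part of the training data in place of eval_set. The synthetic data and all parameter values are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, for illustration only
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.05,
    n_iter_no_change=5,       # stop after 5 rounds without validation improvement
    validation_fraction=0.2,  # share of the training data held out for that check
    random_state=0,
)
model.fit(X, y)

# n_estimators_ is the number of rounds actually fitted before stopping
print("rounds requested:", model.n_estimators)
print("rounds fitted:", model.n_estimators_)
```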

5 Data leakage

1. Target leakage:

  • When it occurs: when the dataset contains information that will not be available at the time predictions are made.

[Image: example dataset that includes the columns got_pneumonia and took_antibiotic_medicine]

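  • In this example dataset, the value of took_antibiotic_medicine usually changes along with got_pneumonia. A model trained on this data performs well on the validation set, but its accuracy is often very low in the real world. The reason is that the model is meant to predict whether a patient has pneumonia, and patients who come in for a diagnosis have not yet received any medicine, even if they are already ill. The feature therefore carries no useful information at prediction time, and predictions that rely on it are unreliable.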
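
The effect can be reproduced on synthetic data. In the sketch below, took_antibiotic_medicine is generated to mirror got_pneumonia except for 5% of rows, while age is pure noise. All values are simulated, and the 5% flip rate and the model settings are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Simulated target and an unrelated feature
got_pneumonia = rng.integers(0, 2, size=n)
age = rng.integers(20, 80, size=n)

# Leaky feature: medicine is given after diagnosis, so it nearly copies the target
flip = rng.random(n) < 0.05
took_antibiotic_medicine = np.where(flip, 1 - got_pneumonia, got_pneumonia)

X = pd.DataFrame({"age": age, "took_antibiotic_medicine": took_antibiotic_medicine})
y = got_pneumonia

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validated accuracy with and without the leaky column
acc_with_leak = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
acc_without_leak = cross_val_score(
    model, X.drop(columns="took_antibiotic_medicine"), y, cv=5, scoring="accuracy"
).mean()

print(f"with leaky feature:    {acc_with_leak:.3f}")
print(f"without leaky feature: {acc_without_leak:.3f}")
```

The high validation score with the leaky column is misleading, because that column would not be known for a new patient at diagnosis time.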

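2. Train-test contamination:

  • When it occurs: when imputation or normalization is applied to the whole dataset before it is split into training and validation sets.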
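
A minimal sketch of the difference, on synthetic data: the contaminated imputer computes its fill value from every row, including validation rows, while the clean one is fitted on the training rows only and then applied to both splits. The data, the 20% missing rate and the mean strategy are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic single-feature data with about 20% missing values
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(100, 1))
X[rng.random(100) < 0.2, 0] = np.nan
y = rng.normal(size=100)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Contaminated: the fill value is computed from all rows, validation included
leaky_imputer = SimpleImputer(strategy="mean").fit(X)

# Clean: fit on the training rows only, then apply the same fill value to both splits
clean_imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train_clean = clean_imputer.transform(X_train)
X_val_clean = clean_imputer.transform(X_val)

print("fill value from all rows:     ", leaky_imputer.statistics_[0])
print("fill value from training rows:", clean_imputer.statistics_[0])
```

Putting the imputer inside a Pipeline gives the same protection automatically during cross-validation, since it is refitted on each training fold.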