1 How to handle categorical variables?
Method 1: Drop them (rarely used)
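Although rarely used in practice, dropping is the simplest baseline. A minimal sketch with made-up toy data (the frame names `X` and `X_val` follow the notes below):

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
X = pd.DataFrame({'Size': [1, 2, 3], 'Color': ['red', 'blue', 'red']})
X_val = pd.DataFrame({'Size': [4], 'Color': ['green']})

# Drop every object-dtype (categorical) column; simple, but discards information
drop_X = X.select_dtypes(exclude=['object'])
drop_X_val = X_val.select_dtypes(exclude=['object'])
```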
Method 2: LabelEncoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
X[col] = label_encoder.fit_transform(X[col])
X_val[col] = label_encoder.transform(X_val[col])
Method 3: OneHotEncoder:
Use: suitable for unordered (nominal) categorical features.
Note: avoid this method when a feature has more than 15 categories.
from sklearn.preprocessing import OneHotEncoder
One_H_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)  # sparse_output=False in scikit-learn >= 1.2
OH_cols_train = pd.DataFrame(One_H_encoder.fit_transform(X_train[object_cols]))
OH_cols_val = pd.DataFrame(One_H_encoder.transform(X_val[object_cols]))
OH_cols_train.index = X_train.index
OH_cols_val.index = X_val.index
num_X_train = X_train.drop(object_cols,axis=1)
num_X_val = X_val.drop(object_cols,axis=1)
OH_X_train = pd.concat([num_X_train,OH_cols_train],axis=1)
OH_X_val = pd.concat([num_X_val,OH_cols_val],axis=1)
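The cardinality rule above (skip one-hot encoding when a feature has more than 15 categories) can be applied by partitioning the object columns first; a sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical training frame: one low- and one high-cardinality categorical
X_train = pd.DataFrame({
    'Color': ['red', 'blue'] * 10,              # 2 categories -> encode
    'Id': [f'id{i}' for i in range(20)],        # 20 categories -> skip
})

object_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
# Keep only categorical columns with at most 15 distinct values for one-hot
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() <= 15]
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))
```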
2 Pipeline
Benefits of a Pipeline:
- Cleaner, more intuitive code
- Fewer chances for bugs
- Preprocessing steps can be applied in one batch
Code:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
X_full = pd.read_csv('X_train.csv')
X_test_full = pd.read_csv('X_test.csv')
X_full.dropna(axis=0,subset=['SalePrice'],inplace=True)
y = X_full.SalePrice
X_full.drop('SalePrice',axis=1,inplace=True)
X_train_full,X_val_full,y_train,y_valid = train_test_split(X_full,y,test_size=0.3,random_state=0)
categorical_cols = [cols for cols in X_train_full.columns if X_train_full[cols].nunique() <10 and X_train_full[cols].dtype == 'object']
numerical_cols = [cols for cols in X_train_full.columns if X_train_full[cols].dtype in ['int64','float64']]
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_val = X_val_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
# Step 1: preprocess numerical and categorical columns
numerical_transform = SimpleImputer()
categorical_transform = Pipeline(steps=[('imputer',SimpleImputer(strategy='most_frequent')),
                                        ('onehot',OneHotEncoder(handle_unknown='ignore',sparse=False))])
processor = ColumnTransformer(transformers=[('numerical',numerical_transform,numerical_cols),
                                            ('cat',categorical_transform,categorical_cols)])
# Step 2: bundle preprocessing and model, then fit and evaluate
model = RandomForestRegressor(n_estimators=100,random_state=0)
my_pipeline = Pipeline(steps=[('processor',processor),('model',model)])
my_pipeline.fit(X_train,y_train)
preds = my_pipeline.predict(X_val)
mean_error = mean_absolute_error(y_valid,preds)
3 Cross Validation
Best suited to: small datasets
Code:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
def get_score(n_estimators,X,y):
    my_pipeline = Pipeline(steps=[('imputer',SimpleImputer(strategy='median')),
                                  ('model',RandomForestRegressor(n_estimators=n_estimators,random_state=0))])
    score = -1*cross_val_score(my_pipeline,X,y,cv=5,scoring='neg_mean_absolute_error')
    return score.mean()
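A helper like this is typically called over a grid of candidate n_estimators values to pick the best one. A standalone sketch (it re-implements the helper so it runs on its own, and the toy data is made up):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def get_score(n_estimators, X, y):
    """Mean 5-fold MAE for a forest with the given number of trees."""
    my_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    score = -1 * cross_val_score(my_pipeline, X, y, cv=5,
                                 scoring='neg_mean_absolute_error')
    return score.mean()

# Hypothetical toy data, only to exercise the helper
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(60, 3), columns=['a', 'b', 'c'])
y = X['a'] * 2 + rng.rand(60) * 0.1

# Scan a few candidate values and keep the one with the lowest MAE
results = {n: get_score(n, X, y) for n in [10, 20]}
best_n = min(results, key=results.get)
```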
4 XGBoost
Idea: initialize a weak learner; use it to predict and compute the loss; train a new learner to correct that loss; add the new learner to the ensemble; then repeat these steps.
Key parameters:
- n_estimators: the number of base learners, i.e. the number of boosting rounds; usually between 100 and 1000. Too low underfits, too high overfits.
- early_stopping_rounds: stop early if the validation loss has not improved for this many rounds; 5 is a common choice. Pairing a high n_estimators with early_stopping_rounds works well.
- eval_set: used together with early_stopping_rounds to compute the validation score.
- n_jobs: worth setting on large datasets; it parallelizes training across CPU cores.
- learning_rate: a weight applied to each base learner's contribution instead of simply summing them; defaults to 0.1.
Code:
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=500,learning_rate=0.01,random_state=0)
model.fit(X_train,y_train,early_stopping_rounds=5,eval_set=[(X_valid,y_valid)],verbose=False)
5 Data leakage
1. Target Leakage:
- When it happens: when the features include data that will not be available at the time of prediction.
- In the dataset above, took_antibiotic_medicine changes almost in lockstep with got_pneumonia. A model trained on this data scores well on the validation set but performs very poorly in the real world. The reason: the model's purpose is to predict whether a patient has the disease, and patients who come in for a diagnosis have not yet received medication even if they are already ill, so this feature is not available at prediction time and the model's predictions are unreliable.
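The usual fix is to drop any feature whose value is only determined after the target is known. A minimal sketch reusing the pneumonia example's column names (the data values here are made up):

```python
import pandas as pd

# Hypothetical records: took_antibiotic_medicine is only known AFTER the
# diagnosis, so it leaks the target and must be excluded from the features
data = pd.DataFrame({
    'got_pneumonia':            [1, 0, 1, 0],
    'age':                      [65, 30, 72, 25],
    'took_antibiotic_medicine': [1, 0, 1, 0],
})

leaky_cols = ['took_antibiotic_medicine']
X = data.drop(['got_pneumonia'] + leaky_cols, axis=1)
y = data['got_pneumonia']
```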
2. Train-Test Contamination
- When it happens: when imputation or normalization is applied to the whole dataset before it is split into training and validation sets, letting validation data influence the preprocessing.
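To avoid this, split first and fit any imputer or scaler on the training rows only (or wrap the preprocessing in a Pipeline so each CV fold fits its own). A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical data with a missing value
X = pd.DataFrame({'feat': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0]})
y = pd.Series([0, 1, 0, 1, 0, 1])

# Split FIRST, then fit the imputer on the training rows only; the
# validation rows are filled using statistics learned from training data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)
imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train)
X_val_imp = imputer.transform(X_val)
```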