[Kaggle Getting Started] Titanic: Machine Learning from Disaster ---- Model Optimization Attempt (Part 2)


This series of posts is purely a record of my own Kaggle learning, following along step by step with other people's work.


In the previous post, [Kaggle Getting Started] Titanic: Machine Learning from Disaster ---- Model Optimization Attempt (Part 1), I dropped the Embarked feature, but unfortunately the result got worse. Sticking to the principle of going from simple to complex, I will keep trying some simple feature additions, removals, and combinations here.

The dataset contains two features, SibSp and Parch, that I have never understood very well or known how to use. Here I will try combining them into a single kinsfolk (relatives on board) feature and dropping the two original features; there is a quick exploratory check of this idea right after the data is loaded below.

import pandas as pd
import numpy as np
from pandas import Series,DataFrame
data_train = pd.read_csv("data/train.csv")
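
Before rebuilding the whole pipeline, it is worth a quick look at how a combined family-size column relates to survival. The snippet below is just an exploratory sketch on the freshly loaded data_train, not part of the pipeline itself:

# Exploratory check: survival rate and passenger count grouped by SibSp + Parch.
family_size = data_train['SibSp'] + data_train['Parch']
print(data_train.groupby(family_size)['Survived'].agg(['mean', 'count']))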

Fill in the missing values following the post [Kaggle Getting Started] Titanic: Machine Learning from Disaster ---- Simple Data Processing.
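
Before filling anything in, a quick count of the missing values per column (a small check of my own, not from that post) shows which columns need attention, namely Age, Cabin, and Embarked:

# Count the missing values in each column of the training set.
print(data_train.isnull().sum())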

# Fill the two missing Embarked values with the most common port of embarkation, 'S'.
data_train.loc[data_train.Embarked.isnull(), 'Embarked'] = 'S'
from sklearn.ensemble import RandomForestRegressor

def set_missing_ages(df):
    # Use the numeric columns to predict the missing Age values.
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]

    age_notnull = age_df[age_df.Age.notnull()].values
    age_isnull = age_df[age_df.Age.isnull()].values

    # First column is Age (the target); the remaining columns are the predictors.
    y = age_notnull[:, 0]
    X = age_notnull[:, 1:]

    # Fit a random forest regressor and predict the missing ages.
    model = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    model.fit(X, y)

    predictions = model.predict(age_isnull[:, 1:])
    df.loc[df.Age.isnull(), 'Age'] = predictions

    return df, model

data_train, rfr = set_missing_ages(data_train)

def set_cabin_type(df):
    # Reduce Cabin to a yes/no flag: whether a cabin number was recorded at all.
    df.loc[df.Cabin.notnull(), 'Cabin'] = 'Yes'
    df.loc[df.Cabin.isnull(), 'Cabin'] = 'No'

    return df

data_train = set_cabin_type(data_train)
dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix='Cabin')
dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(data_train['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix='Pclass')

df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
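
As a quick sanity check (my own addition), printing the remaining column names confirms that the one-hot columns were created and the original categorical columns were dropped:

# List the columns left after one-hot encoding and dropping the originals.
print(df.columns.tolist())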
import sklearn.preprocessing as preprocessing

# Standardize Age and Fare. Use a separate scaler for each column so the
# parameters fitted on the training data can be reused on the test set later.
age_scaler = preprocessing.StandardScaler()
np_data_age = np.array(df['Age']).reshape(-1, 1)
df['Age_scaled'] = age_scaler.fit_transform(np_data_age)[:, 0]

fare_scaler = preprocessing.StandardScaler()
np_data_fare = np.array(df['Fare']).reshape(-1, 1)
df['Fare_scaled'] = fare_scaler.fit_transform(np_data_fare)[:, 0]

# Combine SibSp and Parch into a single kinsfolk (family on board) feature.
df['kinsfolk'] = df['SibSp'] + df['Parch']
from sklearn import linear_model

# Keep only the one-hot, scaled, and engineered columns for training.
train_df = df.filter(regex='Survived|Age_.*|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*|kinsfolk')
train_np = train_df.values

# First column is Survived (the label); the rest are the features.
y = train_np[:, 0]
X = train_np[:, 1:]

# liblinear is specified because the default solver does not support the L1 penalty.
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6, solver='liblinear')
clf.fit(X, y)
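
To get a rough sense of how much weight the model gives the new kinsfolk feature compared with the other inputs, the fitted coefficients can be paired with the column names. This is only an inspection sketch, not part of the pipeline:

# Pair each input column (everything after Survived) with its fitted coefficient.
coef_df = pd.DataFrame({'feature': train_df.columns[1:], 'coef': clf.coef_[0]})
print(coef_df)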
data_test = pd.read_csv('data/test.csv')
# The test set has one missing Fare value; fill it with 0.
data_test.loc[data_test.Fare.isnull(), 'Fare'] = 0

# Impute missing test-set ages with the random forest fitted on the training set.
test_df = data_test[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = test_df[data_test.Age.isnull()].values

X = null_age[:, 1:]
predictedAges = rfr.predict(X)
data_test.loc[data_test.Age.isnull(), 'Age'] = predictedAges

data_test = set_cabin_type(data_test)
dummies_test_Cabin = pd.get_dummies(data_test['Cabin'], prefix='Cabin')
dummies_test_Embarked = pd.get_dummies(data_test['Embarked'], prefix='Embarked')
dummies_test_Sex = pd.get_dummies(data_test['Sex'], prefix='Sex')
dummies_test_Pclass = pd.get_dummies(data_test['Pclass'], prefix='Pclass')

df_test = pd.concat([data_test, dummies_test_Cabin, dummies_test_Embarked, dummies_test_Sex, dummies_test_Pclass], axis=1)
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
# Scale the test-set Age and Fare with the scalers fitted on the training data.
np_test_data_age = np.array(df_test['Age']).reshape(-1, 1)
df_test['Age_scaled'] = age_scaler.transform(np_test_data_age)[:, 0]
np_test_data_fare = np.array(df_test['Fare']).reshape(-1, 1)
df_test['Fare_scaled'] = fare_scaler.transform(np_test_data_fare)[:, 0]
df_test['kinsfolk'] = df_test['SibSp'] + df_test['Parch']
test = df_test.filter(regex='Age_.*|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*|kinsfolk')
predictions = clf.predict(test)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'].values, 'Survived': predictions.astype(np.int32)})
result.to_csv('titanic_predictions.csv', index=False)
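
A quick preview (again my own check) confirms the submission file has the two columns Kaggle expects, PassengerId and Survived:

# Preview the first few rows of the submission file.
print(result.head())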

I submitted the result to Kaggle again: 0.76076. Quite discouraging.
