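Supervised Learning in Practice -- Modeling the Titanic Data

`Supervised learning modeling workflow`

Supervised learning consists of data collection, data processing, model building and training, model testing, and parameter tuning. It is an iterative process with continuous feedback between the stages.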

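In this article we build a model on the Titanic data, looking for patterns between each passenger's attributes and whether they were rescued, and then use those patterns to predict on our test set. (Whether the passenger was rescued is the label; the other passenger attributes are the features.)
*Data source*
The Titanic dataset is the classic starter dataset on Kaggle, the machine-learning competition platform. It records information about the passengers of the sunken Titanic and whether each one was rescued. Load the data: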

import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv('../p1_demo/titanic_dataset.csv',na_values='NULL')

*Visual analysis*
Before processing the data, visualizing it gives us a good understanding of what it contains.

## Plot the counts of rescued vs. not-rescued passengers
df['survived'].hist()

### Rescue outcome by passenger class
pd.crosstab(df['pclass'],df['survived']).plot(kind='bar')
pd.crosstab(df['pclass'],df['survived'])
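
Raw counts are hard to compare when the classes differ in size. As a minimal sketch, using made-up toy data rather than the real file, the same crosstab can be row-normalised so that each row reads as a rescue rate. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data, invented for illustration (not the real Titanic file)
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Raw counts for each (pclass, survived) pair
counts = pd.crosstab(toy["pclass"], toy["survived"])

# Row-normalised: each row sums to 1, so column 1 is the rescue rate per class
rates = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")

print(counts)
print(rates.round(2))
```

On the real data, the same `normalize="index"` argument can be added to the crosstab call above.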

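*Data processing*
This step covers outlier handling, missing-value imputation, noise filtering, data-format conversion, and feature extraction.
The raw Titanic data cannot be used by the algorithm directly, and continuous features such as age and fare need to be converted into discrete features like the rest.

The detailed processing code is as follows: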

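## Name: use its length as a feature
df['Name_length']=df['name'].apply(len)

## Siblings/spouses and parents/children: family size, counting the passenger
df['FamilySize']=df['sibsp']+df['parch']+1
df['IsAlone']=0
df.loc[df['FamilySize']==1,'IsAlone']=1

## Fare: fill missing values with the median, then cut into 4 equal-width bins
df['fare']=df['fare'].fillna(df['fare'].median())
df['categoricalFare']=pd.cut(df['fare'],4)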

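## Age: fill missing values with random integers within one standard deviation of the mean
age_avg=df['age'].mean()
age_std=df['age'].std()
age_null_count=df['age'].isnull().sum()
age_null_random_list=np.random.randint(int(age_avg-age_std),int(age_avg+age_std),size=age_null_count)
## Assign through .loc to avoid chained indexing, which may write to a copy
df.loc[df['age'].isnull(),'age']=age_null_random_list
df['age'] = df['age'].astype(int)
df['categoricalage'] = pd.cut(df['age'], 5)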


## Name: extract the title
def get_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)  ## match the title (Miss, Master, etc.)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

df['Title'] = df['name'].apply(get_title)
df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')


# Map titles to the integers 1-5; anything unmapped becomes 0 below
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
df['Title'] = df['Title'].map(title_mapping)
df['Title'] = df['Title'].fillna(0)

# Mapping Sex
df['sex'] = df['sex'].map({'female': 0, 'male': 1} ).astype(int)

# Mapping Fare
df.loc[ df['fare'] <= 7.91, 'fare'] = 0
df.loc[(df['fare'] > 7.91) & (df['fare'] <= 14.454), 'fare'] = 1
df.loc[(df['fare'] > 14.454) & (df['fare'] <= 31), 'fare']   = 2
df.loc[ df['fare'] > 31, 'fare']  = 3
df['fare'] = df['fare'].astype(int)

# Mapping Age
df.loc[ df['age'] <= 16, 'age']  = 0
df.loc[(df['age'] > 16) & (df['age'] <= 32), 'age'] = 1
df.loc[(df['age'] > 32) & (df['age'] <= 48), 'age'] = 2
df.loc[(df['age'] > 48) & (df['age'] <= 64), 'age'] = 3
df.loc[ df['age'] > 64, 'age'] = 4
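
The chained `.loc` assignments above turn continuous values into bin codes. As a hedged alternative sketch, `pd.cut` with explicit edges and `labels=False` should give the same codes in one call. The sample values below are made up; the bin edges are the ones used in the mapping code.

```python
import pandas as pd

# Made-up sample values that include the bin edges
ages = pd.Series([4, 16, 17, 32, 40, 64, 80])
fares = pd.Series([7.25, 7.91, 10.0, 14.454, 31.0, 71.28])

# Right-closed intervals: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_bins = [-float("inf"), 16, 32, 48, 64, float("inf")]
fare_bins = [-float("inf"), 7.91, 14.454, 31, float("inf")]

# labels=False returns the integer index of the bin instead of an interval
age_codes = pd.cut(ages, bins=age_bins, labels=False)
fare_codes = pd.cut(fares, bins=fare_bins, labels=False)

print(age_codes.tolist())
print(fare_codes.tolist())
```

Because the intervals are closed on the right, a value equal to an edge (for example age 16) falls into the lower bin, matching the `<=` comparisons above.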

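*Data splitting*
Simple split: divide the data into training data and test data. The training data is what the algorithm learns from; the test data is used to evaluate the trained model.
k-fold cross-validation split: on top of the simple split, divide the training data further into k parts. Each time, hold out one part and train on the remaining k-1 parts, which yields k models. All k models then predict on the test set, and their predictions are combined into the final result.

Splitting code: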

## Convert the data into arrays, the form the algorithm expects

x=df[["pclass","sex","age","fare","Name_length","FamilySize","IsAlone","Title"]].values
y=df["survived"].values
PassengerId=list(df.index)
print(x.shape)
from sklearn.model_selection import KFold

### Training set : test set = 1200 : 109
train_x=x[:1200]
train_y=y[:1200]
test_x=x[1200:]
test_y=y[1200:]

test_PassengerId=PassengerId[1200:]

### Sizes of the training data and the test data
ntrain=train_x.shape[0]
ntest=test_x.shape[0]

### 5-fold cross-validation split
NFOLDS = 5
SEED=0
## random_state only takes effect when shuffle=True.
## kf is stored as a list of (train_index, test_index) pairs so it can be iterated directly.
kf = list(KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED).split(train_x))
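
To make the k-fold split described above concrete, here is a minimal sketch on ten made-up samples. The names `toy_x`, `kf_demo` and `folds` are illustrative only, and shuffling is left off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_x = np.arange(10).reshape(-1, 1)  # 10 made-up samples, one feature each

kf_demo = KFold(n_splits=5)  # no shuffling, so each fold is a contiguous block
folds = list(kf_demo.split(toy_x))

# Each fold holds out 2 samples for validation and trains on the other 8
for fold_id, (train_idx, valid_idx) in enumerate(folds):
    print(fold_id, "train:", train_idx, "validate:", valid_idx)
```

Every sample lands in the validation part of exactly one fold, which is what lets cross-validation produce a prediction for each training sample from a model that never saw it.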

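*Algorithm selection and model training*
sklearn includes the commonly used machine-learning algorithms, and each one can be used through a single class. For example, to choose logistic regression: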

from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()

The cross-validated training procedure:

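### Cross-validation
def get_oof(clf, train_x, train_y, test_x):
    """Set up the result arrays.
    oof_train: predictions on the training data (each sample predicted by the fold that held it out)
    oof_test: predictions on the test data
    oof_test_skf: test-set predictions from each of the 5 cross-validation models"""
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))  ## an uninitialised NFOLDS x ntest array

    ## Train on each split, then predict on the held-out part and on the test data
    for i, (train_index, test_index) in enumerate(kf):
        x_tr = train_x[train_index]
        y_tr = train_y[train_index]
        x_te = train_x[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(test_x)

    ## Average along axis 0 to get the mean of the 5 predictions
    oof_test[:] = oof_test_skf.mean(axis=0)

    ### Return the predictions for the validation folds and for the test set
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)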
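
get_oof reads ntrain, ntest, NFOLDS and kf from module scope. As a hedged sketch of the same out-of-fold idea without those globals, the hypothetical helper `out_of_fold_predict` below builds its own splitter and is run on synthetic data. The helper name, its parameters and the data are assumptions for illustration, not part of the original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def out_of_fold_predict(model, x_train, y_train, x_test, n_splits=5, seed=0):
    """Hypothetical helper: out-of-fold train predictions plus averaged test predictions."""
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof_train = np.zeros(len(x_train))
    test_preds = np.empty((n_splits, len(x_test)))
    for i, (tr_idx, va_idx) in enumerate(splitter.split(x_train)):
        model.fit(x_train[tr_idx], y_train[tr_idx])
        # Each training sample is predicted by the one model that did not see it
        oof_train[va_idx] = model.predict(x_train[va_idx])
        test_preds[i] = model.predict(x_test)
    # Average the class votes of the n_splits models for each test sample
    return oof_train, test_preds.mean(axis=0)


# Synthetic two-feature data with a simple linear decision rule
rng = np.random.default_rng(0)
x_all = rng.normal(size=(120, 2))
y_all = (x_all[:, 0] + x_all[:, 1] > 0).astype(int)

oof, test_avg = out_of_fold_predict(
    LogisticRegression(), x_all[:100], y_all[:100], x_all[100:]
)
print(oof.shape, test_avg.shape)
```

The averaged test output is a fraction of votes between 0 and 1, not a class label, which is why a thresholding step is still needed afterwards.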

*Predicting on the test set with the model*

oof_train, oof_test = get_oof(clf,train_x, train_y, test_x)
def process_pre(pre_y):
    for i in range(pre_y.shape[0]):
        if pre_y[i][0]>0.5:
            pre_y[i][0]=1
        else:
            pre_y[i][0]=0
    pre_y=pre_y.ravel()
    return pre_y

## The model's predictions on the test set
pre_y=process_pre(oof_test)
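
The loop in process_pre turns averaged fold votes into class labels. A minimal vectorised sketch of the same thresholding, on made-up values, might look like this (the array `avg_votes` is an assumption for illustration):

```python
import numpy as np

# Made-up averaged fold votes, shaped (n, 1) like the test output of get_oof
avg_votes = np.array([[0.0], [0.4], [0.6], [1.0], [0.5]])

# Strictly greater than 0.5 becomes class 1, everything else class 0
labels = (avg_votes > 0.5).astype(int).ravel()
print(labels)
```

Unlike the loop, this leaves the input array unchanged and returns a new integer array.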

*Model performance analysis*
1) Compare the predictions with the actual values visually

plt.figure(figsize=(15,5))
plt.scatter(test_PassengerId,pre_y,color='green')
plt.scatter(test_PassengerId,test_y,color='blue')
plt.title('Predicted value VS True value')

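2) Analyze model performance with evaluation metrics
1. Accuracy: the fraction of all samples that are predicted correctly.
For the class we are interested in:
2. Precision: of the samples predicted as that class, the fraction that truly belong to it.
3. Recall: of the samples that truly belong to that class, the fraction the model finds.
4. F1-score: the harmonic mean of precision and recall, a single measure that balances the two.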
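
To tie the four definitions above to numbers, here is a small sketch that computes them by hand for one positive class. The function `binary_metrics` and the label lists are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper: accuracy, precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy_demo, precision_demo, recall_demo, f1_demo = binary_metrics(y_true_demo, y_pred_demo)
print(accuracy_demo, precision_demo, recall_demo, f1_demo)
```

In this example 6 of the 8 samples are correct, and precision and recall are both 3/4, so all four metrics come out the same. On real data they usually differ.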

from sklearn.metrics import classification_report,confusion_matrix

n_classes=2
## Accuracy
accuracy=np.mean(pre_y==test_y)
print('accuracy:'+str(accuracy))
## Confusion matrix
cmx=confusion_matrix(test_y,pre_y)
print(cmx)
## Precision and recall
performance=classification_report(test_y,pre_y,labels=range(n_classes))
print('Performance :'+'\n'+str(performance))

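A final note: with so little data, an accuracy above 80% is already a very good result. Many of the top entries on Kaggle do reach 100%, but they rely on special tricks that only apply under particular data, business, and environment conditions. The aim of this article is to give you a working grasp of the whole modeling workflow.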
