A Beginner's Guide to Python Machine Learning (3): Regression


@sprt
*A note before we begin: I had no programming experience at all before starting to learn machine learning and Python. This series was written one month into studying the field and records my learning process entirely from an absolute beginner's perspective. The code is unoptimized and for reference only; corrections are very welcome.
System: Win7, CPU only;
Environment: Anaconda2 / Python 2.7, IDE: PyCharm 5;
References:
《Neural Networks and Learning Machines (Third Edition)》- Simon Haykin;
《Machine Learning in Action》- Peter Harrington;
《Building Machine Learning Systems with Python》- Willi Richert;
All of them can be found on CSDN, and Chinese translations exist as well.
I feel very lucky to follow my teacher into the world of machine learning, starting from the most basic material.*


This lesson introduces the concept of 'regression', whose main purpose is to make predictions on continuous data. Regression can do a great many things, and plenty of expert blogs on CSDN cover it in detail, so I won't repeat that here. What I want to do is learn how to implement it from the most basic level. A few links for reference:
http://blog.csdn.net/dark_scope/article/details/8558244
http://blog.csdn.net/zouxy09/article/details/17589329 (an expert's column with a great deal worth learning from)

First, the tasks for this lesson:

•Play with least squares: for the 3 true models below (Gaussian white noise (0,1), x taking N=100 values uniformly in [-3,3]), fit each with 3 least-squares models (degree 1, 2 and 3) and plot the scatter points and fitted curves (9 plots in total; see the data-generation sketch after this list).

•Add noise (0, 0.2) to X, repeat the above, and observe how the results change.
•Add (0, 3) as an observed sample, repeat the above, and observe how the results change.
•Add (4, 0) as an observed sample, repeat the above, and observe how the results change.
•Prove: Bayesian or MAP estimation for linear regression is equivalent to regularization of MSE
•We will give you a dataset RegData2D; use whatever you have learnt to find the best relationship between x and y. Write a report, and pay specific attention to underfitting and overfitting and how you arrive at your best solution.
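
A minimal sketch of how the observation data described above might be generated (my reading of the setup: N = 100 points uniformly spaced in [-3, 3], Gaussian white noise with mean 0 and standard deviation 1 added to y):

import numpy as np

N = 100
x = np.linspace(-3, 3, N)                      # N values uniformly spaced in [-3, 3]

# the three true models from the task, each with Gaussian white noise (0, 1)
y1 = 2 * x + 1 + np.random.normal(0, 1, N)
y2 = 0.01 * x ** 2 + 2 * x + 1 + np.random.normal(0, 1, N)
y3 = 0.1 * np.exp(0.1 * x) + 2 * x + 1 + np.random.normal(0, 1, N)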

In previous assignments we only had to write the code, run it, and look at the results; from this lesson on, the teacher wants us to learn to write reports. In his words, "some people can't get the thing to work, but a good report will still earn respect."

Polynomial fitting by least squares is a very common regression method. The basic idea is to minimize the sum of squared deviations between the fitted curve and the data (why should minimizing the mean squared error count as the standard of 'good'? Well... the teacher asked this, and I don't dare give a casual answer). There are many introductions online; a definition and derivation are given here as well:
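
A compact sketch of that derivation, in the notation the code below uses (polynomial degree $m$, coefficients $w_0,\dots,w_m$, samples $(x_k, y_k)$, $k=1,\dots,N$). The objective is

$$E(w)=\sum_{k=1}^{N}\Big(\sum_{j=0}^{m}w_j x_k^{\,j}-y_k\Big)^2 .$$

Setting $\partial E/\partial w_i=0$ for each $i$ gives the normal equations

$$\sum_{j=0}^{m}\Big(\sum_{k=1}^{N}x_k^{\,i+j}\Big)w_j=\sum_{k=1}^{N}y_k x_k^{\,i},\qquad i=0,\dots,m,$$

i.e. $Aw=b$ with $A_{ij}=\sum_k x_k^{\,i+j}$ and $b_i=\sum_k y_k x_k^{\,i}$, which is exactly the linear system the code builds and hands to np.linalg.solve.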


The code is below. Python actually has least-squares functions that can be called directly, but to understand things clearly I typed it out by hand:

import matplotlib.pyplot as plt
import numpy as np
import math

#Sample points on the generating curve
#(np.arange(-3,3,0.1) gives 60 points; the task asks for N=100, so shrink the step if needed)
x = np.arange(-3,3,0.1)

def least_square_color1(y,order,error_x,error_y):
    #Add a random Gaussian offset to each point on the generating curve,
    #then fit a polynomial with order+1+1 coefficients (i.e. degree order+1)
    i=0
    xa=[]
    ya=[]
    for xx in x:
        yy=y[i]
        d_x=np.random.normal(0,error_x)
        d_y=np.random.normal(0,error_y)
        i+=1
        xa.append(xx+d_x)
        ya.append(yy+d_y)

    plt.plot(xa,ya,color='m',linestyle='',marker='.')

    #Curve fitting: build the normal equations A*w = B,
    #with A[i][j] = sum_k xa[k]^(i+j) and B[i] = sum_k ya[k]*xa[k]^i
    matA=[]
    for i in range(0,order+1+1):
        matA1=[]
        for j in range(0,order+1+1):
            tx=0.0
            for k in range(0,len(xa)):
                dx=1.0
                for l in range(0,j+i):
                    dx=dx*xa[k]
                tx+=dx
            matA1.append(tx)
        matA.append(matA1)
    matA=np.array(matA)

    matB=[]
    for i in range(0,order+1+1):
        ty=0.0
        for k in range(0,len(xa)):
            dy=1.0
            for l in range(0,i):
                dy=dy*xa[k]
            ty+=ya[k]*dy
        matB.append(ty)
    matB=np.array(matB)

    matAA=np.linalg.solve(matA,matB)

    #Draw the fitted curve
    xxa= np.arange(-3,3,0.01)
    yya=[]
    for i in range(0,len(xxa)):
        yy=0.0
        for j in range(0,order+1+1):
            dy=1.0
            for k in range(0,j):
                dy*=xxa[i]
            dy*=matAA[j]
            yy+=dy
        yya.append(yy)
    plt.plot(xxa,yya,color='r',linestyle='-',marker='')

def least_square_color2(y,order,error_x,error_y):
    #Same as least_square_color1, but drawn in different colors (blue dots, green line)
    #Add a random Gaussian offset to each point on the generating curve
    i=0
    xa=[]
    ya=[]
    for xx in x:
        yy=y[i]
        d_x=np.random.normal(0,error_x)
        d_y=np.random.normal(0,error_y)
        i+=1
        xa.append(xx+d_x)
        ya.append(yy+d_y)

    plt.plot(xa,ya,color='b',linestyle='',marker='.')

    #Curve fitting
    matA=[]
    for i in range(0,order+1+1):
        matA1=[]
        for j in range(0,order+1+1):
            tx=0.0
            for k in range(0,len(xa)):
                dx=1.0
                for l in range(0,j+i):
                    dx=dx*xa[k]
                tx+=dx
            matA1.append(tx)
        matA.append(matA1)
    matA=np.array(matA)

    matB=[]
    for i in range(0,order+1+1):
        ty=0.0
        for k in range(0,len(xa)):
            dy=1.0
            for l in range(0,i):
                dy=dy*xa[k]
            ty+=ya[k]*dy
        matB.append(ty)
    matB=np.array(matB)

    matAA=np.linalg.solve(matA,matB)

    #Draw the fitted curve
    xxa= np.arange(-3,3,0.01)
    yya=[]
    for i in range(0,len(xxa)):
        yy=0.0
        for j in range(0,order+1+1):
            dy=1.0
            for k in range(0,j):
                dy*=xxa[i]
            dy*=matAA[j]
            yy+=dy
        yya.append(yy)
    plt.plot(xxa,yya,color='g',linestyle='-',marker='')

orders=[1,2,3]
error_y=1
error_x_mat=[0.1,0.2]
'''
x with noise N(0,0.1): magenta dots, red lines
x with noise N(0,0.2): blue dots, green lines
'''
#the loop index order = 0,1,2; the fitting functions use order+1+1 coefficients,
#so each column of subplots corresponds to a fit of degree order+1
for order in range(len(orders)):
    for error_x in error_x_mat:

        plt.subplot(330+order+1)
        plt.ylabel('order={}'.format(order+1))
        plt.title('y=2x+1',fontsize=10)
        y = [2*a+1 for a in x]
        if error_x==0.1:
            least_square_color1(y,order,error_x,error_y)
        else:
            least_square_color2(y,order,error_x,error_y)

        plt.subplot(333+order+1)
        plt.ylabel('order={}'.format(order+1))
        plt.title('$y=0.01x^{2}+2x+1$',fontsize=10)
        y = [0.01*a*a+2*a+1 for a in x]
        if error_x==0.1:
            least_square_color1(y,order,error_x,error_y)
        else:
            least_square_color2(y,order,error_x,error_y)

        plt.subplot(336+order+1)
        plt.ylabel('order={}'.format(order+1))
        plt.title('$y=0.1e^{0.1x}+2x+1$',fontsize=10)
        y = [0.1*(math.e**(0.1*a))+2*a+1 for a in x]
        if error_x==0.1:
            least_square_color1(y,order,error_x,error_y)
        else:
            least_square_color2(y,order,error_x,error_y)

plt.show()
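
As a cross-check of the hand-written version, the same kind of fit can be done with the library routine mentioned earlier. A minimal sketch (assuming noisy samples of y = 2x + 1; np.polyfit returns coefficients from the highest degree down, and np.poly1d turns them into a callable polynomial):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-3, 3, 0.1)
y = 2 * x + 1 + np.random.normal(0, 1, size=len(x))   # noisy samples of y = 2x + 1

coeffs = np.polyfit(x, y, 3)      # least-squares fit of degree 3
poly = np.poly1d(coeffs)          # callable fitted polynomial

plt.plot(x, y, 'm.')              # scatter of the noisy samples
xx = np.arange(-3, 3, 0.01)
plt.plot(xx, poly(xx), 'r-')      # fitted curve
plt.show()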

The results:

As required, the nine plots are drawn in one large figure; the different dot and line colors correspond to different noise added to X, while the noise on y is unchanged. The fitted curves differ quite a bit across polynomial orders, and when the order is too high overfitting becomes obvious; we will discuss this in detail below.

Adding outliers to the original data:

The next task is to fit the so-called 'RegData2D' dataset. Plotted, it looks like the figure below, and my gut feeling is that a second-order polynomial is enough (the red curve):

But let's follow the requirements step by step... The ideas behind Bayesian maximum a posteriori estimation and maximum-likelihood estimation were both derived carefully in my previous post, so here I go straight to the code:

First, maximum-likelihood estimation, modelled on the opening example in 《Building Machine Learning Systems with Python》- Willi Richert. It turns out that using the built-in library really is far more efficient than the hand-written version above:

# -*- coding:gb2312 -*-
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
# read every line of the txt file into data
a = []
with open('RegData2D.txt', 'r') as f:
    data = f.readlines()
    for line in data:
        odom = line.split()
        numbers_float = list(map(float, odom))
        a.append(numbers_float)
a = np.array(a)
# first 500 samples as the training set, last 500 as the test set
x_train = a[0:500, 0]
y_train = a[0:500, 1]
x_test = a[500:, 0]
y_test = a[500:, 1]

# draw data
plt.figure(1)
plt.scatter(x_train, y_train, s=5)
#
def error(f, x, y):
    return sp.sum((f(x) - y) ** 2)
# polynomial fits of increasing degree

# degree 1
fp1, res1, rank1, sv1, rcond1 = sp.polyfit(x_train, y_train, 1, full=True)
f1 = sp.poly1d(fp1)
# print f1
# degree 2
fp2, res2, rank2, sv2, rcond2 = sp.polyfit(x_train, y_train, 2, full=True)
f2 = sp.poly1d(fp2)
# print f2
# degree 3
fp3, res3, rank3, sv3, rcond3 = sp.polyfit(x_train, y_train, 3, full=True)
f3 = sp.poly1d(fp3)
# print f3
# degree 10
fp10, res10, rank10, sv10, rcond10 = sp.polyfit(x_train, y_train, 10, full=True)
f10 = sp.poly1d(fp10)
# print f10
# degree 100
fp100, res100, rank100, sv100, rcond100 = sp.polyfit(x_train, y_train, 100, full=True)
f100 = sp.poly1d(fp100)
# print f100

fx = sp.linspace(-1, 5, 1000)
plt.plot(fx, f1(fx), linewidth=1)
#plt.legend("%d" % f1.order, loc="upper left")
plt.plot(fx, f2(fx), linewidth=1)
#plt.legend("d= %d" % f2.order, loc="upper left")
plt.plot(fx, f3(fx), linewidth=1)
#plt.legend("d= %d" % f3.order, loc="upper left")
plt.plot(fx, f10(fx), linewidth=1)
#plt.legend("d= %d" % f10.order, loc="upper left")
plt.plot(fx, f100(fx), linewidth=1)
#plt.legend("d= %d" % f100.order, loc="upper left")

# plot the train/test error curves
plt.figure(2)
error_train = []
error_test = []
order = []
for i in range(3):
    order.append(i + 1)
for j in order:
    fp, res, rank, sv, rcond = sp.polyfit(x_train, y_train, j, full=True)
    f = sp.poly1d(fp)
    #print("Train Error of the model:", j, error(f, x_train, y_train))
    print("Test Error of the model:", j, error(f, x_test, y_test))
    error_numtrain = error(f, x_train, y_train)
    error_numtest = error(f, x_test, y_test)
    error_train.append(error_numtrain)
    error_test.append(error_numtest)
plt.plot(order, error_train, 'r--')
plt.plot(order, error_test, 'g-')
plt.title("Red: train error. Green: test error")

plt.show()

I really was lazy here; even building a list with a loop felt like a chore... ugly indeed. Analysis of the results:

[1]. Underfitting: polynomial order 1:

A good fit: polynomial order 2:

Overfitting: polynomial order 100:

Because of the regularity of the RegData2D dataset, underfitting of the regression curve is only really obvious at order = 1; as the order increases the curve changes very little, and only when the order gets up to 100 or so does overfitting become fairly obvious. In the figure below the regression curves for order = 2, 3 and 10 almost coincide:

[2]. Take the first 500 samples of RegData2D as the training data and the last 500 as the test data, let the polynomial order run from 1 to 100, and plot the error curves:

It is clear that the squared error is only really large at order = 1 and changes little afterwards, with the test error above the training error. Looking at higher orders, from order = 2 onwards the training error keeps decreasing while the test error gradually rises:

Conclusion: order = 2 gives the best regression curve: $y = 0.5581x^{2} - 1.056x - 0.9284$

Next, add the prior information on top of maximum likelihood: $w = (X^{T}X + aI)^{-1}X^{T}y$. It turns out we only need to add a little to the least-squares code we started with; we then try different values of a and look for the one that minimizes the training error.
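
A quick sketch of why MAP estimation leads to exactly this formula (assuming Gaussian noise with variance $\sigma^{2}$ and an independent zero-mean Gaussian prior with variance $\tau^{2}$ on each weight):

$$\hat w_{\mathrm{MAP}}=\arg\max_w\ p(y\mid X,w)\,p(w)=\arg\min_w\ \lVert Xw-y\rVert^{2}+a\lVert w\rVert^{2},\qquad a=\sigma^{2}/\tau^{2},$$

so MAP is just the MSE objective plus an L2 penalty (the equivalence the task asks us to prove). Setting the gradient to zero gives $(X^{T}X+aI)w=X^{T}y$, i.e. the formula above. In code this only changes the normal-equation solve, which gains the extra a*I term: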

# -*- coding:gb2312 -*-
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp

class regression(object):

    def __init__(self, data_x, data_y, data_z, order, x_left_limit, x_right_limit, a):
        self.data_x = data_x
        self.data_y = data_y
        self.data_z = data_z
        self.order = order
        self.x_left_limit = x_left_limit
        self.x_right_limit = x_right_limit
        self.a = a

    # ridge regression (least squares with an L2 penalty of strength a)
    def least_square(self):
        matA = []
        for i in range(0, self.order + 1):
            matA1 = []
            for j in range(0, self.order + 1):
                tx = 0.0
                for k in range(0, len(self.data_x)):
                    dx = 1.0
                    for l in range(0, j + i):
                        dx = dx * self.data_x[k]
                    tx += dx
                matA1.append(tx)
            matA.append(matA1)
        matA = np.array(matA)


        # print matA.shape

        matB = []
        for i in range(0, self.order + 1):
            ty = 0.0
            for k in range(0, len(self.data_x)):
                dy = 1.0
                for l in range(0, i):
                    dy = dy * self.data_x[k]
                ty += self.data_y[k] * dy
            matB.append(ty)
        matB = np.array(matB)

        I = np.identity(self.order + 1)

        matAA = np.linalg.solve(matA + self.a * I, matB)
        # print matAA

        def error(f, x, y):
            return sp.sum((f(x) - y) ** 2)
        matAAlist = list(matAA)
        matAAlist.reverse()
        matAAuse = np.array(matAAlist)
        f = sp.poly1d(matAAuse)
        print("Train Error of the model:", {self.a}, error(f, self.data_x, self.data_y))

        # Draw the fitted curve
        xxa = np.arange(self.x_left_limit, self.x_right_limit, 0.01)
        yya = []
        for i in range(0, len(xxa)):
            yy = 0.0
            for j in range(0, self.order + 1):
                dy = 1.0
                for k in range(0, j):
                    dy *= xxa[i]
                dy *= matAA[j]
                yy += dy
            yya.append(yy)
        # plt.plot(xxa, yya, 'r-', linewidth = 2.0)


        return matAA, xxa, yya, error(f, self.data_x, self.data_y)

The class above is saved as class_regression.py; the script below imports it and runs the fit:

# -*- coding:gb2312 -*-
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from class_regression import *

# plot the samples in the RegData2D dataset
def print_RegData2D():
    data = sp.genfromtxt("RegData2D.txt")
    RegData2D_x = data[:, 0]
    RegData2D_y = data[:, 1]
    plt.scatter(RegData2D_x, RegData2D_y)
    plt.xlabel("X")
    plt.ylabel("Y")
    x_min = min(RegData2D_x)
    x_max = max(RegData2D_x)

    return RegData2D_x, RegData2D_y, x_min, x_max

# least squares
x, y, x_min, x_max = print_RegData2D()



# regularization strengths a to sweep
# (the figures below were produced with order = 2 and a ranging over 0..100)
a = []
for i in range(1):
    a.append(i)
error_a = []
for j in range(len(a)):
    RegData2D = regression(x, y, [], 1, x_min, x_max, a[j])
    pra, fx, fy, error = RegData2D.least_square()
    error_a.append(error)

    plt.plot(fx, fy, 'r-', linewidth = 2)

plt.figure(2)
plt.plot(a, error_a, 'r-')
'''
    if order[i] == 2:
        plt.plot(fx, fy, 'k-', linewidth = 1.0)


for i in range(len(order)):
    RegData2D = regression(x, y, [], order[i], x_min, x_max, 100)
    pra, fx, fy = RegData2D.least_square()
    if order[i] == 1:
        plt.plot(fx, fy, 'r-', linewidth = 1.0)
    elif order[i] == 2:
        plt.plot(fx, fy, 'r-', linewidth = 1.0)
    elif order[i] == 3:
        plt.plot(fx, fy, 'g-', linewidth = 2.0)
    elif order[i] == 5:
        plt.plot(fx, fy, 'y-', linewidth = 2.0)
    elif order[i] == 10:
        plt.plot(fx, fy, 'k-', linewidth = 2.0)

    elif order[i] == 100:
        plt.plot(fx, fy, 'w-', linewidth = 2.0)
    '''
plt.show()

Using the entire RegData2D dataset as the training set, with order = 2 and a taking values in [0, 100], we can see that the value of a has little effect on the regression curve:

The regression curves almost coincide; the training-error curve is shown below, where a = 0 corresponds to the maximum-likelihood estimate:

Conclusion: for the RegData2D dataset, order = 2 and a = 0 give the best regression curve.
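
As a cross-check of that conclusion, the same sweep over a can be written directly with numpy, independent of the class above (a minimal sketch; it assumes RegData2D.txt is in the working directory and builds the order-2 design matrix by hand):

import numpy as np

data = np.genfromtxt("RegData2D.txt")
x, y = data[:, 0], data[:, 1]

# design matrix for order = 2: columns 1, x, x^2
X = np.column_stack([np.ones_like(x), x, x ** 2])

for a in [0, 1, 10, 100]:
    w = np.linalg.solve(X.T.dot(X) + a * np.identity(3), X.T.dot(y))
    train_error = np.sum((X.dot(w) - y) ** 2)
    print("a = {}: w = {}, train error = {}".format(a, w, train_error))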

Looking at the dataset purely as a polynomial-fitting problem, the polynomial we obtain is still far from the expression the teacher gave; many more techniques are needed to further reduce the error and refine the function, which we will discuss later.
