Supervised Learning | Nonlinear Regression: Polynomial Regression Principles and Sklearn Implementation

Original

2020-06-20 14:58

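Table of Contents

Machine Learning | Table of Contents

Machine Learning | Regression Evaluation Metrics

Supervised Learning | Linear Regression: Multiple Linear Regression Principles and Sklearn Implementation

Supervised Learning | Linear Regression: Regularized Linear Models Principles and Sklearn Implementation

Supervised Learning | Linear Classification: Logistic Regression Principles and Sklearn Implementation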

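1. Polynomial Regression

Nonlinear data can also be fitted with a linear model. A simple approach is to add powers of each feature as new features and then train a linear model on this extended feature set. This technique is called polynomial regression.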

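The regression model

$y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\varepsilon_i \tag{1}$

is called a second-order (quadratic) polynomial model in one variable, where $i=1,2,\cdots,n$.

To make the power of the independent variable attached to each regression coefficient explicit, the coefficients of a polynomial regression model are usually written as follows: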

$y_i=\beta_0+\beta_1x_i+\beta_{11}x_i^2+\varepsilon_i \tag{2}$

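The regression function of model (2), $y_i=\beta_0+\beta_1x_i+\beta_{11}x_i^2$, is a parabola and is usually called the quadratic regression function. The coefficient $\beta_1$ is called the linear effect coefficient, and $\beta_{11}$ the quadratic effect coefficient.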

Correspondingly, the regression model

$y_i=\beta_0+\beta_1x_i+\beta_{11}x_i^2+\beta_{111}x_i^3+\varepsilon_i \tag{3}$

is called a third-order (cubic) polynomial model in one variable.^[1]
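Because models (1) to (3) are linear in their coefficients, they can be estimated with ordinary least squares once the powers of $x$ are treated as separate columns. The following is a minimal sketch of that idea for model (2), assuming noise-free synthetic data; the names `design` and `beta` and the chosen coefficients are illustrative and do not come from the cited text.

```python
import numpy as np

# Illustrative synthetic data (an assumption): y = 2 + 1*x + 0.5*x^2 without noise,
# so the least-squares fit should recover these coefficients.
x = np.linspace(-3, 3, 50)
y = 2.0 + 1.0 * x + 0.5 * x**2

# Model (2) is linear in the coefficients: stack the columns 1, x, x^2
# into a design matrix and solve an ordinary least-squares problem.
design = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

beta_0, beta_1, beta_11 = beta
print(beta_0, beta_1, beta_11)
```

With noisy data the same code applies unchanged; the estimates then only approximate the true coefficients.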

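2. Sklearn Implementation

For nonlinear data, we use sklearn.preprocessing.PolynomialFeatures to expand the original features into polynomial terms, so that the relationship becomes linear in the expanded features. The regression can then be carried out with the same method as in "Supervised Learning | Linear Regression: Multiple Linear Regression Principles and Sklearn Implementation".

PolynomialFeatures(degree=2, interaction_only=False, include_bias=True, order='C')

Parameters: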

degree: integer

The degree of the polynomial features. Default = 2.

interaction_only: boolean, default = False

If true, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.).

include_bias: boolean

If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).

order: str in {'C', 'F'}, default 'C'

Order of output array in the dense case. 'F' order is faster to compute, but may slow down subsequent estimators.

Attributes:

powers_: array, shape (n_output_features, n_input_features)

powers_[i, j] is the exponent of the jth input in the ith output.

n_input_features_: int

The total number of input features.

n_output_features_: int

The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.
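The parameters and attributes listed above are easier to read on a tiny input. The sketch below, assuming a single sample with two arbitrarily chosen feature values, prints `powers_` and `n_output_features_` and contrasts the default with `interaction_only=True`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One illustrative sample with two features (values chosen arbitrarily).
X_demo = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=True)
X_out = poly.fit_transform(X_demo)

# Each row of powers_ holds the exponents of (x0, x1) for one output column.
print(poly.powers_)
print(poly.n_output_features_)
print(X_out)

# With interaction_only=True the pure powers x0^2 and x1^2 are not produced.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=True)
X_inter = inter.fit_transform(X_demo)
print(inter.powers_)
print(X_inter)
```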

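First, generate some nonlinear data from a quadratic function (with added random noise).

import numpy as np
import matplotlib.pyplot as plt

# Fix the seed so the numbers shown below are reproducible
np.random.seed(42)

m = 100                                          # number of samples
X = 6 * np.random.rand(m, 1) - 3                 # uniform in [-3, 3)
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)   # quadratic plus Gaussian noise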

plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.show()

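Figure 1: The generated nonlinear, noisy dataset

Clearly, a straight line will never fit this data properly. So we use the PolynomialFeatures class to transform the training data, adding the square (second-degree polynomial) of each feature in the training set as a new feature (in this example there is only one feature):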

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]

array([-0.75275929])

X_poly[0]

array([-0.75275929,  0.56664654])

X_poly now contains the original feature X plus the square of that feature. Now fit a LinearRegression model to this extended feature set.

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

(array([1.78134581]), array([[0.93366893, 0.56456263]]))

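Not bad: the model estimates $\hat{y}=0.56x_1^2+0.93x_1+1.78$, while the original function was $y=0.5x_1^2+1.0x_1+2.0+\text{Gaussian noise}$.

Note that when there are multiple features, polynomial regression can capture relationships between features (which a plain linear regression model cannot do). This is because PolynomialFeatures adds all combinations of features up to the given degree (with interaction_only=False). For example, with two features $a$ and $b$ and degree=3, PolynomialFeatures adds not only the features $a^2$, $a^3$, $b^2$ and $b^3$, but also the combinations $ab$, $a^2b$ and $ab^2$.^[2]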

X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
plt.show()
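The transform step and the fit step used above can also be chained, so that new inputs are expanded automatically at prediction time. This is a minimal sketch using `make_pipeline`, not the approach taken in the article itself; the data is regenerated with `np.random.default_rng`, so the fitted numbers will differ slightly from those shown earlier, and the names `model`, `X_demo`, and `X_grid` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate similar data (a different random stream than np.random.seed(42)).
rng = np.random.default_rng(42)
m = 100
X_demo = 6 * rng.random((m, 1)) - 3
y_demo = 0.5 * X_demo**2 + X_demo + 2 + rng.standard_normal((m, 1))

# The pipeline applies the polynomial expansion, then the linear regression.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_demo, y_demo)

# Raw inputs can be passed directly; no manual transform call is needed.
X_grid = np.linspace(-3, 3, 5).reshape(-1, 1)
y_grid = model.predict(X_grid)
print(y_grid.ravel())
```

Keeping both steps in one object reduces the risk of forgetting to transform new data before calling predict.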

PolynomialFeatures(degree=d) transforms an array containing $n$ features into an array containing $\frac{(n+d)!}{d!\,n!}$ features (this count includes the bias column).
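The count can be verified by brute force using only the standard library. The sketch below enumerates every monomial of total degree 0 to d in n variables and compares the total with the binomial coefficient $\binom{n+d}{d}=\frac{(n+d)!}{d!\,n!}$; the helper name `count_polynomial_features` is hypothetical and not part of scikit-learn.

```python
from itertools import combinations_with_replacement
from math import comb


def count_polynomial_features(n, d):
    """Count monomials of total degree 0..d in n variables by enumeration."""
    return sum(
        1
        for degree in range(d + 1)
        for _ in combinations_with_replacement(range(n), degree)
    )


# Compare the enumeration with the closed-form count for a few (n, d) pairs.
for n, d in [(1, 2), (2, 3), (5, 3)]:
    assert count_polynomial_features(n, d) == comb(n + d, d)
    print(n, d, comb(n + d, d))
```

The count grows quickly with both n and d, so high degrees on wide datasets can produce a very large feature matrix.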

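References

[1] He Xiaoqun. Applied Regression Analysis (R Language Edition) [M]. Beijing: Publishing House of Electronics Industry, 2018: 203-204.

[2] Aurelien Geron; translated by Wang Jingyuan, Jia Wei, Bian Rui, Qiu Juntao. Hands-On Machine Learning with Scikit-Learn and TensorFlow [M]. Beijing: China Machine Press, 2018: 115-117.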
