Regression and Classification Learning

Installing Python and running regression/classification experiments

Python's scientific-computing ecosystem is now very powerful. R is also a language for scientific computing, but given Python's general-purpose nature, this statistical learning course will be presented in Python.

  • Installing and configuring Python
  • Datasets
  • Simple linear regression
  • Multiple linear regression
  • Classification

Installing and Configuring Python

  • Windows:

    • 1. Download the installer
      https://www.python.org/downloads/ (choose the 32-bit or 64-bit build to match your machine)
    • 2. Install
      Default installation path: C:\python27
    • 3. Configure the environment variable
      [Right-click Computer] -> [Properties] -> [Advanced system settings] -> [Advanced] -> [Environment Variables] -> [find the variable named Path in the second list and double-click it] -> [append the Python installation directory to its value, separated with ;]
  • Linux (Ubuntu):

    No installation needed; Python comes preinstalled.

Installing and Using pip, Python's Package Manager

Windows:
Download the pip installer script get-pip.py from https://pip.pypa.io/en/latest/installing.html#id7
Linux (Ubuntu): sudo apt-get install python-pip

Interactive Computing Environments

  • IPython, an advanced Python console: http://ipython.org/
  • Jupyter, notebooks in the browser: http://jupyter.org/
  • Anaconda: https://www.anaconda.com/download/
  • WinPython: https://winpython.github.io/
  • Spyder
  • PyCharm: a paid but very usable IDE; the free Community edition is recommended.

Commonly used libraries:
  • pandas, statsmodels, seaborn for statistics
  • sympy for symbolic computing
  • scikit-image for image processing
  • scikit-learn for machine learning

Note: I personally recommend Anaconda 2, which bundles Jupyter Notebook and Spyder; it is a very convenient Python environment, and Jupyter Notebook in particular is excellent.

Datasets

Datasets are available at https://archive.ics.uci.edu/ml/, the UCI repository; iris and Boston are commonly used machine-learning datasets. scikit-learn is an open-source machine-learning module for Python (http://scikit-learn.org/dev/).

The core of machine learning is to use data to answer questions.

Import the Boston housing dataset and inspect its description:

from sklearn import datasets 
iris = datasets.load_iris()
#print(iris.data)

import sklearn.datasets
boston = sklearn.datasets.load_boston()
print(boston.DESCR)

boston.data returns the 506 x 13 feature matrix;
boston.target returns the corresponding 506 x 1 vector of house prices (checked in the short snippet below).
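
A quick sanity check of these shapes, a minimal sketch assuming the dataset is loaded as above:

from sklearn import datasets

boston = datasets.load_boston()
print(boston.data.shape)     # (506, 13): 506 samples, 13 features
print(boston.target.shape)   # (506,): the corresponding house prices
print(boston.feature_names)  # names of the 13 predictors (CRIM, ZN, ..., LSTAT)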

Simple Linear Regression

Perform simple linear regression on the Boston dataset.


  • 1) Libraries

sklearn.datasets, sklearn.linear_model, numpy (numpy.random, numpy.linalg), matplotlib

  • 2) Requirements and steps

a) Split the data into training and test sets; fit a simple linear regression with sklearn.linear_model.LinearRegression() to examine the relationship between the predictor and the response, including its strength and sign.
b) Plot the response against the predictor and draw the least-squares regression line.
c) Use the evaluation utilities provided for the LinearRegression model and print the results:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
d) Train and predict on the Boston housing data with both LinearRegression and SGDRegressor, and report the evaluation results.

  1. Split the data set and fit a simple linear regression
    First, study the linear regression class:
    help(linear_model.LinearRegression)
    class LinearRegression(LinearModel, sklearn.base.RegressorMixin)
 |  Parameters
 |  ----------
 |  fit_intercept : boolean, optional
 |      whether to calculate the intercept for this model. If set
 |      to false, no intercept will be used in calculations
 |      (e.g. data is expected to be already centered).
 |  
 |  normalize : boolean, optional, default False
 |      If True, the regressors X will be normalized before regression.
 |      This parameter is ignored when `fit_intercept` is set to False.
 |      When the regressors are normalized, note that this makes the
 |      hyperparameters learnt more robust and almost independent of the number
 |      of samples. The same property is not valid for standardized data.
 |      However, if you wish to standardize, please use
 |      `preprocessing.StandardScaler` before calling `fit` on an estimator
 |      with `normalize=False`.
 |  
 |  copy_X : boolean, optional, default True
 |      If True, X will be copied; else, it may be overwritten.
 |  
 |  n_jobs : int, optional, default 1
 |      The number of jobs to use for the computation.
 |      If -1 all CPUs are used. This will only provide speedup for
 |      n_targets > 1 and sufficient large problems.
 |  
 |  Attributes
 |  ----------
 |  coef_ : array, shape (n_features, ) or (n_targets, n_features)
 |      Estimated coefficients for the linear regression problem.
 |      If multiple targets are passed during the fit (y 2D), this
 |      is a 2D array of shape (n_targets, n_features), while if only
 |      one target is passed, this is a 1D array of length n_features.
 |  
 |  residues_ : array, shape (n_targets,) or (1,) or empty
 |      Sum of residuals. Squared Euclidean 2-norm for each target passed
 |      during the fit. If the linear regression problem is under-determined
 |      (the number of linearly independent rows of the training matrix is less
 |      than its number of linearly independent columns), this is an empty
 |      array. If the target vector passed during the fit is 1-dimensional,
 |      this is a (1,) shape array.
 |  
 |      .. versionadded:: 0.18
 |  
 |  intercept_ : array
 |      Independent term in the linear model.
 |  
 |  Notes
 |  -----
 |  From the implementation point of view, this is just plain Ordinary
 |  Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.
 |  
 |  Method resolution order:
 |      LinearRegression
 |      LinearModel
 |      abc.NewBase
 |      sklearn.base.BaseEstimator
 |      sklearn.base.RegressorMixin
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit linear model.
 |      
 |      Parameters
 |      ----------
 |      X : numpy array or sparse matrix of shape [n_samples,n_features]
 |          Training data
 |      
 |      y : numpy array of shape [n_samples, n_targets]
 |          Target values
 |      
 |      sample_weight : numpy array of shape [n_samples]
 |          Individual weights for each sample
 |      
 |          .. versionadded:: 0.17
 |             parameter *sample_weight* support to LinearRegression.
 |      
 |      Returns
 |      -------
 |      self : returns an instance of self.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  residues_
 |      DEPRECATED: ``residues_`` is deprecated and will be removed in 0.19
 |      
 |      Get the residues of the fitted model.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset([])
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from LinearModel:
 |  
 |  decision_function(*args, **kwargs)
 |      DEPRECATED:  and will be removed in 0.19.
 |      
 |      Decision function of the linear model.
 |      
 |              Parameters
 |              ----------
 |              X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |                  Samples.
 |      
 |              Returns
 |              -------
 |              C : array, shape = (n_samples,)
 |                  Returns predicted values.
 |  
 |  predict(self, X)
 |      Predict using the linear model
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      C : array, shape = (n_samples,)
 |          Returns predicted values.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self)
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.RegressorMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the coefficient of determination R^2 of the prediction.
 |      
 |      The coefficient R^2 is defined as (1 - u/v), where u is the regression
 |      sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual
 |      sum of squares ((y_true - y_true.mean()) ** 2).sum().
 |      Best possible score is 1.0 and it can be negative (because the
 |      model can be arbitrarily worse). A constant model that always
 |      predicts the expected value of y, disregarding the input features,
 |      would get a R^2 score of 0.0.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True values for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          R^2 of self.predict(X) wrt. y.

The main methods here are fit(), predict() and score().
fit() learns the regression coefficients, predict() makes predictions on new data, and score() returns the coefficient of determination R² (not the mean squared error); R² can be negative when the model fits worse than simply predicting the mean (a hand-rolled numpy equivalent of the fit is sketched below).
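
For reference, since the library list above mentions numpy.linalg, here is a minimal hand-rolled sketch of the same ordinary least-squares fit done with numpy; it is illustrative only, and the helper name ols_fit is made up rather than part of sklearn:

import numpy as np

def ols_fit(X, y):
    # append a column of ones so the intercept is estimated as well
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    # least-squares solution of X1 * beta = y, i.e. what LinearRegression.fit() computes
    beta, _, _, _ = np.linalg.lstsq(X1, y)
    return beta[0], beta[1:]   # intercept_, coef_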

Split the data set. Use the sixth feature (RM, the average number of rooms) for simple linear regression:

%matplotlib inline 
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score,mean_absolute_error

boston_x = boston.data[:,np.newaxis,5]

x_train = boston_x[:400]
x_test = boston_x[451:]            # note: samples 400-450 are left out of both sets
y_train = boston.target[:400]
y_test = boston.target[451:]


lr = linear_model.LinearRegression()

lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)


score_test = lr.score(x_test, y_test)
print('Score of test : %.2f ' % score_test)

Result: Score of test : -1.07

Plot and print the evaluation results

print('Coefficients:', lr.coef_)
print('MSE: %.2f' % mean_squared_error(y_test,y_pred))

print('Variance score: %f ' % r2_score(y_test,y_pred))

print('MAE:%.2f' %mean_absolute_error(y_test,y_pred))
plt.scatter(x_test,y_test, color='black')
plt.plot(x_test,y_pred,color='blue',linewidth=3)

Result:
('Coefficients:', array([ 9.40550212]))
MSE: 37.47
Variance score: -1.066368
MAE:4.79
Out[41]:
[<matplotlib.lines.Line2D at 0xbb89828>]
(Figure: the fitted least-squares regression line)

Try regression and plotting with other feature dimensions; see the sketch below.
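
A small sketch of that exercise, assuming boston, np and linear_model are already loaded and the same 400/451 split as above: fit a simple regression on each feature in turn and compare the test scores.

for dim in range(boston.data.shape[1]):
    xd = boston.data[:, np.newaxis, dim]
    reg = linear_model.LinearRegression()
    reg.fit(xd[:400], boston.target[:400])
    print('feature %2d  test R^2: %.2f' % (dim, reg.score(xd[451:], boston.target[451:])))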



Regression with SGD (stochastic gradient descent):

clf = linear_model.SGDRegressor()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
plt.scatter (x_test, y_test, color='black')
plt.scatter(x_test,y_pred, color='blue' )
print("SGD_score: %.2f" % clf.score(x_test,y_test))

SGD_score: -0.66
(Figure: SGD predictions plotted against the response)
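
SGDRegressor is sensitive to the scale of its inputs, which partly explains the poor score above; standardizing the feature usually helps. A minimal sketch, assuming the same x_train/x_test split as before:

from sklearn.preprocessing import StandardScaler

# standardize the feature so that gradient descent converges properly
scaler = StandardScaler().fit(x_train)
clf = linear_model.SGDRegressor()
clf.fit(scaler.transform(x_train), y_train)
print("SGD_score (scaled): %.2f" % clf.score(scaler.transform(x_test), y_test))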

2. Multiple Linear Regression

Perform multiple linear regression on the Boston housing dataset.
  • 1. Requirements and steps

    • a. Draw the scatter-plot matrix of all variables in the dataset; matplotlib.pyplot.scatter(a,b)

      A scatter-plot matrix is a collection of scatter plots used to show the pairwise correlations among multiple variables.
      Its diagonal consists of each variable plotted against itself, and since the matrix is symmetric, only the upper (or lower) triangle needs to be examined.
      In this experiment, besides the 13 predictor dimensions there is also the response variable, the Boston house price; compared with the correlations among the 13 predictors,
      we care more about the correlation between the response and each of the 13 predictors.
      Therefore, in the implementation I took the first 6 feature dimensions to build a 6 x 6 scatter-plot matrix and added a rightmost column of scatter plots against the Boston house price.

      The full 13 x 13 scatter-plot matrix is too large to read, so only the 6 x 6 version is drawn (an alternative pandas-based sketch follows the code below).

#fig
#plt.subplot(441)
#plt.scatter(boston.data[:,0], boston.data[:,0] )
#plt.subplot(442)
#plt.scatter(boston.data[:,0], boston.data[:,1] )
#plt.subplot(443)
#plt.scatter(boston.data[:,0], boston.data[:,2] )
#plt.subplot(444)
#plt.scatter(boston.data[:,0], boston.target )
#


fig, axes = plt.subplots(nrows=6, ncols=7,sharex=True, sharey=True,
                           figsize=(19,12))
fig.tight_layout() # Or equivalently,  "plt.tight_layout()"


for i in range(0,6,1):
    for j in range(0,6,1):
        ax=7*(i)+j+1
        subax=plt.subplot(6,7,ax)
        plt.legend()
        subax.set_xticks([])  
        subax.set_yticks([]) 
        plt.scatter(boston.data[:,i], boston.data[:,j] )

for k in range(1,7,1):
    ax=plt.subplot(6,7,7*k)

    plt.scatter(boston.data[:,k-1], boston.target )

plt.show()
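
For comparison, pandas can draw essentially the same picture in a few lines; this is a sketch assuming a pandas version that exposes scatter_matrix under pandas.plotting (older releases put it in pandas.tools.plotting):

import pandas as pd
from pandas.plotting import scatter_matrix

# first six predictors plus the house price, as in the hand-built matrix above
df = pd.DataFrame(boston.data[:, :6])
df['PRICE'] = boston.target
scatter_matrix(df, figsize=(15, 12))
plt.show()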
  • b. Compute the correlation coefficient matrix between the variables.

    Below are two correlation matrices: a (6+1) x (6+1) matrix for the first six predictors plus the house price, and a 14 x 14 matrix covering all 13 predictors and the house price (a more compact variant is sketched after the code).

import numpy as np
y=[boston.data[:,0],boston.data[:,1],boston.data[:,2],boston.data[:,3],
   boston.data[:,4],boston.data[:,5],boston.target]
np.corrcoef(y)

array([[ 1.        , -0.19945796,  0.4044707 , -0.05529526,  0.41752143, -0.21993979, -0.38583169],
       [-0.19945796,  1.        , -0.53382819, -0.04269672, -0.51660371,  0.31199059,  0.36044534],
       [ 0.4044707 , -0.53382819,  1.        ,  0.06293803,  0.76365145, -0.39167585, -0.48372516],
       [-0.05529526, -0.04269672,  0.06293803,  1.        ,  0.09120281,  0.09125123,  0.17526018],
       [ 0.41752143, -0.51660371,  0.76365145,  0.09120281,  1.        , -0.30218819, -0.42732077],
       [-0.21993979,  0.31199059, -0.39167585,  0.09125123, -0.30218819,  1.        ,  0.69535995],
       [-0.38583169,  0.36044534, -0.48372516,  0.17526018, -0.42732077,  0.69535995,  1.        ]])

import numpy as np
y=[boston.data[:,0], boston.data[:,1], boston.data[:,2], 
   boston.data[:,3], boston.data[:,4], boston.data[:,5],
   boston.data[:,6], boston.data[:,7], boston.data[:,8],
   boston.data[:,9], boston.data[:,10], boston.data[:,11],
   boston.data[:,12], boston.target]
np.corrcoef(y)
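
A more compact equivalent for the full 14 x 14 matrix: stack the predictors and the target column-wise and let np.corrcoef treat the columns as variables (a sketch using the same boston data as above):

all_vars = np.column_stack([boston.data, boston.target])
corr = np.corrcoef(all_vars, rowvar=False)   # 14 x 14 correlation matrix
print(corr.shape)
print(corr[-1, :])   # correlation of the house price with every predictor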
  • c. Use polynominalData = sklearn.preprocessing.PolynomialFeatures(degree=2).fit_transform(boston.data) for multiple linear regression and report the performance.
    The requirement here is polynomial regression within multiple linear regression, in this case a second-order multivariate polynomial regression. The first figure below shows a plain first-order multiple linear regression: the blue line is the predicted house price, the black line is the original price, and the score on the test portion is -0.38. The second figure is the required polynomial regression, scoring -25.86 on the test portion. The third figure shows the polynomial regression fit on the training set, with a score of 0.93, and the fourth figure shows the first-order multiple regression fit on the training set, with a score of 0.73.
    Polynomial regression improves the fit on the training data, but its predictive performance is much worse than plain first-order multiple regression; as the polynomial degree rises, the training fit keeps improving until it overfits, and then predictive performance becomes very poor (a compact version of this comparison is sketched after the plotting code).
    A similar trade-off appears to arise later with LDA and QDA in classification.
import sklearn.preprocessing

polynominalData = sklearn.preprocessing.PolynomialFeatures(degree=2).fit_transform(boston.data)
regre = sklearn.linear_model.LinearRegression()
model = regre.fit(boston.data[:400,:], boston.target[:400])
multi_predict= regre.predict(boston.data[401:,:])
plt.title("multi_linearRegression")
plt.plot(range(401,506), boston.target[401:], color='black', label = "target")
plt.legend()
plt.plot(range(401,506), multi_predict, color='blue', label = "multi_predict")
plt.legend()
print(regre.score(boston.data[401:,:], boston.target[401:]))
regre = sklearn.linear_model.LinearRegression()
polynominalData = sklearn.preprocessing.PolynomialFeatures(degree=2).fit_transform(boston.data)

model = regre.fit(polynominalData[:400,:], boston.target[:400])
multi_predict= regre.predict(polynominalData[401:,:])
plt.title("polynomina")
plt.plot(range(401,506), boston.target[401:], color='black', label = "target")
plt.legend()
plt.plot(range(401,506), multi_predict, color='blue', label = "poly_predict")
plt.legend()
print(regre.score(polynominalData[401:,:], boston.target[401:]))
regre = sklearn.linear_model.LinearRegression()

model = regre.fit(boston.data[:400,:], boston.target[:400])

multi_predict= regre.predict(boston.data[:400,:])
plt.title("multi_linearRegression")
plt.plot(range(0,100), boston.target[:100], color='black', label = "ori_target")
plt.legend()
plt.plot(range(0,100), multi_predict[0:100], color='blue', label = "predict")
plt.legend()
print(model.score(boston.data[:400,:], boston.target[:400]))
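
A compact restatement of the comparison above, a sketch that uses the same 400/506 split and repeats the imports so the block stands on its own:

import sklearn.preprocessing
import sklearn.linear_model

# degree 1 behaves like plain multiple regression; degree 2 adds the quadratic terms.
# Train R^2 rises with the polynomial terms while test R^2 collapses (overfitting).
for degree in (1, 2):
    feats = sklearn.preprocessing.PolynomialFeatures(degree=degree).fit_transform(boston.data)
    reg = sklearn.linear_model.LinearRegression().fit(feats[:400], boston.target[:400])
    print('degree %d  train R^2: %.2f  test R^2: %.2f'
          % (degree,
             reg.score(feats[:400], boston.target[:400]),
             reg.score(feats[401:], boston.target[401:])))
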
The block below evaluates the multiple regression with 10-fold cross-validation and plots the predicted values against the measured values:

from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

lr = linear_model.LinearRegression()
boston = datasets.load_boston()
y = boston.target
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, boston.data, boston.target, cv=10)

cross_score = cross_val_score(lr, boston.data, boston.target, cv=10)

print(cross_score)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
  • e. By varying the split ratio, observe how the training score and the test score change as the amount of training data increases, and plot a curve.

    The split is done with train_test_split; here the test proportion takes 100 values from 4% up to about 94%, so the training proportion runs from 96% down to about 6%.
    Plot the performance curves with the test-set proportion on the x axis, the test score in blue and the training score in black.
    The training score stays quite stable, and even improves once the training proportion drops to about 75%, while the test score falls off sharply when the test proportion reaches about 94%.
    The reason is clear: with very few training samples the training fit becomes easy, but so few samples cannot support good predictions on the remaining ~95% of the data, if any prediction at all.

    Because of this sharp drop near a 94% test proportion, a second curve is drawn with the test proportion capped before the drop occurs.

from sklearn.model_selection import train_test_split
from numpy import *

ratio =  [float(j)/110 + 0.04 for j in range(100)]
train_score = zeros(100)
test_score = zeros(100)
regre = sklearn.linear_model.LinearRegression()
for i in range(0,100):
    [X_train, X_test, y_train, y_test ]= train_test_split(boston.data, boston.target, test_size = ratio[i], random_state=0)


    model = regre.fit(X_train, y_train)

    train_score[i] = regre.score(X_train, y_train)
    test_score[i] = regre.score(X_test, y_test)
print(train_score)
print(test_score)
plt.plot( ratio, train_score, color='black',label ="train_score" )
plt.legend()

plt.plot( ratio, test_score, color='blue',label ="test_score")
plt.legend()

A second curve takes the test proportion from 3% up to about 93% (training proportion from 97% down to about 7%), so no sudden drop occurs, and the x axis is changed to the training-set proportion. Without the drop, the training score gradually decreases as the training proportion grows, while the test score stays stable for a while, then fluctuates downward from roughly an 80% training proportion and turns upward again at around 90%. (sklearn's learning_curve helper, sketched after the code below, automates the same kind of plot.)

from sklearn.model_selection import train_test_split
from numpy import *

ratio =  [1-(float(j)/110 + 0.03) for j in range(100)]
train_score = zeros(100)
test_score = zeros(100)
regre = sklearn.linear_model.LinearRegression()
for i in range(0,100):
    [X_train, X_test, y_train, y_test ]= train_test_split(boston.data, boston.target,
                                                          test_size = (1 - ratio[i]), random_state=0)


    model = regre.fit(X_train, y_train)

    train_score[i] = regre.score(X_train, y_train)
    test_score[i] = regre.score(X_test, y_test)

plt.plot( ratio, train_score, color='black',label ="train_score" )
plt.legend()

plt.plot( ratio, test_score, color='blue',label ="test_score")
plt.legend()
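
sklearn also has a built-in helper for exactly this kind of plot; a sketch with learning_curve from sklearn.model_selection (available from 0.18 on), assuming boston, np, plt and linear_model from the cells above. It computes cross-validated train/test scores over a range of training-set sizes:

from sklearn.model_selection import learning_curve

sizes, tr_scores, te_scores = learning_curve(
    linear_model.LinearRegression(), boston.data, boston.target,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

# average the cross-validation folds for each training-set size
plt.plot(sizes, tr_scores.mean(axis=1), color='black', label='train_score')
plt.plot(sizes, te_scores.mean(axis=1), color='blue', label='test_score')
plt.xlabel('number of training samples')
plt.legend()
plt.show()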

Classification

Using the handwritten digits dataset, get familiar with how to classify images.
        # Import matplotlib
        import matplotlib.pyplot as plt 

        from sklearn import datasets
        digits = datasets.load_digits()

        # Join the images and target labels in a list
        images_and_labels = list(zip(digits.images, digits.target))

        # for every element in the list
        for index, (image, label) in enumerate(images_and_labels[:8]):
            # initialize a subplot of 2X4 at the i+1-th position
            plt.subplot(2, 4, index + 1)
            # Don't plot any axes
            plt.axis('off')
            # Display images in all subplots 
            plt.imshow(image, cmap=plt.cm.gray_r,interpolation='nearest')
            # Add a title to each subplot
            plt.title('Training: ' + str(label))

        # Show the plot
        plt.show()
"""
================================
Recognizing hand-written digits
================================

An example showing how the scikit-learn can be used to recognize images of
hand-written digits.

This example is commented in the
:ref:`tutorial section of the user manual <introduction>`.

"""
print(__doc__)

# Author: Gael Varoquaux <gael dot varoquaux at normalesup dot org>
# License: BSD 3 clause

# Standard scientific Python imports
import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# The digits dataset
digits = datasets.load_digits()

# The data that we are interested in is made of 8x8 images of digits, let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset.  If we were working from image files, we could load them using
# matplotlib.pyplot.imread.  Note that each image must have the same size. For these
# images, we know which digit they represent: it is given in the 'target' of
# the dataset.
images_and_labels = list(zip(digits.images, digits.target))   #list images and labels
for index, (image, label) in enumerate(images_and_labels[:4]):#just look at the first 4 images
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')#plt.cm interpolation
    plt.title('Training: %i' % label)

# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))#turn image data in a matrix

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)#SVM classifier with gamma=0.001

# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])#fit the first half of data and target

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]#y_test 
predicted = classifier.predict(data[n_samples // 2:])#y_predict

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))#confusion_matrix 

images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))#images and prdictions's results
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

plt.show()
  • 3. Use the KNN classification model to recognize the handwritten digits dataset, and discuss how classification performance changes as k changes.

Implementing the k-nearest neighbors algorithm

The idea of the k-nearest neighbors algorithm is as follows:

  • (1) Compute the distance between the current point and every point in the labelled dataset

  • (2) Sort the points by increasing distance

  • (3) Take the k points closest to the current point

  • (4) Count how often each class appears among those k points

  • (5) Return the most frequent class among the k points as the predicted class for the current point

A Python implementation of the k-nearest neighbors algorithm follows (a quick sanity check of this function appears after the library-based KNN loop below):

#coding: utf-8
import matplotlib.pyplot as plt
import numpy as np
import operator
from sklearn import datasets, svm, metrics
digits = datasets.load_digits()

# Join the images and target labels in a list
images_and_labels = list(zip(digits.images, digits.target))

# for every element in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # initialize a subplot of 2X4 at the i+1-th position
    plt.subplot(2, 4, index + 1)
    # Don't plot any axes
    plt.axis('off')
    # Display images in all subplots 
    plt.imshow(image, cmap=plt.cm.gray_r,interpolation='nearest')
    # Add a title to each subplot
    plt.title('Training: ' + str(label))

# Show the plot
plt.show()

from sklearn import neighbors


def knn_class(inX, dataset, labels, k):
    # (1) Euclidean distance from inX to every sample in the dataset
    dataset_size = dataset.shape[0]
    data_diff = (np.tile(inX, (dataset_size, 1)) - dataset) ** 2
    sq_diff = data_diff.sum(axis=1)
    diff = sq_diff ** 0.5
    # (2) indices of the samples sorted by increasing distance
    sort_diff = diff.argsort()
    class_count = {}
    # (3)-(4) count how often each class appears among the k nearest neighbours
    for i in range(k):
        numOflabel = labels[sort_diff[i]]
        class_count[numOflabel] = class_count.get(numOflabel, 0) + 1
    # (5) return the most frequent class
    sortedClassCount = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


X_digits = digits.data
y_digits = digits.target

n_samples = len(X_digits)

X_train = X_digits[:int(.5 * n_samples)]
y_train = y_digits[:int(.5 * n_samples)]
X_test = X_digits[int(.5 * n_samples):]
y_test = y_digits[int(.5 * n_samples):]

for i in range(9):

    knn = neighbors.KNeighborsClassifier(n_neighbors=i+1)
    print('KNN score: %.5f  k = %d' % (knn.fit(X_train, y_train).score(X_test, y_test), i+1))
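
To see that the hand-rolled knn_class above behaves like the library version, here is a quick sketch that classifies a handful of test digits with it; it assumes the X_train/X_test split defined above, and the sample count of 50 is an arbitrary choice to keep it fast.

correct = 0
n_check = 50   # only check the first 50 test digits
for idx in range(n_check):
    pred = knn_class(X_test[idx], X_train, y_train, 3)
    correct += int(pred == y_test[idx])
print('hand-rolled KNN (k=3) accuracy on %d samples: %.3f' % (n_check, float(correct) / n_check))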
  • 4. Use SVM (scikit-learn) and compare it with KNN
images_and_labels = list(zip(digits.images, digits.target))   #list images and labels
for index, (image, label) in enumerate(images_and_labels[:4]):#just look at the first 4 images
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')#plt.cm interpolation
    plt.title('Training: %i' % label)

# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))#turn image data in a matrix

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)#SVM classifier with gamma=0.001

# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])#fit the first half of data and target

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]#y_test 
predicted = classifier.predict(data[n_samples // 2:])#y_predict
svm_score = classifier.score(data[n_samples // 2:],expected)
print(svm_score)

Conclusion:

SVM performs slightly better than KNN on this split:

          KNN       SVM
score     0.9632    0.9689

Ref:
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

* Note: the output images turned out to be too large, so they are not uploaded. *
