Machine Learning

I. Overview

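1. What is machine learning?

Artificial intelligence: any approach that uses artificial means to solve, or approximately solve, problems that would otherwise require human intelligence can be called artificial intelligence.
Machine learning: a computer program gains experience E by performing a task T, and the effect of that experience is measured by a performance measure P. If the performance measured by P improves in step as the program performs more of T and accumulates E, the program is called a machine learning system.
Characteristics: self-improving, self-correcting, self-reinforcing.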

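2. Why is machine learning needed?

It simplifies or replaces manual pattern recognition, which makes systems easier to develop, maintain, and upgrade.
For problems whose algorithms are overly complex, or which have no explicit solution, machine learning systems have a natural advantage.
The learning process can also be run in reverse to infer the rules hidden behind business data, which is data mining.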

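3. Types of machine learning

Supervised, unsupervised, semi-supervised, and reinforcement learning
Batch learning and incremental learning
Instance-based learning and model-based learning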
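
To make the batch/incremental distinction concrete, here is a minimal sketch. It assumes scikit-learn's GaussianNB as a stand-in model and synthetic two-cluster data; neither comes from the original text. Batch learning fits on all samples in one call, while incremental learning feeds the same samples in chunks through partial_fit.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (an assumption for illustration): two well-separated 2-D clusters
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-3, 1, size=(100, 2)),
               rng.normal(3, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
order = rng.permutation(len(x))
x, y = x[order], y[order]

# Batch learning: one call sees the whole training set
batch_model = GaussianNB().fit(x, y)

# Incremental learning: the model is updated chunk by chunk
incr_model = GaussianNB()
for start in range(0, len(x), 50):
    chunk = slice(start, start + 50)
    # The full class list is declared up front, since a chunk may not contain every class
    incr_model.partial_fit(x[chunk], y[chunk], classes=[0, 1])

batch_acc = batch_model.score(x, y)
incr_acc = incr_model.score(x, y)
print('batch accuracy:', batch_acc)
print('incremental accuracy:', incr_acc)
```

On data this cleanly separated both models should agree; the practical difference is that the incremental model never needs the whole dataset in memory at once.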

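4. The machine learning workflow

Data
- Data collection
- Data cleaning
Machine learning
- Data preprocessing
- Model selection
- Model training
- Model validation
Business
- Using the model
- Maintenance and upgrades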
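
The workflow above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the bundled iris dataset stands in for data collection and cleaning, and StandardScaler plus LogisticRegression are arbitrary choices for the preprocessing and model steps, not ones prescribed by this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: a bundled dataset stands in for collection and cleaning
x, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.25, random_state=7, stratify=y)

# Machine learning: preprocessing and the chosen model, chained together
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(train_x, train_y)             # train the model
accuracy = model.score(test_x, test_y)  # validate on held-out samples
print('held-out accuracy:', accuracy)

# Business: use the trained model on a new sample
new_sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(new_sample)
print('predicted class:', prediction)
```

Keeping the scaler inside the pipeline means the test samples are scaled with statistics learned from the training samples only.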

II. Data Preprocessing

import sklearn.preprocessing as sp

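1. Mean removal (standardization), also called Z-score normalization

Each column (feature) of the sample matrix is adjusted so that its mean is 0 and its standard deviation is 1. As a result, all features contribute on a comparable scale to the model's predictions, and the model is less biased toward features that happen to have a large numeric range.

Standardization (or Z-score normalization) rescales each feature so that it has mean μ = 0 and standard deviation σ = 1, where μ and σ are the mean and standard deviation of that feature. It shifts and scales the data but does not change the shape of its distribution. The standard scores (also called z-scores) of the samples are calculated as follows:

$z = \frac{x - \mu}{\sigma}$

sp.scale(raw sample matrix) -> standardized sample matrix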

# std.py
import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)
print('Mean:', raw_samples.mean(axis=0))  # axis=0 averages down the rows, giving the mean of each column
print('S:', raw_samples.std(axis=0))      # standard deviation of each column

# Manual standardization: each col is a view into std_samples, so the in-place updates modify it
std_samples = raw_samples.copy()
for col in std_samples.T:
    col_mean = col.mean()
    col_std = col.std()
    col -= col_mean
    col /= col_std
print('After manual standardization:', std_samples)
print('Column means after standardization:', std_samples.mean(axis=0))
print('Column standard deviations after standardization:', std_samples.std(axis=0))

std_samples = sp.scale(raw_samples)  # standardized sample matrix
print('After automatic standardization:', std_samples)
print('Column means after standardization:', std_samples.mean(axis=0))
print('Column standard deviations after standardization:', std_samples.std(axis=0))

[[ 3.  -1.5  2.  -5.4]
 [ 0.   4.  -0.3  2.1]
 [ 1.   3.3 -1.9 -4.3]]
Mean: [ 1.33333333  1.93333333 -0.06666667 -2.53333333]
S: [1.24721913 2.44449495 1.60069429 3.30689515]
After manual standardization: [[ 1.33630621 -1.40451644  1.29110641 -0.86687558]
 [-1.06904497  0.84543708 -0.14577008  1.40111286]
 [-0.26726124  0.55907936 -1.14533633 -0.53423728]]
Column means after standardization: [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Column standard deviations after standardization: [1. 1. 1. 1.]
After automatic standardization: [[ 1.33630621 -1.40451644  1.29110641 -0.86687558]
 [-1.06904497  0.84543708 -0.14577008  1.40111286]
 [-0.26726124  0.55907936 -1.14533633 -0.53423728]]
Column means after standardization: [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Column standard deviations after standardization: [1. 1. 1. 1.]
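
sp.scale computes its statistics from whatever matrix it is given, so it cannot reuse the training statistics for samples that arrive later. A minimal sketch of one way to handle that, using StandardScaler from the same module, follows. The new_samples row is made up for illustration.

```python
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])

# Learn the per-column mean and standard deviation once
scaler = sp.StandardScaler()
std_samples = scaler.fit_transform(raw_samples)
print('mean_ :', scaler.mean_)
print('scale_:', scaler.scale_)

# Hypothetical new sample, standardized with the statistics learned above
new_samples = np.array([[2, 0, 1, -1]], dtype=float)
new_std = scaler.transform(new_samples)
print(new_std)

# inverse_transform undoes the scaling and recovers the raw values
restored = scaler.inverse_transform(std_samples)
```

The fitted scaler gives the same result as sp.scale on the training matrix, and it can be stored alongside the model for use at prediction time.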

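2. Range scaling (Min-Max scaling)

Apply a linear transformation to each column of the sample matrix so that the elements of every column fall in the same interval [min, max], where min and max are chosen by the user.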

$kx + b = y \\ k \cdot co{l_{\min }} + b = \min \\ k \cdot co{l_{\max }} + b = \max$

${\begin{pmatrix} col_{\min } & 1 \\ col_{\max } & 1 \end{pmatrix}} {\begin{pmatrix} k\\ b \end{pmatrix}}= {\begin{pmatrix} {\min }\\ {\max } \end{pmatrix}}$

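Written as a · x = b, where a is the 2×2 coefficient matrix, x = (k, b)ᵀ holds the unknowns, and the right-hand side b is the vector (min, max)ᵀ (this b is a variable name, not the intercept), the system can be solved with either of:

x = np.linalg.solve(a, b)
x = np.linalg.lstsq(a, b)[0]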

Range scaling with [0, 1] as the target interval is sometimes also called "normalization".

An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also simply called "normalization", a common cause of ambiguity). In this approach, the data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded range, in contrast to standardization, is that we end up with smaller standard deviations, which can suppress the effect of outliers. Note that min and max are themselves set by the extreme values, so a single outlier compresses all other values into a narrow part of the range. Min-Max scaling to [0, 1] is typically done via the following equation:

$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$

scaler = sp.MinMaxScaler(feature_range=(min, max))
scaler.fit_transform(raw sample matrix) -> range-scaled sample matrix

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)
mms_samples = raw_samples.copy()

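# Manual range scaling: solve for k and the intercept per column, then apply y = k*x + intercept in place
for col in mms_samples.T:
    col_min = col.min()
    col_max = col.max()
    # k*col_min + intercept = 0 and k*col_max + intercept = 1, written as a @ x = b
    a = np.array([
        [col_min, 1],
        [col_max, 1]])
    b = np.array([0, 1])
    x = np.linalg.solve(a, b)  # x[0] is k, x[1] is the intercept
    col *= x[0]
    col += x[1]
print(mms_samples)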

mms = sp.MinMaxScaler(feature_range=(0, 1))
mms_samples = mms.fit_transform(raw_samples)
print(mms_samples)

[[ 3.  -1.5  2.  -5.4]
 [ 0.   4.  -0.3  2.1]
 [ 1.   3.3 -1.9 -4.3]]
[[ 1.00000000e+00  0.00000000e+00  1.00000000e+00 -1.11022302e-16]
 [ 0.00000000e+00  1.00000000e+00  4.10256410e-01  1.00000000e+00]
 [ 3.33333333e-01  8.72727273e-01  5.55111512e-17  1.46666667e-01]]
[[1.         0.         1.         0.        ]
 [0.         1.         0.41025641 1.        ]
 [0.33333333 0.87272727 0.         0.14666667]]
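
For a target interval other than [0, 1], the same linear map can be written in closed form instead of solving a 2×2 system per column. The sketch below assumes a helper named min_max_scale (hypothetical, not part of scikit-learn) and checks it against MinMaxScaler with feature_range=(-1, 1).

```python
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])

def min_max_scale(samples, lo=0.0, hi=1.0):
    """Scale every column of samples linearly onto [lo, hi].

    Assumes no column is constant; a constant column would divide by zero.
    """
    col_min = samples.min(axis=0)
    col_max = samples.max(axis=0)
    unit = (samples - col_min) / (col_max - col_min)  # first onto [0, 1]
    return unit * (hi - lo) + lo                      # then onto [lo, hi]

manual = min_max_scale(raw_samples, -1, 1)
auto = sp.MinMaxScaler(feature_range=(-1, 1)).fit_transform(raw_samples)
print(manual)
print('matches MinMaxScaler:', np.allclose(manual, auto))
```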

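3. Normalization

Divide each feature value of a sample by the sum of the absolute values of all of that sample's features, so that each feature is expressed as a proportion. Unlike the two methods above, this operates on rows (samples) rather than columns (features).
sp.normalize(raw sample matrix, norm='l1') -> normalized sample matrix
- l1 - L1 norm: the sum of the absolute values of the vector's elements
- l2 - L2 norm: the square root of the sum of the squares of the vector's elements
- Ln norm in general: the n-th root of the sum of the n-th powers of the absolute values of the vector's elements. sp.normalize itself accepts only norm='l1', 'l2', or 'max'.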

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)

# Manual normalization
nor_samples = raw_samples.copy()
for row in nor_samples:
    row_absum = abs(row).sum()
    row /= row_absum
print(nor_samples)
print(abs(nor_samples).sum(axis=1))

# Automatic normalization
nor_samples = sp.normalize(raw_samples, norm='l1')
print(nor_samples)
print(abs(nor_samples).sum(axis=1))

[[ 3.  -1.5  2.  -5.4]
 [ 0.   4.  -0.3  2.1]
 [ 1.   3.3 -1.9 -4.3]]
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]
[1. 1. 1.]
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]
[1. 1. 1.]
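
The example above uses the L1 norm. As a small sketch of the L2 case, each row is divided by the square root of the sum of its squared elements, so every row ends up with Euclidean length 1. The variable names here are illustrative only.

```python
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])

# Manual L2 normalization: one norm per row, kept as a column for broadcasting
l2_norms = np.sqrt((raw_samples ** 2).sum(axis=1, keepdims=True))
manual_l2 = raw_samples / l2_norms

# Automatic L2 normalization
auto_l2 = sp.normalize(raw_samples, norm='l2')

print(auto_l2)
print('row lengths:', np.linalg.norm(auto_l2, axis=1))
```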

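4. Binarization

Given a threshold chosen in advance, set every element of the sample matrix that is greater than the threshold to 1 and every other element to 0, producing a binary matrix made up entirely of 1s and 0s.
binarizer = sp.Binarizer(threshold=threshold)
binarizer.transform(raw sample matrix) -> binarized sample matrix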

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)

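# Manual binarization: build the mask first, so the second assignment
# does not depend on values already overwritten by the first
bin_samples = raw_samples.copy()
above = bin_samples > 1.4
bin_samples[~above] = 0
bin_samples[above] = 1
print(bin_samples)

# Automatic binarization
binarizer = sp.Binarizer(threshold=1.4)
bin_samples = binarizer.transform(raw_samples)
print(bin_samples)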

[[ 3.  -1.5  2.  -5.4]
 [ 0.   4.  -0.3  2.1]
 [ 1.   3.3 -1.9 -4.3]]
[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [0. 1. 0. 0.]]
[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [0. 1. 0. 0.]]
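
One detail worth checking is what happens to an element exactly equal to the threshold. The sketch below uses a made-up one-row matrix and compares the Binarizer class, the sp.binarize function, and a plain NumPy expression; all three treat "greater than" strictly.

```python
import numpy as np
import sklearn.preprocessing as sp

samples = np.array([[1.3, 1.4, 1.5]])

# An element equal to the threshold is not greater than it, so it maps to 0
by_class = sp.Binarizer(threshold=1.4).transform(samples)
by_func = sp.binarize(samples, threshold=1.4)   # function form, no object needed
by_numpy = np.where(samples > 1.4, 1.0, 0.0)    # plain NumPy equivalent

print(by_class)  # [[0. 0. 1.]]
```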

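5. One-hot encoding

Each feature value is encoded as a sequence containing exactly one 1 and otherwise 0s. This preserves every distinction in the sample matrix while producing a matrix that contains only 1s and 0s and is mostly 0s. This can make the model more tolerant of errors and, when the result is stored in sparse form, saves memory.

  ---------------------- Original matrix
  1        3        2
  7        5        4
  1        8        6
  7        3        9
  ---------------------- Encoding dictionary (one per column)
  1:10   3:100   2:1000
  7:01   5:010   4:0100
         8:001   6:0010
                 9:0001
  ---------------------- Encoded sample matrix
  101001000
  010100100
  100010010
  011000001

encoder = sp.OneHotEncoder(sparse_output=whether to return a sparse matrix (default True), dtype=element type)
(the parameter was named sparse before scikit-learn 1.2)
encoder.fit_transform(raw sample matrix) -> one-hot encoded sample matrix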

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [1, 3, 2],
    [7, 5, 4],
    [1, 8, 6],
    [7, 3, 9]])
print('Raw array:\n', raw_samples)

# Build the list of encoding dictionaries
code_tables = []
for col in raw_samples.T:
    # Encoding dictionary for one column
    code_table = {}
    for val in col:
        code_table[val] = None
    code_tables.append(code_table)

# Fill in the code for every key of every encoding dictionary
for code_table in code_tables:
    size = len(code_table)
    for one, key in enumerate(sorted(code_table.keys())):
        code_table[key] = np.zeros(shape=size, dtype=int)
        code_table[key][one] = 1

# One-hot encode the raw sample matrix using the encoding dictionaries
ohe_samples = []
for raw_sample in raw_samples:
    ohe_sample = np.array([], dtype=int)
    for i, key in enumerate(raw_sample):
        ohe_sample = np.hstack((ohe_sample, code_tables[i][key]))
    ohe_samples.append(ohe_sample)
ohe_samples = np.array(ohe_samples)
print('Manual encoding:\n', ohe_samples)

# Automatic one-hot encoding (dense output)
# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
ohe = sp.OneHotEncoder(sparse_output=False, dtype=int, categories='auto')
ohe_samples = ohe.fit_transform(raw_samples)
print('Automatic encoding:\n', ohe_samples)
# Automatic one-hot encoding (sparse output, the default)
ohe = sp.OneHotEncoder(dtype=int, categories='auto')
ohe_samples = ohe.fit_transform(raw_samples)
print('Automatic encoding (sparse):\n', ohe_samples)

Raw array:
 [[1 3 2]
 [7 5 4]
 [1 8 6]
 [7 3 9]]
Manual encoding:
 [[1 0 1 0 0 1 0 0 0]
 [0 1 0 1 0 0 1 0 0]
 [1 0 0 0 1 0 0 1 0]
 [0 1 1 0 0 0 0 0 1]]
Automatic encoding:
 [[1 0 1 0 0 1 0 0 0]
 [0 1 0 1 0 0 1 0 0]
 [1 0 0 0 1 0 0 1 0]
 [0 1 1 0 0 0 0 0 1]]
Automatic encoding (sparse):
   (0, 0)	1
  (0, 2)	1
  (0, 5)	1
  (1, 1)	1
  (1, 3)	1
  (1, 6)	1
  (2, 0)	1
  (2, 4)	1
  (2, 7)	1
  (3, 1)	1
  (3, 2)	1
  (3, 8)	1
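
A fitted encoder can also be inspected and applied to new rows. The sketch below is an illustration under assumptions: the two query rows are made up, and handle_unknown='ignore' is one possible policy for values the encoder never saw, not something the example above requires.

```python
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [1, 3, 2],
    [7, 5, 4],
    [1, 8, 6],
    [7, 3, 9]])

# handle_unknown='ignore' encodes unseen values as all zeros instead of raising
ohe = sp.OneHotEncoder(dtype=int, handle_unknown='ignore')
ohe.fit(raw_samples)
print(ohe.categories_)   # sorted distinct values of each column

# .toarray() turns the default sparse result into a dense NumPy array
known = ohe.transform([[7, 8, 2]]).toarray()
print(known)             # 7 -> 01, 8 -> 001, 2 -> 1000

# 5 never appeared in the first column, so its two slots stay 0
unknown = ohe.transform([[5, 3, 9]]).toarray()
print(unknown)

# inverse_transform maps the codes back to the original values
decoded = ohe.inverse_transform(ohe.transform(raw_samples))
```

Calling .toarray() on the default sparse result avoids the sparse/sparse_output parameter altogether, which keeps the snippet independent of the scikit-learn version.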

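6. Label encoding

Converts text-valued labels into numeric values. The codes come from the lexicographic (dictionary) order of the label strings and have nothing to do with what the labels mean.
```
  Position     Car      Car encoded as:
  employee     toyota     3
  team lead    ford       2
  manager      audi       0
  boss         bmw        1
```
encoder = sp.LabelEncoder()
encoder.fit_transform(raw label array (1-D)) -> encoded label array
encoder.inverse_transform(encoded label array) -> raw label array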

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array(['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw', 'toyota', 'bmw'])
print('Raw array:\n', raw_samples)

lbe = sp.LabelEncoder()
lbe_samples = lbe.fit_transform(raw_samples)
print('Labeled:\n', lbe_samples)

raw_samples = lbe.inverse_transform(lbe_samples)
print('Inverse:\n', raw_samples)

Raw array:
 ['audi' 'ford' 'audi' 'toyota' 'ford' 'bmw' 'toyota' 'bmw']
Labeled:
 [0 2 0 3 2 1 3 1]
Inverse:
 ['audi' 'ford' 'audi' 'toyota' 'ford' 'bmw' 'toyota' 'bmw']
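
Because the codes follow string order rather than meaning, it helps to see where the mapping is stored and what to do when a feature does have a meaningful order. The sketch below is an illustration under assumptions: the position values and their ranking are made up, and OrdinalEncoder is swapped in as one way to supply an explicit order, which LabelEncoder does not support.

```python
import numpy as np
import sklearn.preprocessing as sp

cars = np.array(['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw', 'toyota', 'bmw'])
lbe = sp.LabelEncoder()
codes = lbe.fit_transform(cars)

# classes_ holds the sorted labels; a label's index in it is its code
print(lbe.classes_)
subset_codes = lbe.transform(['bmw', 'toyota'])
print(subset_codes)

# For an ordered feature, OrdinalEncoder takes the order explicitly.
# The ranking below is an assumption made up for this example.
positions = np.array([['manager'], ['employee'], ['boss'], ['team lead']])
rank = ['employee', 'team lead', 'manager', 'boss']
ode = sp.OrdinalEncoder(categories=[rank])
position_codes = ode.fit_transform(positions)
print(position_codes.ravel())
```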

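III. Basic Problems in Machine Learning

Regression: given known inputs and outputs that lie in a continuous domain, repeated model training finds the relationship between them. This relationship can usually be formalized as a function, e.g. y = w0 + w1x + w2x^2 + …; given an input whose output is unknown, the function predicts the corresponding continuous-valued output.
Classification: if the output of a regression problem is changed from a continuous domain to a discrete one, the problem becomes a classification problem.
Clustering: find some pattern, such as similarity, in the known inputs, divide the inputs into clusters according to that pattern, and apply the same division to new inputs to determine which cluster they belong to.
Dimensionality reduction: select, from a large number of features, the few that matter most to the model's predictions, reducing the dimensionality of the input samples and improving the model's performance.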
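
The four problem types can each be shown in a few lines. This is only a sketch on synthetic data: np.polyfit, LogisticRegression, KMeans, and SelectKBest are arbitrary stand-ins chosen for brevity, and the coefficients and cluster positions are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Regression: recover w0, w1, w2 of y = w0 + w1*x + w2*x^2 from noisy samples
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, size=x.size)
w2, w1, w0 = np.polyfit(x, y, deg=2)   # polyfit returns the highest power first
print('w0, w1, w2 =', w0, w1, w2)

# Two well-separated groups of 2-D points, reused by the examples below
points = np.vstack([rng.normal(-4, 1, size=(50, 2)),
                    rng.normal(4, 1, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)

# Classification: the output is a discrete class label
clf = LogisticRegression().fit(points, labels)
predicted_class = clf.predict([[3, 3]])
print('class of (3, 3):', predicted_class)

# Clustering: no labels are given; the groups are found from the inputs alone
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print('cluster of (3, 3):', km.predict([[3, 3]]))

# Dimensionality reduction by feature selection: keep the 1 most informative
# of 2 features, where the second feature is pure noise
features = np.column_stack([points[:, 0], rng.normal(size=100)])
selector = SelectKBest(f_classif, k=1).fit(features, labels)
kept = selector.get_support()
print('features kept:', kept)
```

Cluster numbers from KMeans are arbitrary, so only whether two points share a cluster is meaningful, not the number itself.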
