Overfitting and Regularization in Machine Learning

Logistic Regression in scikit-learn

sklearn.linear_model.LogisticRegression

penalty: 'l1' or 'l2' regularization, 'elasticnet' (a combination of L1 and L2), or 'none' (no regularization)

C: the inverse of the regularization strength; smaller C means stronger regularization

tol: tolerance for the stopping criterion

solver: the optimization algorithm, e.g. 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'

fit_intercept: whether to add a bias term to the linear model (the parameter b)

dual: whether to solve the primal formulation or the dual problem (as in SVMs)

class_weight: weights used to balance the positive and negative samples (a minimal constructor sketch follows below)
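A minimal sketch of how these parameters can be combined; the values below are illustrative assumptions, not recommendations:

from sklearn.linear_model import LogisticRegression

# illustrative settings only: L2 penalty, moderate regularization, liblinear solver,
# bias term enabled, primal formulation, balanced class weights
clf = LogisticRegression(penalty='l2', C=1.0, tol=1e-4, solver='liblinear',
                         fit_intercept=True, dual=False, class_weight='balanced')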

 

Regularization

Linearly separable data: https://www.cnblogs.com/lgdblog/p/6858832.html

If the input data set is linearly separable, the parameter w found by logistic regression will grow toward infinity.

If logistic regression chooses the w corresponding to line 2 (in the figure), w can keep growing and will tend to infinity.

We need to prevent w from becoming very large.

Add a regularization term: the L2 norm of w.
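The L2-regularized objective of logistic regression can be written as follows (a reconstruction using the usual cross-entropy loss, not the original figure):

\min_{w,b} \; -\sum_{i=1}^{n} \Big[ y_i \log \sigma(w^\top x_i + b) + (1 - y_i) \log\big(1 - \sigma(w^\top x_i + b)\big) \Big] + \lambda \|w\|_2^2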

λ is a hyperparameter.

We use cross-validation to search for λ.

We can also use the L1 norm as the regularization term.
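Sketched in the same standard form, the L1-regularized objective replaces the squared L2 norm with the L1 norm:

\min_{w,b} \; L(w, b) + \lambda \|w\|_1, \qquad \|w\|_1 = \sum_j |w_j|

where L(w, b) is the same cross-entropy loss as above.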

Gradient Descent with a Regularization Term
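A sketch of the full-batch update for the L2-regularized objective above, with learning rate η (the gradient of the cross-entropy loss is the standard one, not taken from the original notes):

w \leftarrow w - \eta \Big( \sum_{i=1}^{n} \big(\sigma(w^\top x_i + b) - y_i\big) x_i + 2\lambda w \Big), \qquad b \leftarrow b - \eta \sum_{i=1}^{n} \big(\sigma(w^\top x_i + b) - y_i\big)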

Stochastic Gradient Descent with a Regularization Term
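The stochastic version uses a single randomly sampled example (x_i, y_i) per step (same assumptions as above); note that the regularization term is still applied at every step:

w \leftarrow w - \eta \Big( \big(\sigma(w^\top x_i + b) - y_i\big) x_i + 2\lambda w \Big)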

If the λ of the regularization term is large, then w will become small; if λ is small, w is allowed to stay large.

Overfitting

The parameters of an overfitted model become large. The model performs well on the training data, but its performance on the test data is bad.

Sometimes the symptom is that training produces very large parameter values, such as (5000, 0.4, 89887).

 

Generalization, Overfitting and Regularization

Our goal is to build a model with strong generalization capability, so that it fits the data in the real environment.

1. Choose the right data: we cannot build a good model from data that contains many errors,

for example data that includes a lot of noise.

2. Choose an appropriate model, e.g. a CNN for image recognition.

3. Choose an appropriate optimization method for the model.

4. Avoid overfitting.

Model C (in the figure) is overfitted.

The complexity of the model

The number of parameters

No regularization is used

How do we prevent overfitting?

 

A deeper look at regularization

We use regularization to limit the feasible region of the parameters and thus prevent overfitting.

λ controls the size of the feasible region.
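One way to make this precise (a standard equivalence, sketched here rather than taken from the original figure): the penalized problem corresponds to a constrained problem,

\min_{w} \; L(w) + \lambda \|w\|_2^2 \quad \Longleftrightarrow \quad \min_{w} \; L(w) \;\; \text{s.t.} \;\; \|w\|_2^2 \le t

where a larger λ corresponds to a smaller budget t, i.e. a smaller feasible region.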

 

L1 and L2 Regularization

The L1 norm can help us select features because it induces sparsity: some parameters become exactly zero.

The geometric comparison between L1 and L2

L1 is used when we want to control (reduce) the dimensionality of the features; otherwise we choose L2 regularization.

Checking L1 and L2 regularization in code:

 

# generate random data for a binary classification problem
import numpy as np

np.random.seed(12)
num_observations = 100  # 100 positive and 100 negative samples

# use a Gaussian distribution to generate the samples; first build the covariance matrix
# the training data is 20-dimensional, so the covariance matrix is 20 x 20
rand_m = np.random.rand(20, 20)
# make sure that the covariance matrix is positive semi-definite
cov = np.matmul(rand_m.T, rand_m)

# generate the training samples from two Gaussians with different means
x1 = np.random.multivariate_normal(np.random.rand(20), cov, num_observations)
x2 = np.random.multivariate_normal(np.random.rand(20) + 5, cov, num_observations)

X = np.vstack((x1, x2)).astype(np.float32)
y = np.hstack((np.zeros(num_observations), np.ones(num_observations)))

from sklearn.linear_model import LogisticRegression

# L1 regularization; C is the inverse regularization strength: large C means weak regularization
clf = LogisticRegression(fit_intercept=True, C=0.1, penalty='l1', solver='liblinear')
clf.fit(X, y)

print("(L1) Logistic Regression with parameters:\n", clf.coef_)

# L2 regularization; C is the inverse regularization strength: large C means weak regularization
clf = LogisticRegression(fit_intercept=True, C=0.1, penalty='l2', solver='liblinear')
clf.fit(X, y)

print("(L2) Logistic Regression with parameters:\n", clf.coef_)

The result:

(L1) Logistic Regression with parameters:
 [[ 0.          0.          0.00380416  0.          0.0708005  -0.29471144
   0.         -0.34068405  0.          0.78465456  0.          0.10128216
   0.          0.          0.         -0.04115284  0.          0.
   0.41517583  0.        ]]
(L2) Logistic Regression with parameters:
 [[-0.06020774 -0.08587293  0.06269959  0.0218838   0.36622515 -0.45899841
   0.11456309 -0.44218794 -0.24780618  0.87767764 -0.32403048  0.27800343
   0.34313572  0.16393398 -0.14322159 -0.22759078  0.09331433 -0.22950935
   0.48553032  0.1213868 ]]

Once the objective function includes an L1 regularization term, the optimization becomes more challenging.

A. The main reason is that the L1 norm has no gradient at the point 0, so we need to use the subgradient instead of the gradient.

There is no gradient at the origin.
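For a single weight w, the subgradient of |w| is (a standard definition, added here for completeness):

\partial |w| = \begin{cases} \{+1\}, & w > 0 \\ [-1, +1], & w = 0 \\ \{-1\}, & w < 0 \end{cases}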

B. When several features are correlated, L1 regularization will pick one of them somewhat arbitrarily, but sometimes a discarded feature is very important!

Combining L1 and L2 gives the Elastic Net.

https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
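A minimal sketch of Elastic Net logistic regression in scikit-learn; penalty='elasticnet' requires the 'saga' solver, and the l1_ratio and C values below are illustrative, not tuned:

from sklearn.linear_model import LogisticRegression

# l1_ratio=0.5 mixes L1 and L2 equally (illustrative value)
clf = LogisticRegression(penalty='elasticnet', solver='saga',
                         l1_ratio=0.5, C=0.1, max_iter=5000)
clf.fit(X, y)  # reuse the X, y generated above

print("(ElasticNet) Logistic Regression with parameters:\n", clf.coef_)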

Regularization and Maximum a Posteriori (MAP) Estimation

We will discuss the relationship between MLE and MAP.

MLE (Maximum Likelihood Estimation): we choose the parameter θ that maximizes the probability P(D|θ), where D is the training data set and θ is the parameter of the model.

MLE is used to build the objective function of the model.

Sometimes we need to take prior knowledge into account in the model; this leads to the Maximum a Posteriori (MAP) problem.

We need to maximize P(θ|D), where D is the training data set and θ is the parameter of the model.

We can choose the distribution used as the prior probability.

Compared with MLE, MAP adds one more term: the prior probability, which can be considered the regularization term.
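In formulas (a standard derivation via Bayes' rule, reconstructed here):

\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta) P(\theta)}{P(D)} = \arg\max_{\theta} \big[ \log P(D \mid \theta) + \log P(\theta) \big]

since P(D) does not depend on θ; the extra term log P(θ) is the prior, i.e. the regularization term.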

Constructing the objective function of logistic regression via MAP

For MLE, the objective function is the (negative) log-likelihood of the training data.

Assume that P(θ) ~ N(0, σ²), i.e. the prior follows a Gaussian distribution.

The resulting MAP objective function (with a Gaussian prior) is the same as the MLE objective plus an L2 regularization term.
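Sketch of the step: with θ_j ~ N(0, σ²), we have \log P(\theta) = -\frac{\|\theta\|_2^2}{2\sigma^2} + \text{const}, so

\hat{\theta}_{MAP} = \arg\min_{\theta} \Big[ -\log P(D \mid \theta) + \frac{1}{2\sigma^2} \|\theta\|_2^2 \Big]

which is the MLE objective plus an L2 term with λ = 1/(2σ²).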

Assume that P(θ) ~ Laplace(μ, b) with μ = 0, i.e. the prior follows a Laplace distribution.

The resulting MAP objective function (with a Laplace prior) is the same as the MLE objective plus an L1 regularization term.
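Sketch of the step: with θ_j ~ Laplace(0, b), we have \log P(\theta) = -\frac{\|\theta\|_1}{b} + \text{const}, so

\hat{\theta}_{MAP} = \arg\min_{\theta} \Big[ -\log P(D \mid \theta) + \frac{1}{b} \|\theta\|_1 \Big]

which is the MLE objective plus an L1 term with λ = 1/b.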

 

The relationship between MLE and MAP

This means that when the training data size is small, we'd better choose MAP, which includes the regularization term.

When the data size is large, we can use plain MLE.

As the data size goes to infinity, MAP → MLE.

Reference: https://zhuanlan.zhihu.com/p/72370235

 
