Logistic Regression with scikit-learn
sklearn.linear_model.LogisticRegression
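penalty: the regularization type: l1, l2, elasticnet (a combination of L1 and L2), or none (no regularization)
C: the inverse of the regularization strength; a smaller C means stronger regularization
tol: tolerance for the stopping criterion
solver: the optimization algorithm, e.g. lbfgs, liblinear, newton-cg, sag, saga. Plain SGD is not one of these solvers; it is provided separately by sklearn.linear_model.SGDClassifier
fit_intercept: whether to add a bias term to the linear model (parameter b)
dual: whether to solve the primal or the dual formulation (as in SVM); the dual is only implemented for the l2 penalty with the liblinear solver
class_weight: per-class weights, used to balance the positive and negative samples when the classes are imbalanced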
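As a minimal sketch of how these parameters fit together, the snippet below trains a classifier on a tiny made-up data set. The data, the variable names, and the chosen values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data (made up for illustration): class 1 has larger feature values.
X_toy = np.array([[0.0], [0.5], [1.0], [1.5], [3.0], [3.5], [4.0], [4.5]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf_toy = LogisticRegression(
    penalty="l2",             # type of regularization
    C=1.0,                    # inverse of the regularization strength
    tol=1e-4,                 # tolerance for the stopping criterion
    solver="lbfgs",           # optimization algorithm
    fit_intercept=True,       # learn the bias term b
    class_weight="balanced",  # reweight classes by inverse frequency
)
clf_toy.fit(X_toy, y_toy)

print("w =", clf_toy.coef_, "b =", clf_toy.intercept_)
print(clf_toy.predict([[0.2], [4.2]]))
```

The learned w is stored in coef_ and the bias b in intercept_.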
Regularization
Linearly separable data: https://www.cnblogs.com/lgdblog/p/6858832.html
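If the input data set is linearly separable, logistic regression without regularization drives the parameter w toward infinity.
If logistic regression picks the w of a line that separates the two classes perfectly (line 2), the likelihood keeps increasing as w is scaled up, so w diverges to infinity.
We need to prevent w from becoming very large.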
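The effect can be observed numerically. The sketch below fits logistic regression on two well-separated synthetic clusters while weakening the regularization step by step (a larger C means weaker regularization). The clusters, the grid of C values, and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated 2-D clusters (synthetic, linearly separable).
rng = np.random.default_rng(0)
X_sep = np.vstack([rng.normal(-3.0, 0.5, size=(50, 2)),
                   rng.normal(3.0, 0.5, size=(50, 2))])
y_sep = np.hstack([np.zeros(50), np.ones(50)])

# A larger C means weaker regularization; watch the norm of w grow.
norms = []
for C in [0.01, 1.0, 100.0, 10000.0]:
    model = LogisticRegression(C=C, solver="liblinear").fit(X_sep, y_sep)
    norms.append(np.linalg.norm(model.coef_))
    print(f"C = {C:>8}: ||w|| = {norms[-1]:.3f}")
```

With regularization removed entirely, the norm would keep growing for as long as the optimizer runs.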
Add an L2-norm regularization term to the objective.
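Written out, and assuming the usual conventions (labels y_i in {0, 1}, n training samples, and the sigmoid function σ, all of which are notation chosen for this sketch), the L2-regularized objective can be expressed as:

```latex
\hat{w}, \hat{b} = \arg\min_{w, b} \;
-\sum_{i=1}^{n} \Big[ y_i \log \sigma(w^\top x_i + b)
+ (1 - y_i) \log\big(1 - \sigma(w^\top x_i + b)\big) \Big]
+ \lambda \lVert w \rVert_2^2
```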
λ is a hyperparameter.
Use cross-validation to search for a good value of λ.
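A minimal sketch of such a search with scikit-learn follows. In scikit-learn the search runs over C, which is inversely related to λ. The synthetic data, the candidate grid, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data: two overlapping 5-D Gaussian clusters.
rng = np.random.default_rng(0)
X_cv = np.vstack([rng.normal(-1.0, 1.0, (100, 5)), rng.normal(1.0, 1.0, (100, 5))])
y_cv = np.hstack([np.zeros(100), np.ones(100)])

# Candidate values of C (inverse regularization strength), on a log scale.
param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_cv, y_cv)

print("best C:", search.best_params_["C"])
print("mean CV accuracy:", search.best_score_)
```

Each candidate is scored by 5-fold cross-validation and the value with the best mean accuracy is kept.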
We can also use the L1 norm as the regularization term.
Gradient Descent with the Regularization Term
Stochastic Gradient Descent with the Regularization Term
If the λ of the regularization term is large, w becomes small; if λ is small, w can become large.
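The two update rules can be sketched in NumPy as follows, assuming the objective is the sum of the log-losses plus λ‖w‖², so that the penalty contributes 2λw to the gradient. The function names, learning rates, and synthetic data are assumptions made for this example, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(X, y, lam, lr=0.002, n_steps=2000):
    """Batch gradient descent on: sum of log-losses + lam * ||w||^2."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):
        err = sigmoid(X @ w + b) - y           # prediction error, shape (n,)
        w -= lr * (X.T @ err + 2.0 * lam * w)  # data gradient + penalty gradient
        b -= lr * err.sum()                    # the bias is not regularized
    return w, b

def fit_sgd(X, y, lam, lr=0.01, n_epochs=50, seed=0):
    """SGD: one sample per step; the penalty gradient is split evenly over n samples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            err = sigmoid(X[i] @ w + b) - y[i]
            w -= lr * (err * X[i] + 2.0 * lam * w / n)
            b -= lr * err
    return w, b

# Two overlapping 2-D Gaussian clusters (synthetic).
rng = np.random.default_rng(1)
X_gd = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y_gd = np.hstack([np.zeros(100), np.ones(100)])

gd_norms, sgd_norms = [], []
for lam in [0.1, 10.0, 100.0]:
    w_gd, _ = fit_gd(X_gd, y_gd, lam)
    w_sgd, _ = fit_sgd(X_gd, y_gd, lam)
    gd_norms.append(np.linalg.norm(w_gd))
    sgd_norms.append(np.linalg.norm(w_sgd))
    print(f"lambda = {lam:>5}: ||w|| GD = {gd_norms[-1]:.3f}, SGD = {sgd_norms[-1]:.3f}")
```

The printed norms show the relationship between λ and the size of w: the larger λ is, the smaller the learned w.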
Overfitting
The parameters of an overfitted model tend to become large. Such a model performs well on the training data but poorly on the test data.
Sometimes the reason is that training produces very large parameter values, such as (5000, 0.4, 89887).
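A small experiment makes the gap between training and test performance visible. In the sketch below the features and labels are pure noise, so there is nothing real to learn, yet a weakly regularized model with more features than samples still memorizes the training set. The sizes, the value of C, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, n_features = 40, 2000, 200

# Features and labels are independent noise, so nothing real can be learned.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.integers(0, 2, size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, 2, size=n_test)

# Very weak regularization (large C) lets the model memorize the training set.
model = LogisticRegression(C=1e4, solver="liblinear").fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
print(f"||w|| = {np.linalg.norm(model.coef_):.2f}")
```

A high training accuracy combined with a test accuracy near chance level is the signature of overfitting.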
Generalization, Overfitting and Regularization
Our goal is to build a model with strong generalization capability, so that it fits the data in the real environment.
1. Choose the correct data. We can't build a good model from data that contains a lot of errors,
such as data with a lot of noise.
2. Choose an appropriate model, e.g. a CNN for image recognition.
3. Choose an appropriate optimization method for the model.
4. Avoid overfitting.
Model C is overfitting.
The complexity of the model
The number of parameters
Lack of regularization
How to prevent overfitting?
A deeper look at regularization
We use regularization to limit the feasible region of the parameters and thus prevent overfitting.
λ controls how tightly the feasible region is limited.
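One common way to make the idea of a limited feasible region concrete is the constrained form of the problem. Here L(w) denotes the unregularized loss and t the size of the allowed region; both symbols are notation assumed for this sketch.

```latex
\min_{w} \; L(w) \quad \text{subject to} \quad \lVert w \rVert_2^2 \le t
\qquad \Longleftrightarrow \qquad
\min_{w} \; L(w) + \lambda \lVert w \rVert_2^2
```

For a convex loss, each radius t corresponds to some λ ≥ 0: a larger λ matches a smaller t, that is, a tighter feasible region.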
L1 and L2 Regularization
The L1 norm can help us select features because it induces sparsity: some parameters become exactly zero.
The geometric interpretation of L1 and L2
L1 is used to reduce the dimension of the training data (feature selection); otherwise we choose L2 regularization.
Check L1 and L2 regularization:
# generate random data for a binary classification problem (200 samples in total)
import numpy as np
np.random.seed(12)
num_observations = 100 # generate 100 positive and 100 negative samples
# sample from Gaussian distributions; first build the covariance matrix
# the data is 20-dimensional, so the covariance matrix is 20 x 20
rand_m = np.random.rand(20, 20)
# make sure that the covariance matrix is positive semi-definite
cov = np.matmul(rand_m.T, rand_m)
# generate the training samples from the Gaussian distributions
x1 = np.random.multivariate_normal(np.random.rand(20), cov, num_observations)
x2 = np.random.multivariate_normal(np.random.rand(20) + 5, cov, num_observations)
X = np.vstack((x1, x2)).astype(np.float32)
y = np.hstack((np.zeros(num_observations), np.ones(num_observations)))
from sklearn.linear_model import LogisticRegression
# L1 regularization; C controls its strength: the larger C is, the weaker the regularization
clf = LogisticRegression(fit_intercept = True, C = 0.1, penalty = 'l1', solver = 'liblinear')
clf.fit(X, y)
print("(L1) Logistic Regression with parameters:\n", clf.coef_)
# L2 regularization; C controls its strength: the larger C is, the weaker the regularization
clf = LogisticRegression(fit_intercept = True, C = 0.1, penalty = 'l2', solver = 'liblinear')
clf.fit(X, y)
print("(L2) Logistic Regression with parameters:\n", clf.coef_)
The result:
(L1) Logistic Regression with parameters:
[[ 0. 0. 0.00380416 0. 0.0708005 -0.29471144
0. -0.34068405 0. 0.78465456 0. 0.10128216
0. 0. 0. -0.04115284 0. 0.
0.41517583 0. ]]
(L2) Logistic Regression with parameters:
[[-0.06020774 -0.08587293 0.06269959 0.0218838 0.36622515 -0.45899841
0.11456309 -0.44218794 -0.24780618 0.87767764 -0.32403048 0.27800343
0.34313572 0.16393398 -0.14322159 -0.22759078 0.09331433 -0.22950935
0.48553032 0.1213868 ]]
Once the objective function includes the L1 regularization term, optimization becomes more challenging.
A. The main reason is that the L1 norm is not differentiable at 0 (there is no gradient at the origin), so we need to use a subgradient instead of the gradient there.
B. When several features are correlated, L1 regularization tends to keep one of them more or less at random and drop the rest, even though a dropped feature may be very important.
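For point A, the subgradient of the absolute value can be written explicitly. The coordinate index j is notation assumed here.

```latex
\partial \lvert w_j \rvert =
\begin{cases}
\{+1\} & \text{if } w_j > 0 \\
[-1, 1] & \text{if } w_j = 0 \\
\{-1\} & \text{if } w_j < 0
\end{cases}
```

At w_j = 0 any value in [-1, 1] is a valid subgradient, which is what allows a coordinate to stay exactly at zero.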
Combining L1 and L2 into Elastic Net
https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
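A minimal sketch of Elastic Net regularization for logistic regression in scikit-learn follows, assuming a version that accepts penalty="elasticnet" together with the saga solver and the l1_ratio parameter. The synthetic data, the value of C, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic 20-D data: the two classes differ slightly in every feature.
rng = np.random.default_rng(0)
X_en = np.vstack([rng.normal(0.0, 1.0, (100, 20)), rng.normal(0.5, 1.0, (100, 20))])
y_en = np.hstack([np.zeros(100), np.ones(100)])
X_en = StandardScaler().fit_transform(X_en)  # saga converges faster on scaled features

# l1_ratio = 0 is pure L2, l1_ratio = 1 is pure L1, values in between mix the two.
nonzeros = {}
for l1_ratio in [0.0, 0.5, 1.0]:
    model = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=l1_ratio, C=0.05, max_iter=5000)
    model.fit(X_en, y_en)
    nonzeros[l1_ratio] = int(np.count_nonzero(model.coef_))
    print(f"l1_ratio = {l1_ratio}: {nonzeros[l1_ratio]} non-zero coefficients out of 20")
```

The l1_ratio parameter sets the balance between the sparsity of L1 and the smoother shrinkage of L2.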
Regularization and Maximum a Posteriori (MAP) Estimation
We will discuss the relationship between MLE and MAP.
MLE (Maximum Likelihood Estimation): we choose the parameter θ that maximizes the probability P(D|θ), where D is the training data set and θ is the parameter of the model.
It is used to build the objective function of the model.
Sometimes we need to take prior knowledge into account in our model, which leads to the maximum a posteriori (MAP) problem.
We need to maximize P(θ|D), where D is the training data set and θ is the parameter of the model.
We can choose the distribution of the prior probability P(θ).
Compared with MLE, MAP adds one more term (the prior probability, which acts as the regularization term).
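The extra term follows directly from Bayes' rule. Since P(D) does not depend on θ, it drops out of the maximization:

```latex
\hat{\theta}_{\text{MAP}}
= \arg\max_{\theta} P(\theta \mid D)
= \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
= \arg\max_{\theta} \big[ \log P(D \mid \theta) + \log P(\theta) \big]
```

The first term is the MLE objective and the second is the log-prior.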
Constructing the objective function of logistic regression with MAP
For MLE
Assume the prior is Gaussian: P(θ) ~ N(0, σ^2).
The MAP objective function under a Gaussian prior is then the same as MLE plus an L2 regularization term.
Assume the prior is Laplace: P(θ) ~ Laplace(μ, b) with μ = 0.
The MAP objective function under a Laplace prior is then the same as MLE plus an L1 regularization term.
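As a sketch of why the two priors give these penalties, assume each component of θ has an independent zero-mean prior. Taking the log of the prior and dropping constants gives:

```latex
\begin{aligned}
\text{Gaussian prior: } & \log P(\theta) = -\frac{\lVert \theta \rVert_2^2}{2\sigma^2} + \text{const}
&& \Rightarrow \quad \hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \Big[ -\log P(D \mid \theta) + \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 \Big] \\
\text{Laplace prior: } & \log P(\theta) = -\frac{\lVert \theta \rVert_1}{b} + \text{const}
&& \Rightarrow \quad \hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \Big[ -\log P(D \mid \theta) + \frac{1}{b} \lVert \theta \rVert_1 \Big]
\end{aligned}
```

The regularization strength is therefore λ = 1/(2σ²) for the Gaussian prior and λ = 1/b for the Laplace prior: a narrower prior means stronger regularization.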
The relationship between the MLE and MAP
This means that when the training data set is small, we should prefer MAP, which includes the regularization term.
When the data set is large, MLE alone is sufficient.
As the data size tends to infinity, the MAP estimate converges to the MLE.
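The convergence can be checked on a model simpler than logistic regression, where both estimates have closed forms. The sketch below uses a coin-flip model with a Beta(a, b) prior, a swapped-in example chosen for illustration; the prior parameters and the 70% head rate are assumptions.

```python
# Coin-flip model with a Beta(a, b) prior on the head probability theta.
# MLE = k / n, MAP = (k + a - 1) / (n + a + b - 2), where k heads were seen in n flips.
a, b = 2.0, 2.0  # prior pseudo-counts (illustrative choice)

gaps = []
for n in [10, 100, 1000, 10000]:
    k = 7 * n // 10                      # pretend 70% of the flips were heads
    mle = k / n
    map_est = (k + a - 1) / (n + a + b - 2)
    gaps.append(abs(map_est - mle))
    print(f"n = {n:>5}: MLE = {mle:.4f}, MAP = {map_est:.4f}, gap = {gaps[-1]:.5f}")
```

The prior contributes a fixed number of pseudo-counts, so its influence shrinks as n grows and the gap between the two estimates tends to zero.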
Reference: https://zhuanlan.zhihu.com/p/72370235