After adding an L1 penalty to the loss function, the objective becomes J(θ) = L(θ) + c‖θ‖₁. When θ > 0, the gradient of c‖θ‖₁ equals c; when θ < 0, it equals −c. Therefore, if the gradient of L(θ) lies within (−c, c), the gradient of J(θ) is always negative for θ < 0, so J(θ) is monotonically decreasing to the left of the origin; and it is always positive for θ > 0, so J(θ) is monotonically increasing to the right of the origin. The minimum is therefore attained at θ = 0.
By contrast, the L2 penalty has zero derivative at the origin, so the gradient of J(θ) at the origin equals zero iff the gradient of L(θ) at the origin equals zero. Sparse solutions are therefore much less likely under L2-norm regularisation than under L1-norm regularisation.
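The contrast can be checked numerically in one dimension. A minimal sketch, assuming L(θ) = ½(θ − a)² with illustrative values a = 0.3 and c = 1.0 (my choice, not from the text) so that |L′(0)| = |a| < c holds:

```python
import numpy as np

# 1-D illustration: L(theta) = 0.5 * (theta - a)**2, so L'(0) = -a,
# and |L'(0)| < c for the values chosen below.
a, c = 0.3, 1.0

theta = np.linspace(-2.0, 2.0, 400001)   # dense grid containing 0
L = 0.5 * (theta - a) ** 2

J_l1 = L + c * np.abs(theta)             # L1-regularised objective
J_l2 = L + c * theta ** 2                # L2-regularised objective

theta_l1 = theta[np.argmin(J_l1)]        # minimiser under L1: 0 (sparse)
theta_l2 = theta[np.argmin(J_l2)]        # minimiser under L2: a/(1+2c) != 0

print(round(float(theta_l1), 6), round(float(theta_l2), 6))  # → 0.0 0.1
```

The L1 minimiser lands exactly at the origin, matching the monotonicity argument above, while the L2 minimiser is merely shrunk toward zero (to a/(1+2c)) without reaching it.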
Review soft-thresholding and the simplified LASSO problem [2]: min_β ½‖y − β‖₂² + λ‖β‖₁.
Let v ∈ ∂‖β‖₁. The subgradient optimality condition y − β = λv gives, coordinate-wise: yi − βi = λ sign(βi) if βi ≠ 0, and |yi − βi| ≤ λ if βi = 0.
When βi > 0, yi − βi = λ, so βi = yi − λ, requiring yi − λ > 0; when βi < 0, yi − βi = −λ, so βi = yi + λ, requiring yi + λ < 0; when βi = 0, |yi| ≤ λ. Combining the three cases leads to the soft-thresholding operator: Sλ(y)i = yi − λ if yi > λ; 0 if −λ ≤ yi ≤ λ; yi + λ if yi < −λ.
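The three cases above collapse into one vectorised expression. A minimal sketch (the function name `soft_threshold` and the test vector are mine, for illustration), with each coordinate verified against a brute-force grid minimisation of the 1-D objective:

```python
import numpy as np

# Soft-thresholding operator S_lambda: the closed-form solution of
# min_b 0.5*||y - b||_2^2 + lam*||b||_1, applied element-wise.
def soft_threshold(y, lam):
    """Shrink |y_i| by lam, zeroing every coordinate with |y_i| <= lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, 0.4, -0.2, -1.5])
beta = soft_threshold(y, lam=0.5)

# check each coordinate against a grid search over the separable objective
grid = np.linspace(-4.0, 4.0, 80001)
for yi, bi in zip(y, beta):
    obj = 0.5 * (yi - grid) ** 2 + 0.5 * np.abs(grid)
    assert abs(grid[np.argmin(obj)] - bi) < 1e-3

print(beta)
```

Because the objective is separable across coordinates, the element-wise operator solves the full vector problem; the middle two inputs fall inside [−λ, λ] and are set exactly to zero.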