FTRL paper: https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
Motivation
Problems with full-batch retraining: the sample volume is large, so training takes a long time; the feature set is large, so synchronization takes a long time; retraining on the full data every day is expensive, and the updated model goes live late.
Benefits of incremental (online) training: it is cheap, and the updated model takes effect quickly.
Development
SGD → OGD → FOBOS → RDA → FTRL → FTML
Full code: https://github.com/YEN-GitHub/OnlineLearning_BasicAlgorithm
SGD: gradient descent (with a constant learning rate)
Below, each online-learning variant is implemented with logistic regression as the running example.
Logistic regression
1. The objective is the cross-entropy loss: L(y, ŷ) = -[y·log ŷ + (1-y)·log(1-ŷ)]
2. The prediction is ŷ = σ(wᵀx) = 1 / (1 + e^(-wᵀx))
3. The gradient update is w ← w - η·(ŷ - y)·x
The code is as follows:
import numpy as np

class LR(object):
    @staticmethod
    def fn(w, x):
        '''sigmoid function'''
        return 1.0 / (1.0 + np.exp(-w.dot(x)))

    @staticmethod
    def loss(y, y_hat):
        '''cross-entropy loss function'''
        return np.sum(np.nan_to_num(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)))

    @staticmethod
    def grad(y, y_hat, x):
        '''gradient of the loss w.r.t. w'''
        return (y_hat - y) * x
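As a quick sanity check of these three pieces, a minimal sketch (the toy weight and feature vectors below are made up for illustration):

```python
import numpy as np

w = np.array([0.5, -0.25, 0.1, 0.0])  # hypothetical weights
x = np.array([1.0, 2.0, 0.5, 1.0])    # hypothetical feature vector
y = 1.0

y_hat = 1.0 / (1.0 + np.exp(-w.dot(x)))                  # sigmoid prediction
loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)  # cross-entropy
g = (y_hat - y) * x                                      # gradient w.r.t. w
print(y_hat, loss, g)
```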
OGD (the learning rate decays with the training step)
The update rule is: w_{t+1} = w_t - (α/√(t+1))·g_t
The code is as follows (note: decisionFunc=LR, where LR is the class defined above):
class OGD(object):
    def __init__(self, alpha, decisionFunc=LR):
        self.alpha = alpha
        self.w = np.zeros(4)
        self.decisionFunc = decisionFunc

    def predict(self, x):
        return self.decisionFunc.fn(self.w, x)

    def update(self, x, y, step):
        y_hat = self.predict(x)
        g = self.decisionFunc.grad(y, y_hat, x)
        learning_rate = self.alpha / np.sqrt(step + 1)  # damped step size
        # SGD update rule: theta = theta - learning_rate * gradient
        self.w = self.w - learning_rate * g
        return self.decisionFunc.loss(y, y_hat)
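A minimal driver loop for the class above (the LR/OGD definitions are repeated so the snippet runs standalone; the synthetic data generator and hyperparameters are my own choices):

```python
import numpy as np

class LR(object):
    @staticmethod
    def fn(w, x):
        return 1.0 / (1.0 + np.exp(-w.dot(x)))  # sigmoid
    @staticmethod
    def loss(y, y_hat):
        return np.nan_to_num(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    @staticmethod
    def grad(y, y_hat, x):
        return (y_hat - y) * x

class OGD(object):
    def __init__(self, alpha, decisionFunc=LR):
        self.alpha = alpha
        self.w = np.zeros(4)
        self.decisionFunc = decisionFunc
    def predict(self, x):
        return self.decisionFunc.fn(self.w, x)
    def update(self, x, y, step):
        y_hat = self.predict(x)
        g = self.decisionFunc.grad(y, y_hat, x)
        self.w -= self.alpha / np.sqrt(step + 1) * g  # damped step size
        return self.decisionFunc.loss(y, y_hat)

rng = np.random.default_rng(42)
w_true = np.array([1.0, -1.0, 0.5, 0.0])  # hypothetical ground-truth weights
model = OGD(alpha=0.5)
losses = []
for step in range(1000):
    x = rng.normal(size=4)
    y = 1.0 if x @ w_true > 0 else 0.0
    losses.append(model.update(x, y, step))
print(np.mean(losses[:100]), np.mean(losses[-100:]))  # average loss should fall
```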
TG (truncated gradient: truncate the parameters to produce sparse solutions)
SGD cannot guarantee a sparse solution, yet large data sets with high-dimensional features need one. The most direct remedy is truncated gradient: when a parameter falls below a threshold, either set it to zero or move it one step toward zero; the truncation can be applied once every K steps.
The code is as follows:
def update(self, x, y, step):
    y_hat = self.predict(x)
    g = self.decisionFunc.grad(y, y_hat, x)
    learning_rate = self.alpha / np.sqrt(step + 1)  # damped step size
    if step % self.K == 0:
        # truncate once every K steps
        temp_lambda = self.K * self.lambda_
        for i in range(len(self.w)):
            w_e_g = self.w[i] - learning_rate * g[i]
            if 0 < w_e_g < self.theta:
                self.w[i] = max(0, w_e_g - learning_rate * temp_lambda)
            elif -self.theta < w_e_g < 0:
                self.w[i] = min(0, w_e_g + learning_rate * temp_lambda)
            else:
                self.w[i] = w_e_g
    else:
        # plain SGD update: theta = theta - learning_rate * gradient
        self.w = self.w - learning_rate * g
    return self.decisionFunc.loss(y, y_hat)
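The truncation rule itself can be isolated into a small helper; a sketch (the function name `truncate` and the sample numbers are assumptions for illustration):

```python
import numpy as np

def truncate(w, lr, lmbda, theta):
    """One application of the TG truncation: coefficients whose magnitude
    falls below theta are pulled toward zero by lr * lmbda, clipping at zero."""
    out = w.copy()
    for i in range(len(out)):
        if 0 < out[i] < theta:
            out[i] = max(0.0, out[i] - lr * lmbda)
        elif -theta < out[i] < 0:
            out[i] = min(0.0, out[i] + lr * lmbda)
    return out

w = np.array([0.005, -0.003, 0.8, -0.5])
w_trunc = truncate(w, lr=0.1, lmbda=0.1, theta=0.01)
print(w_trunc)  # the two small weights snap to exactly zero; large ones pass through
```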
FOBOS
L1-FOBOS can be viewed as a special case of TG: with θ = ∞ and K = 1, L1-FOBOS is equivalent to TG. Like TG, each update takes one step along the gradient and then one step toward zero, but relative to TG, FOBOS comes with a proper optimization-theoretic framework. The weight update splits into two steps:
w_{t+1/2} = w_t - η_t·g_t (gradient step), then w_{t+1} = argmin_w { (1/2)·‖w - w_{t+1/2}‖² + η_{t+1/2}·λ·‖w‖₁ } (proximal step)
See this post for the derivation: https://zr9558.com/2016/01/12/forward-backward-splitting-fobos/
which yields the closed-form update: w_{t+1,i} = sign(w_{t+1/2,i}) · max(0, |w_{t+1/2,i}| - η_{t+1/2}·λ)
In code:
def update(self, x, y, step):
    y_hat = self.predict(x)
    g = self.decisionFunc.grad(y, y_hat, x)
    learning_rate = self.alpha / np.sqrt(step + 1)    # step size eta_t
    learning_rate_p = self.alpha / np.sqrt(step + 2)  # step size eta_{t+1/2}
    for i in range(len(x)):
        w_e_g = self.w[i] - learning_rate * g[i]  # gradient step
        # soft-thresholding (proximal step for the L1 term)
        self.w[i] = np.sign(w_e_g) * max(0., np.abs(w_e_g) - learning_rate_p * self.lambda_)
    return self.decisionFunc.loss(y, y_hat)
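The per-coordinate operation in that loop is the standard soft-thresholding (proximal) operator; a standalone sketch (the helper name `soft_threshold` and the sample vector are assumptions):

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding: shrink each coordinate toward zero by tau, clip at zero."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - tau)

v = np.array([0.3, -0.05, 0.0, -1.2])
s = soft_threshold(v, 0.1)
print(s)  # coordinates within tau of zero become exactly zero
```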
RDA
Unlike L1-FOBOS, L1-RDA considers the average of all accumulated gradients, which makes it more likely to yield sparse solutions. For a detailed introduction to the algorithm, see:
https://zr9558.com/2016/01/12/regularized-dual-averaging-algorithm-rda/
The optimization objective is: w_{t+1} = argmin_w { ḡ_t·w + λ·‖w‖₁ + (γ/(2√t))·‖w‖² }, where ḡ_t = (1/t)·Σ_{s≤t} g_s is the average gradient.
The resulting update is: w_{t+1,i} = 0 if |ḡ_{t,i}| < λ, otherwise w_{t+1,i} = -(√t/γ)·(ḡ_{t,i} - λ·sign(ḡ_{t,i}))
In code:
def update(self, x, y, step):
    y_hat = self.predict(x)
    g = self.decisionFunc.grad(y, y_hat, x)
    # running average of all gradients seen so far
    self.g = (step - 1) / step * self.g + (1 / step) * g
    for i in range(len(x)):
        if abs(self.g[i]) < self.lambda_:
            self.w[i] = 0  # truncation produces sparsity
        else:
            self.w[i] = -(np.sqrt(step) / self.gamma) * (self.g[i] - self.lambda_ * np.sign(self.g[i]))
    return self.decisionFunc.loss(y, y_hat)
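The closed-form RDA solve can be checked in isolation; a sketch (the helper name `rda_update` and the sample numbers are made up):

```python
import numpy as np

def rda_update(g_bar, step, lambda_, gamma):
    """Closed-form L1-RDA solve: zero out coordinates whose average gradient
    is below lambda_, otherwise apply the scaled shrinkage."""
    w = np.zeros_like(g_bar)
    mask = np.abs(g_bar) >= lambda_
    w[mask] = -(np.sqrt(step) / gamma) * (g_bar[mask] - lambda_ * np.sign(g_bar[mask]))
    return w

g_bar = np.array([0.05, -0.3, 0.12])  # hypothetical average gradient
w_new = rda_update(g_bar, step=4, lambda_=0.1, gamma=2.0)
print(w_new)  # the first coordinate stays at exactly zero
```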
FTRL
FTRL combines RDA's and FOBOS's treatment of the gradients and the regularization: it accumulates all past gradients, and it applies L1 and L2 regularization together. The optimization objective is:
w_{t+1} = argmin_w { g_{1:t}·w + (1/2)·Σ_{s≤t} σ_s·‖w - w_s‖² + λ₁·‖w‖₁ + (λ₂/2)·‖w‖² }
where Σ_{s≤t} σ_s = √(n_t)/α defines the per-coordinate learning-rate schedule. With z_t = g_{1:t} - Σ_{s≤t} σ_s·w_s, each coordinate has the closed-form solution: w_i = 0 if |z_i| ≤ λ₁, otherwise w_i = (sign(z_i)·λ₁ - z_i) / (λ₂ + (β + √(n_i))/α), which is what the code below implements:
def update(self, x, y):
    # lazily solve the per-coordinate closed form for w from z and q
    self.w = np.array([
        0.0 if np.abs(self.z[i]) <= self.l1
        else (np.sign(self.z[i]) * self.l1 - self.z[i])
             / (self.l2 + (self.beta + np.sqrt(self.q[i])) / self.alpha)
        for i in range(self.dim)])
    y_hat = self.predict(x)
    g = self.decisionFunc.grad(y, y_hat, x)
    # sigma encodes the change in the per-coordinate learning rate
    sigma = (np.sqrt(self.q + g * g) - np.sqrt(self.q)) / self.alpha
    self.z += g - sigma * self.w
    self.q += g * g  # accumulated squared gradients (n_i in the paper)
    return self.decisionFunc.loss(y, y_hat)
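Folding that snippet into a runnable class gives an end-to-end sketch (the class layout, toy data generator, and hyperparameters are my own choices, not from the paper):

```python
import numpy as np

class FTRL:
    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.5, l2=1.0):
        self.dim, self.alpha, self.beta = dim, alpha, beta
        self.l1, self.l2 = l1, l2
        self.z = np.zeros(dim)  # accumulated gradients minus proximal correction
        self.q = np.zeros(dim)  # accumulated squared gradients (n_i in the paper)
        self.w = np.zeros(dim)

    def predict(self, x):
        return 1.0 / (1.0 + np.exp(-self.w.dot(x)))  # sigmoid

    def update(self, x, y):
        # lazy closed-form solve for w from z and q
        self.w = np.array([
            0.0 if abs(self.z[i]) <= self.l1
            else (np.sign(self.z[i]) * self.l1 - self.z[i])
                 / (self.l2 + (self.beta + np.sqrt(self.q[i])) / self.alpha)
            for i in range(self.dim)])
        y_hat = self.predict(x)
        g = (y_hat - y) * x
        sigma = (np.sqrt(self.q + g * g) - np.sqrt(self.q)) / self.alpha
        self.z += g - sigma * self.w
        self.q += g * g
        eps = 1e-12  # numerical guard for the log
        return -y * np.log(y_hat + eps) - (1 - y) * np.log(1 - y_hat + eps)

rng = np.random.default_rng(0)
model = FTRL(dim=4)
for _ in range(2000):
    x = rng.normal(size=4)
    y = 1.0 if x[0] + x[1] > 0 else 0.0  # only the first two features carry signal
    model.update(x, y)
print(model.w)  # L1 pushes weights on the uninformative features toward zero
```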