Numerical optimization is at the core of much of machine learning. Once you have chosen a model and a dataset, estimating the model's parameters comes down to minimizing a multivariate function f(x):

x∗ = argmin_x f(x)

The solution x∗ of this optimization problem is the optimal parameter setting for the model.

In this post, I focus on how the L-BFGS algorithm solves unconstrained optimization problems; it remains one of the most common methods for machine learning optimization, alongside the also-popular stochastic gradient descent. At the end, I'll also touch on AdaDelta, which I'm partial to.
Note: Throughout the post, I’ll assume you remember multivariable calculus. So if you don’t recall what a gradient or Hessian is,
you’ll want to bone up first.
Newton's Method
Most numerical optimization procedures are iterative algorithms which consider a sequence of 'guesses' x_n which ultimately converge to x∗, the true global minimizer of f. Suppose we have an estimate x_n and we want our next estimate x_{n+1} to have the property that f(x_{n+1}) < f(x_n).
Newton's method is centered around a quadratic approximation of f for points near x_n. Assuming that f is twice-differentiable, we can approximate f for points 'near' a fixed point x using a Taylor expansion:

f(x + Δx) ≈ f(x) + Δxᵀ ∇f(x) + ½ Δxᵀ (∇²f(x)) Δx

where ∇f(x) and ∇²f(x) are the gradient and Hessian of f at the point x. This approximation holds in the limit as ‖Δx‖ → 0. It is the multivariate generalization of the single-variable Taylor polynomial expansion you might remember from calculus.
In order to simplify much of the notation, we're going to think of our iterative algorithm as producing a sequence of such quadratic approximations h_n. Without loss of generality, we can write x_{n+1} = x_n + Δx and rewrite the above equation as

h_n(Δx) = f(x_n) + Δxᵀ g_n + ½ Δxᵀ H_n Δx

where g_n and H_n represent the gradient and Hessian of f at x_n. We want to choose Δx to minimize this local quadratic approximation of f at x_n. Differentiating with respect to Δx yields:
∂h_n(Δx)/∂Δx = g_n + H_n Δx
Recall that any Δx for which ∂h_n(Δx)/∂Δx = 0 is a local extremum of h_n(·). If we assume that H_n is positive definite (psd), then we know this Δx is also the global minimum of h_n(·). Solving for Δx:

Δx = −H_n^{-1} g_n

This suggests −H_n^{-1} g_n as a good direction in which to move x_n. In practice, we set x_{n+1} = x_n − α (H_n^{-1} g_n) for a value of α such that f(x_{n+1}) is 'sufficiently' smaller than f(x_n).
The Iterative Algorithm
The above suggests an iterative algorithm:
NewtonRaphson(f, x_0):
  For n = 0, 1, … (until converged):
    Compute g_n and H_n^{-1} for x_n
    d = H_n^{-1} g_n
    α = min_{α ≥ 0} f(x_n − α d)
    x_{n+1} ← x_n − α d
The computation of the α step size can use any number of line search algorithms. The simplest of these is backtracking line search, where you simply try smaller and smaller values of α until the function value is 'small enough'.
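As a concrete sketch, backtracking line search is only a few lines. The class name, shrink factor, sufficient-decrease constant, and iteration cap below are my illustrative choices, not prescriptions from the post:

```java
import java.util.function.DoubleUnaryOperator;

public class BacktrackingLineSearch {
    // Find a step size alpha for which phi(alpha) is 'sufficiently'
    // smaller than phi(0), where phi(alpha) = f(x_n - alpha * d) and
    // slope0 is the directional derivative of phi at 0 (negative for
    // a descent direction). Alpha shrinks geometrically until the
    // sufficient-decrease (Armijo) condition holds.
    public static double search(DoubleUnaryOperator phi, double slope0) {
        double alpha = 1.0;    // initial step
        double shrink = 0.5;   // geometric shrink factor (illustrative)
        double c = 1e-4;       // sufficient-decrease constant (illustrative)
        double phi0 = phi.applyAsDouble(0.0);
        for (int tries = 0; tries < 50; tries++) {
            if (phi.applyAsDouble(alpha) <= phi0 + c * alpha * slope0) {
                return alpha;  // enough decrease achieved
            }
            alpha *= shrink;
        }
        return alpha;          // give up with a tiny step
    }
}
```

For example, with phi(α) = (1 − α)², the full step α = 1 already satisfies the condition and is returned unchanged.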
In terms of software engineering, we can treat NewtonRaphson as a black box for any twice-differentiable function which satisfies the Java interface:
public interface TwiceDifferentiableFunction {
// compute f(x)
public double valueAt(double[] x);
// compute grad f(x)
public double[] gradientAt(double[] x);
// compute inverse hessian H^-1
public double[][] inverseHessian(double[] x);
}
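To make the loop concrete, here's a bare-bones sketch of a driver over this interface. The class name, the fixed full step (α = 1), and the gradient-norm stopping rule are my illustrative simplifications; a real implementation would use a line search as described above:

```java
// The interface above, repeated here so this sketch compiles on its own.
interface TwiceDifferentiableFunction {
    double valueAt(double[] x);
    double[] gradientAt(double[] x);
    double[][] inverseHessian(double[] x);
}

public class NewtonRaphson {
    // Minimize f from x0 by taking full Newton steps (alpha = 1)
    // until the gradient norm drops below tol.
    public static double[] minimize(TwiceDifferentiableFunction f,
                                    double[] x0, double tol, int maxIter) {
        double[] x = x0.clone();
        for (int n = 0; n < maxIter; n++) {
            double[] g = f.gradientAt(x);
            if (norm(g) < tol) break;
            double[] d = matVec(f.inverseHessian(x), g); // d = H^{-1} g
            for (int i = 0; i < x.length; i++) x[i] -= d[i]; // x <- x - d
        }
        return x;
    }

    static double[] matVec(double[][] m, double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += m[i][j] * v[j];
        return r;
    }

    static double norm(double[] v) {
        double s = 0;
        for (double vi : v) s += vi * vi;
        return Math.sqrt(s);
    }

    // Demo: f(x) = (x0 - 3)^2 + 2 (x1 + 1)^2, minimized at (3, -1).
    public static double[] quadraticDemo() {
        TwiceDifferentiableFunction f = new TwiceDifferentiableFunction() {
            public double valueAt(double[] x) {
                return (x[0] - 3) * (x[0] - 3) + 2 * (x[1] + 1) * (x[1] + 1);
            }
            public double[] gradientAt(double[] x) {
                return new double[]{2 * (x[0] - 3), 4 * (x[1] + 1)};
            }
            public double[][] inverseHessian(double[] x) {
                return new double[][]{{0.5, 0}, {0, 0.25}};
            }
        };
        return minimize(f, new double[]{0, 0}, 1e-8, 50);
    }
}
```

On the demo quadratic, a single Newton step lands exactly on the minimizer (3, −1), since the quadratic model is exact there.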
With quite a bit of tedious math, you can prove that for a convex function the above procedure will converge to a unique global minimizer x∗, regardless of the choice of x_0. For the non-convex functions that arise in ML (almost all latent variable models and deep nets), the procedure still works but is only guaranteed to converge to a local minimum. In practice, for non-convex optimization, users need to pay more attention to initialization and other algorithmic details.
Huge Hessians
The central issue with NewtonRaphson is that we need to be able to compute the inverse Hessian matrix. Note that for ML applications, the dimensionality of the input to f typically corresponds to the number of model parameters. It's not unusual to have hundreds of millions of parameters, or in some vision applications even billions. For these reasons, computing the Hessian or its inverse is often impractical. For many functions, the Hessian may not even be analytically computable, let alone representable.

Because of these reasons, NewtonRaphson is rarely used in practice to optimize functions corresponding to large problems. Luckily, the above algorithm can still work even if H_n^{-1} doesn't correspond to the exact inverse Hessian at x_n, but is instead a good approximation of it.
Quasi-Newton
Suppose that instead of requiring H_n^{-1} to be the exact inverse Hessian at x_n, we think of it as an approximation of this information. We can generalize NewtonRaphson to take a QuasiUpdate policy which is responsible for producing a sequence of H_n^{-1}:

QuasiNewton(f, x_0, H_0^{-1}, QuasiUpdate):
  For n = 0, 1, … (until converged):
    // Compute search direction and step-size
    d = H_n^{-1} g_n
    α ← min_{α ≥ 0} f(x_n − α d)
    x_{n+1} ← x_n − α d
    // Store the input and gradient deltas
    g_{n+1} ← ∇f(x_{n+1})
    s_{n+1} ← x_{n+1} − x_n
    y_{n+1} ← g_{n+1} − g_n
    // Update inverse hessian
    H_{n+1}^{-1} ← QuasiUpdate(H_n^{-1}, s_{n+1}, y_{n+1})
We've assumed that QuasiUpdate only requires the previous inverse Hessian estimate as well as the input and gradient differences (s_n and y_n respectively). Note that if QuasiUpdate just returns the exact inverse Hessian (∇²f(x_{n+1}))^{-1}, we recover exact NewtonRaphson.
In terms of software, we can black-box optimize an arbitrary differentiable function (with no need to compute second derivatives) using QuasiNewton, assuming we are given a quasi-Newton approximation update policy. In Java this might look like:
public interface DifferentiableFunction {
// compute f(x)
public double valueAt(double[] x);
// compute grad f(x)
public double[] gradientAt(double[] x);
}
public interface QuasiNewtonApproximation {
// update the H^{-1} estimate (using x_{n+1}-x_n and grad_{n+1}-grad_n)
public void update(double[] deltaX, double[] deltaGrad);
// H^{-1} (direction) using the current H^{-1} estimate
public double[] inverseHessianMultiply(double[] direction);
}
Note that the only use we have for the Hessian is via its product with the gradient direction. This will become useful for the L-BFGS algorithm described below, since we don't need to represent the Hessian approximation in memory. If you want to see these abstractions in action, here's a link to a Java 8 and golang implementation I've written.
Behave like a Hessian
What form should QuasiUpdate take? Well, if we have QuasiUpdate always return the identity matrix (ignoring its inputs), then the search direction is always ∇f_n, and this corresponds to simple gradient descent. While this actually yields a valid procedure which will converge to x∗ for convex f, intuitively this choice of QuasiUpdate isn't attempting to capture second-order information about f.
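In terms of the QuasiNewtonApproximation interface above, this 'always the identity' policy is just the following sketch (the class name is mine):

```java
// The interface above, repeated here so this sketch compiles on its own.
interface QuasiNewtonApproximation {
    void update(double[] deltaX, double[] deltaGrad);
    double[] inverseHessianMultiply(double[] direction);
}

// Ignores its inputs and always acts as the identity matrix, so the
// search direction is just the gradient: plain gradient descent.
public class IdentityApproximation implements QuasiNewtonApproximation {
    public void update(double[] deltaX, double[] deltaGrad) {
        // nothing to learn: H^{-1} stays I forever
    }
    public double[] inverseHessianMultiply(double[] direction) {
        return direction.clone(); // I * d = d
    }
}
```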
Let's think about our choice of H_n as defining an approximation of f near x_n:

h_n(d) = f(x_n) + dᵀ g_n + ½ dᵀ H_n d
Secant Condition
A good property for h_n(d) would be that its gradient agrees with that of f at x_n and x_{n−1}. In other words, we'd like to ensure:

∇h_n(x_n) = g_n
∇h_n(x_{n−1}) = g_{n−1}

Subtracting the two equations above:

∇h_n(x_n) − ∇h_n(x_{n−1}) = g_n − g_{n−1}

Using the gradient of h_n(·) and canceling terms, we get

H_n (x_n − x_{n−1}) = g_n − g_{n−1}

This yields the so-called 'secant condition', which ensures that H_n behaves like the Hessian at least for the difference x_n − x_{n−1}. Assuming H_n is invertible (which is true if it is psd), multiplying both sides by H_n^{-1} yields

H_n^{-1} y_n = s_n

where y_n is the difference in gradients and s_n is the difference in inputs.
Symmetric
Recall that the Hessian represents the matrix of second-order partial derivatives: H_{i,j} = ∂²f/(∂x_i ∂x_j). The Hessian is symmetric since the order of differentiation doesn't matter.
The BFGS Update
Intuitively, we want H_n to satisfy the two conditions above:

- The secant condition holds for s_n and y_n
- H_n is symmetric

Given these conditions, we'd like to make the most conservative change relative to H_{n−1}^{-1}. This is reminiscent of the MIRA update, where we have conditions on any good solution but, all other things being equal, want the 'smallest' change:

min_{H^{-1}} ‖H^{-1} − H_{n−1}^{-1}‖²
s.t. H^{-1} y_n = s_n
     H^{-1} is symmetric
The norm ‖·‖ used here is the weighted Frobenius norm. The solution to this optimization problem is given by

H_{n+1}^{-1} = (I − ρ_n s_n y_nᵀ) H_n^{-1} (I − ρ_n y_n s_nᵀ) + ρ_n s_n s_nᵀ

where ρ_n = (y_nᵀ s_n)^{-1}. Proving this is relatively involved and mostly symbol crunching; I don't know of an intuitive way to derive it, unfortunately.
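Although the derivation is involved, the secant condition itself is easy to verify numerically. Below is a small dense-matrix sketch of one BFGS update (illustrative only, with my own class and method names; practical implementations avoid forming H^{-1} explicitly):

```java
public class BfgsUpdate {
    // One BFGS update of a dense inverse-Hessian estimate:
    // H_{n+1}^{-1} = (I - rho s y^T) H_n^{-1} (I - rho y s^T) + rho s s^T
    // with rho = 1 / (y^T s).
    public static double[][] update(double[][] hInv, double[] s, double[] y) {
        int n = s.length;
        double rho = 1.0 / dot(y, s);
        // a = I - rho s y^T,  b = I - rho y s^T
        double[][] a = new double[n][n], b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = (i == j ? 1 : 0) - rho * s[i] * y[j];
                b[i][j] = (i == j ? 1 : 0) - rho * y[i] * s[j];
            }
        double[][] next = matMul(matMul(a, hInv), b);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                next[i][j] += rho * s[i] * s[j];   // rank-one correction
        return next;
    }

    static double dot(double[] u, double[] v) {
        double r = 0;
        for (int i = 0; i < u.length; i++) r += u[i] * v[i];
        return r;
    }

    static double[][] matMul(double[][] p, double[][] q) {
        int n = p.length;
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    r[i][j] += p[i][k] * q[k][j];
        return r;
    }

    static double[] matVec(double[][] m, double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += m[i][j] * v[j];
        return r;
    }
}
```

Multiplying the updated matrix by y recovers s exactly, for any starting psd estimate: that's the secant condition at work.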
This update is known as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, named after its original authors. Some things worth noting about it:

- H_{n+1}^{-1} is positive definite (psd) when H_n^{-1} is. Assuming our initial estimate H_0^{-1} is psd, it follows by induction that each inverse Hessian estimate is as well. Since we can choose any psd H_0^{-1} we want, including the identity matrix I, this is easy to ensure.

- The above also specifies a recurrence relationship between H_{n+1}^{-1} and H_n^{-1}. We only need the history of the s_n and y_n to reconstruct H_n^{-1}.

The last point is significant since it yields a procedural algorithm for computing H_n^{-1} d, for a direction d, without ever forming the H_n^{-1} matrix.
Repeatedly applying the recurrence above, we have:

BFGSMultiply(H_0^{-1}, {s_k}, {y_k}, d):
  r ← d
  // Compute right product
  for i = n, …, 1:
    α_i ← ρ_i s_iᵀ r
    r ← r − α_i y_i
  // Compute center
  r ← H_0^{-1} r
  // Compute left product
  for i = 1, …, n:
    β ← ρ_i y_iᵀ r
    r ← r + (α_i − β) s_i
  return r
Since the only use for H_n^{-1} is via the product H_n^{-1} g_n, this procedure is all we need to use the BFGS approximation within QuasiNewton.
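The two-loop recursion above translates almost directly to Java. A sketch, assuming (my choice) that the s_k and y_k histories are stored oldest-first in lists and that the center matrix H_0^{-1} is the identity:

```java
import java.util.List;

public class BfgsMultiply {
    // Compute H_n^{-1} d from the s_k / y_k history alone, via the
    // two-loop recursion -- no matrix is ever formed. Histories are
    // oldest-first; the center H_0^{-1} is taken to be the identity.
    public static double[] apply(List<double[]> s, List<double[]> y, double[] d) {
        int n = s.size();
        double[] r = d.clone();
        double[] alpha = new double[n];
        double[] rho = new double[n];
        for (int i = 0; i < n; i++) rho[i] = 1.0 / dot(y.get(i), s.get(i));
        // Right product: newest to oldest
        for (int i = n - 1; i >= 0; i--) {
            alpha[i] = rho[i] * dot(s.get(i), r);
            axpy(r, -alpha[i], y.get(i));
        }
        // Center: r <- H_0^{-1} r is a no-op for the identity
        // Left product: oldest to newest
        for (int i = 0; i < n; i++) {
            double beta = rho[i] * dot(y.get(i), r);
            axpy(r, alpha[i] - beta, s.get(i));
        }
        return r;
    }

    static double dot(double[] u, double[] v) {
        double t = 0;
        for (int i = 0; i < u.length; i++) t += u[i] * v[i];
        return t;
    }

    static void axpy(double[] r, double c, double[] v) {
        for (int i = 0; i < r.length; i++) r[i] += c * v[i];
    }
}
```

As a sanity check, with a single (s, y) pair the result of multiplying by y is s, matching the secant condition.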
L-BFGS: BFGS on a memory budget
The BFGS quasi-Newton approximation has the benefit of not requiring us to be able to analytically compute the Hessian of a function. However, we still must maintain a history of the s_n and y_n vectors for each iteration. Since one of the core concerns with the NewtonRaphson algorithm was the memory required to maintain a Hessian, BFGS doesn't fully address that: our memory use can still grow without bound.

The L-BFGS algorithm, named for limited-memory BFGS, simply truncates the BFGSMultiply update to use the last m input differences and gradient differences. This means we only need to store s_n, s_{n−1}, …, s_{n−m+1} and y_n, y_{n−1}, …, y_{n−m+1} to compute the update. The center product can still use any symmetric psd matrix H_0^{-1}, which can also depend on the {s_k} and {y_k}.
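The memory cap itself is just bounded bookkeeping over the delta history. A minimal sketch (the class name, the deque representation, and the example value of m are my choices):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LbfgsHistory {
    private final int m;  // max number of (s, y) pairs kept
    private final Deque<double[]> s = new ArrayDeque<>();
    private final Deque<double[]> y = new ArrayDeque<>();

    public LbfgsHistory(int m) { this.m = m; }

    // Record the newest input/gradient deltas, dropping the oldest
    // pair once more than m are stored.
    public void add(double[] deltaX, double[] deltaGrad) {
        s.addLast(deltaX);
        y.addLast(deltaGrad);
        if (s.size() > m) {
            s.removeFirst();
            y.removeFirst();
        }
    }

    public int size() { return s.size(); }
}
```

A two-loop multiply over this bounded history then costs O(m · dim) time and memory per iteration, regardless of how many iterations have run.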
L-BFGS variants
There are lots of variants of L-BFGS which get used in practice. For non-differentiable functions, there is an orthant-wise variant which is suitable for training L1-regularized losses. One of the main reasons not to use L-BFGS is in very large data settings, where an online approach can converge faster. There are in fact online variants of L-BFGS, but to my knowledge none have consistently outperformed SGD variants (including AdaGrad or AdaDelta) for sufficiently large data sets.