
This material has been adapted from material by Martin Robinson from the "Scientific Computing" module of the SABS R³ Center for Doctoral Training.

This course material was developed as part of UNIVERSE-HPC, which is funded through the SPF ExCALIBUR programme under grant number EP/W035731/1

Creative Commons License

Line Search Methods

Gradient descent

One of the simplest local optimisation algorithms is gradient descent. It is initialised at some point in parameter space $a_0$, and at each iteration the function $f(x)$ is reduced by following the direction of steepest descent $-\nabla f(a)$:

$$a_{n+1} = a_n - \gamma \nabla f(a_n)$$

This is an example of an important class of algorithms called line search methods. These algorithms choose a search direction $p_k$ at each iteration $k$, and search along the 1D line from the initial point $a_k$ to a new point

$$a_{k+1} = a_k + \alpha p_k$$

with a lower function value. The problem at each iteration becomes a one-dimensional optimisation problem along $p_k$ to find the optimal value of $\alpha$. Each line search algorithm is thus defined by how it chooses both the search direction $p_k$ and the optimal $\alpha$.

[Animation illustrating gradient descent]
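As a minimal sketch of the fixed-step gradient descent update above (not part of the original course code), the quadratic test function, step size $\gamma$ and tolerance below are arbitrary choices for illustration.

```python
import numpy as np

def gradient_descent(grad_f, a0, gamma=0.1, tol=1e-8, max_iter=1000):
    """Fixed-step gradient descent: a_{n+1} = a_n - gamma * grad_f(a_n)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(a)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (almost) zero
            break
        a = a - gamma * g             # step in the steepest descent direction
    return a

# Example: f(x, y) = x^2 + 10 y^2, whose gradient is (2x, 20y)
grad_f = lambda a: np.array([2.0 * a[0], 20.0 * a[1]])
print(gradient_descent(grad_f, [2.0, 1.0], gamma=0.05))
```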

Plateaus with low gradient

An obvious downside to simple gradient descent can be seen for functions which have regions of zero or small gradient, or plateaus. Here a gradient descent algorithm with a constant $\gamma$ will proceed very slowly, if at all. This motivates another important line search algorithm, Newton's method.

The Newton direction $p^N_k$ can be derived by considering the second-order Taylor expansion of the function $f(x)$:

$$f(a_k + p) \approx f(a_k) + p^T \nabla f(a_k) + \frac{1}{2} p^T \nabla^2 f(a_k) p = m_k(p).$$

We find the value of $p$ that minimises $m_k(p)$ by setting the derivative of $m_k$ to zero, $\nabla m_k(p) = \nabla f(a_k) + \nabla^2 f(a_k) p = 0$, leading to

$$p_k^N = - (\nabla^2 f(a_k))^{-1} \nabla f(a_k)$$

Unlike steepest descent, Newton's method has a natural step length $\alpha \approx 1$, which is suitable for a wide variety of problems and can quickly cross areas of low gradient. Naturally, since the algorithm is based on a second-order approximation of the function $f$, it works better if this approximation is reasonably accurate.

Newton's method can be used as long as the inverse of the second derivative of the function, $(\nabla^2 f(a_k))^{-1}$, exists (e.g. it will always exist for a positive definite $\nabla^2 f$). However, even when this inverse does exist it is possible that the direction $p^N_k$ does not satisfy the descent condition $f(a_k + \alpha p^N_k) < f(a_k)$ (or equivalently $\nabla f(a_k)^T p^N_k < 0$), so many modifications to Newton's method, falling under a class of methods called quasi-Newton methods, have been proposed to satisfy this descent condition.
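In practice the inverse Hessian is rarely formed explicitly; the Newton direction is obtained by solving the linear system $\nabla^2 f(a_k)\, p = -\nabla f(a_k)$. A minimal sketch, assuming the user supplies callables for the gradient and Hessian (the names and test function here are illustrative):

```python
import numpy as np

def newton_direction(grad, hess, a_k):
    """Newton direction p_k^N, found by solving hess(a_k) @ p = -grad(a_k)
    rather than explicitly inverting the Hessian."""
    return np.linalg.solve(hess(a_k), -grad(a_k))

# Example: f(x, y) = x^2 + 10 y^2 (gradient and Hessian written by hand)
grad = lambda a: np.array([2.0 * a[0], 20.0 * a[1]])
hess = lambda a: np.diag([2.0, 20.0])
a_k = np.array([2.0, 1.0])
print(a_k + newton_direction(grad, hess, a_k))  # one full Newton step reaches the minimum
```

Solving the linear system is cheaper and more numerically stable than computing the matrix inverse, and for a quadratic function a single Newton step with $\alpha = 1$ lands exactly on the minimum.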

Quasi-Newton methods do not require the (often onerous) calculation of the Hessian $\nabla^2 f(x)$ as Newton's method does; instead they form an approximation to the Hessian $B_k \approx \nabla^2 f(a_k)$ that is updated at each step using the information given by the gradient evaluations $\nabla f(a_k)$. Two popular methods of performing this update are the symmetric-rank-one (SR1) formula and the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) formula. Once the approximation $B_k$ is formed, the search direction is calculated via

$$p_k = -B_k^{-1} \nabla f(a_k)$$
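As a rough sketch of how such an update works, the BFGS formula can be written directly in terms of the step $s_k = a_{k+1} - a_k$ and the gradient change $y_k = \nabla f(a_{k+1}) - \nabla f(a_k)$. Practical implementations usually update the inverse approximation instead and skip the update when $y_k^T s_k$ is too small; those safeguards are omitted here for brevity.

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B_k, given the step
    s = a_{k+1} - a_k and the gradient change y = grad_f(a_{k+1}) - grad_f(a_k)."""
    Bs = B @ s
    return (B
            - np.outer(Bs, Bs) / (s @ Bs)   # remove curvature along the old step
            + np.outer(y, y) / (y @ s))     # add curvature measured from the gradients

def quasi_newton_direction(B, grad_a):
    """Search direction p_k = -B_k^{-1} grad_f(a_k), via a linear solve."""
    return np.linalg.solve(B, -grad_a)
```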

For more details of other line search methods, please see Chapter 3 of the textbook by Nocedal and Wright, or the other textbooks listed at the end of this lesson. Finally, it should be noted that the conjugate gradient method can also be used for non-linear optimisation, where the search direction is given by

$$p_k = -\nabla f(a_k) + \beta_k p_{k-1}$$
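The choice of the scalar $\beta_k$ distinguishes the different non-linear conjugate gradient variants; one standard choice (not specified in the text above) is the Fletcher-Reeves formula

$$\beta_k^{FR} = \frac{\nabla f(a_k)^T \nabla f(a_k)}{\nabla f(a_{k-1})^T \nabla f(a_{k-1})},$$

with the Polak-Ribière formula being a common alternative; see Chapter 5 of Nocedal and Wright for details.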

Step length

In line search methods, choosing the step length $\alpha_k$ is a non-trivial task. Ideally we would want to choose $\alpha_k$ to minimise the function along the one-dimensional search direction $p_k$. That is, we wish to minimise

$$\phi(\alpha_k) = f(a_k + \alpha_k p_k), \quad \alpha_k > 0.$$

In general it is too expensive to do this minimisation exactly, so approximate methods are used in which multiple candidate values of $\alpha_k$ are tried, stopping when one is found that satisfies a set of conditions. There are two main sets of conditions used: the Wolfe conditions and the Goldstein conditions.

The first of the two Wolfe conditions is the sufficient decrease condition, which ensures that the reduction in the function value is proportional to the step length $\alpha_k$ and the gradient in the direction of the step:

$$f(a_k + \alpha_k p_k) \le f(a_k) + c_1 \alpha_k \nabla f(a_k)^T p_k.$$

The second Wolfe condition is the curvature condition, which prevents unacceptably short steps by ensuring that the slope of $\phi$ is greater than some constant $c_2$ times the initial slope $\phi'(0)$:

$$\nabla f(a_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(a_k)^T p_k,$$

where $c_2 \in (c_1, 1)$. Typical values are $c_1 = 10^{-4}$ and $c_2 = 0.9$. The strong Wolfe conditions restrict the gradient $\phi'$ to be small, so as to exclude points that are too far from stationary points of $\phi$:

$$f(a_k + \alpha_k p_k) \le f(a_k) + c_1 \alpha_k \nabla f(a_k)^T p_k,$$
$$|\nabla f(a_k + \alpha_k p_k)^T p_k| \le c_2 |\nabla f(a_k)^T p_k|.$$
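To make these conditions concrete, the following sketch checks the strong Wolfe conditions for a candidate step length. The callables `f` and `grad` are illustrative assumptions (any function and gradient returning NumPy arrays will do), and the defaults for `c1` and `c2` are the typical values quoted above.

```python
def satisfies_strong_wolfe(f, grad, a_k, p_k, alpha, c1=1e-4, c2=0.9):
    """Return True if the step length alpha satisfies the strong Wolfe conditions."""
    phi0 = f(a_k)                  # phi(0) = f(a_k)
    dphi0 = grad(a_k) @ p_k        # phi'(0) = grad f(a_k)^T p_k
    a_new = a_k + alpha * p_k
    sufficient_decrease = f(a_new) <= phi0 + c1 * alpha * dphi0
    curvature = abs(grad(a_new) @ p_k) <= c2 * abs(dphi0)
    return sufficient_decrease and curvature
```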

The Goldstein conditions are similar in spirit to the Wolfe conditions, and are formed from the two inequalities

$$f(a_k) + (1 - c) \alpha_k \nabla f(a_k)^T p_k \le f(a_k + \alpha_k p_k) \le f(a_k) + c \alpha_k \nabla f(a_k)^T p_k,$$

with $0 < c < 1/2$. The first inequality prevents small step sizes, while the second is the same sufficient decrease condition as in the Wolfe conditions. The Goldstein conditions are often used in Newton-type methods, but for quasi-Newton methods the Wolfe conditions are preferred. The diagrams from the text by Nocedal and Wright illustrate the two conditions.

[Figure: Wolfe versus Goldstein conditions]
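An analogous sketch for the Goldstein conditions, again with illustrative callables and a value of $c$ chosen from the permitted interval $(0, 1/2)$:

```python
def satisfies_goldstein(f, grad, a_k, p_k, alpha, c=0.25):
    """Return True if the step length alpha satisfies the Goldstein conditions."""
    phi0 = f(a_k)
    dphi0 = grad(a_k) @ p_k        # grad f(a_k)^T p_k, negative for a descent direction
    phi_alpha = f(a_k + alpha * p_k)
    not_too_short = phi0 + (1 - c) * alpha * dphi0 <= phi_alpha   # rules out overly small steps
    sufficient_decrease = phi_alpha <= phi0 + c * alpha * dphi0
    return not_too_short and sufficient_decrease
```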

Algorithms for choosing candidate step length values $\alpha_k$ can be complicated, so we will only mention one of the simplest here: the backtracking method. This approach implicitly satisfies the condition preventing excessively small $\alpha_k$, and only repeatedly tests the sufficient decrease condition that appears in both the Wolfe and Goldstein conditions.

Backtracking algorithm

Choose $\bar{\alpha} > 0$, $\rho \in (0, 1)$, $c \in (0, 1)$

$\alpha_k := \bar{\alpha}$

repeat until $f(a_k + \alpha_k p_k) \le f(a_k) + c \alpha_k \nabla f(a_k)^T p_k$

$\alpha_k := \rho \alpha_k$

end repeat.
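A direct Python translation of this algorithm might look like the following sketch. The inputs are assumed to be NumPy arrays and callables, the defaults for `rho` and `c` are illustrative, and the iteration cap is an added safeguard rather than part of the algorithm as stated.

```python
def backtracking_line_search(f, grad, a_k, p_k, alpha_bar=1.0, rho=0.5, c=1e-4,
                             max_iter=50):
    """Shrink alpha by a factor rho until the sufficient decrease condition
    f(a_k + alpha p_k) <= f(a_k) + c alpha grad f(a_k)^T p_k is satisfied."""
    alpha = alpha_bar
    f_k = f(a_k)
    slope = grad(a_k) @ p_k        # grad f(a_k)^T p_k, negative for a descent direction
    for _ in range(max_iter):
        if f(a_k + alpha * p_k) <= f_k + c * alpha * slope:
            break
        alpha *= rho               # alpha := rho * alpha
    return alpha
```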

Software

  • Scipy has a wide variety of local optimisation algorithms, mostly line search and trust region methods, in scipy.optimize. There are 14 local minimisers, so we won't list them all here!

  • It is worth noting that Scipy includes the line_search function, which allows you to use their line search satisfying the strong Wolfe conditions with your own custom search direction (see the sketch after this list).

  • Scipy also includes a HessianUpdateStrategy, which provides an interface for specifying an approximate Hessian for use in quasi-Newton methods, along with two implementations BFGS and SR1.
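For example, one steepest-descent step on the Rosenbrock function using Scipy's strong-Wolfe line search might look like the sketch below; scipy.optimize.rosen and rosen_der provide the test function and its gradient, and the starting point is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

x_k = np.array([-1.2, 1.0])
p_k = -rosen_der(x_k)              # steepest descent direction at x_k

# line_search returns the step length satisfying the strong Wolfe conditions,
# followed by bookkeeping values (evaluation counts, function values, slope)
alpha, *_ = line_search(rosen, rosen_der, x_k, p_k)

if alpha is not None:              # alpha is None if the line search fails
    x_next = x_k + alpha * p_k
    print(alpha, x_next, rosen(x_next))
```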

Problems

Line Search Problems

  1. Program the steepest descent and Newton algorithms using the backtracking line search. Use them to minimise the Rosenbrock function. Set the initial step length $\alpha_0 = 1$ and print the step length used by each method at each iteration. First try the initial point $x_0 = (1.2, 1.2)^T$ and then the more difficult starting point $x_0 = (-1.2, 1)^T$.

  2. Plot the function surface using matplotlib and overlay the line search segments so you can visualise the progress of your algorithm. Observe the difference between the algorithms when the gradient of the Rosenbrock function is low (i.e. at the bottom of the curved valley).

  3. Repeat (1) and (2) above using the line search implemented in Scipy (scipy.optimize.line_search), which uses the strong Wolfe conditions.