This material has been adapted from material by Martin Robinson from the "Scientific Computing" module of the SABS R³ Center for Doctoral Training.

Creative Commons License
This course material was developed as part of UNIVERSE-HPC, which is funded through the SPF ExCALIBUR programme under grant number EP/W035731/1


Trust Region Methods

Saddle points

Saddle points pose a particular challenge in non-linear optimisation, particularly in higher dimensions. The plots below show two examples of saddle points in two dimensions. Like local minima and maxima, these are stationary points where the gradient of the function is zero, $\nabla f = 0$, but the value of the function rises along certain directions and falls along others (left plot). An alternative type of saddle point arises when the Hessian is singular; these are characterised by a plateau around the stationary point, like the monkey saddle depicted in the plot to the right.

Two examples of saddle points

Near the location of the critical point, the function $f$ can be restated using the quadratic form like so (see also Pascanu, Dauphin and Ganguli 2014):

$$f(\mathbf{\theta}^\star + \delta \mathbf{\theta}) = f(\mathbf{\theta}^\star) + \frac{1}{2} \sum_{i=1}^{n_\theta} \lambda_i \delta \mathbf{v}^2_i$$

where $\lambda_i$ is the $i$th eigenvalue of the Hessian, and $\delta \mathbf{v}_i$ is the motion of $\delta \mathbf{\theta}$ along the $i$th eigenvector of the Hessian. If the $i$th eigenvalue is negative/positive then along the $i$th eigenvector the function $f$ achieves a maximum/minimum at $\mathbf{\theta}^\star$.

Gradient descent algorithms move away from or towards $\mathbf{\theta}^\star$ with a step of $-\lambda_i \delta \mathbf{v}_i$ along each eigenvector. For negative eigenvalues the motion is towards lower values of $f$, away from $\mathbf{\theta}^\star$; for positive eigenvalues the motion is towards lower values of $f$, towards $\mathbf{\theta}^\star$. The problem here is the size of the step, which is very small for small values of $|\lambda_i|$.

Newton methods rescale the step size by $\lambda_i$ so that it becomes $-\delta \mathbf{v}_i$. For negative eigenvalues, this has the undesirable characteristic that the method moves towards increasing values of $f$ (i.e. towards the critical point) along the corresponding eigenvectors. Since for positive eigenvalues it also moves towards the critical point, saddle points act as attractors for these types of methods.

Trust region methods restate the optimisation problem as a sequence of optimisations of a second order approximation to $f$ in a local trust region surrounding the current point $a_k$. The exact solution to each of these subproblems can be shown to be $-(\nabla^2 f(a_k) + \lambda_t I)^{-1} \nabla f(a_k)$, where the value of $\lambda_t$ is related to the size of the trust region. In comparison with the previous methods, this is equivalent to moving with a step of $-\frac{\lambda_i}{\lambda_i + \lambda_t}\delta \mathbf{v}_i$ along each eigenvector. As long as $\lambda_t$ is chosen to be larger than the magnitude of the most negative eigenvalue, the direction of each step is always towards lower values of $f$. As long as $\lambda_t$ is small compared with $\lambda_i$, we also avoid the small step sizes associated with gradient descent.
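To make this comparison concrete, the short sketch below (an illustration added here, not part of the original notes) builds a quadratic function with one large positive, one small positive and one negative Hessian eigenvalue, and prints the step each method takes along each eigenvector.

```python
import numpy as np

# Quadratic saddle f(theta) = 0.5 * theta^T H theta, with eigenvalues chosen
# purely for illustration: one large positive, one small positive, one negative.
H = np.diag([5.0, 0.01, -0.5])
theta = np.array([1.0, 1.0, 1.0])      # current point (theta* = 0 is the critical point)

lambdas, V = np.linalg.eigh(H)         # eigenvalues lambda_i and eigenvectors v_i
delta_v = V.T @ theta                  # components of theta - theta* along each v_i

lambda_t = 1.0                         # must exceed |most negative eigenvalue| = 0.5

print(f"{'lambda_i':>10} {'gradient descent':>18} {'Newton':>10} {'trust region':>14}")
for lam, dv in zip(lambdas, delta_v):
    gd = -lam * dv                     # gradient descent: vanishes as lambda_i -> 0
    newton = -dv                       # Newton: always towards the critical point
    tr = -lam / (lam + lambda_t) * dv  # trust region rescaling: always downhill
    print(f"{lam:10.2f} {gd:18.3f} {newton:10.3f} {tr:14.3f}")
```

Along the negative eigenvalue the Newton step moves towards the saddle point while the trust region rescaling keeps the step pointing downhill; along the large positive eigenvalue the trust region step behaves much like a Newton step.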

Trust region methods

Like many line search methods, trust region methods also use the second order Taylor expansion of $f$ around $a_k$

$$f(a_k + p) \approx f(a_k) + g_k^T p + \frac{1}{2} p^T B_k p = m_k(p)$$

where $g_k = \nabla f(a_k)$, and $B_k$ is either an approximation to the Hessian matrix, $B_k \approx \nabla^2 f(a_k)$, or the Hessian itself, $B_k = \nabla^2 f(a_k)$. Trust region methods aim to find the $p$ that minimises $m_k$ in a local trust region $||p|| < \Delta_k$ around the current point $a_k$, where $\Delta_k$ is the trust region radius.
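As a small illustration (the names here are ours, not from the original notes), the quadratic model can be written as a closure that the subproblem solvers discussed below could evaluate:

```python
import numpy as np

def quadratic_model(f_k, g_k, B_k):
    """Return the quadratic model m_k(p) = f(a_k) + g_k^T p + 0.5 p^T B_k p.

    f_k is the function value f(a_k), g_k the gradient and B_k the (approximate)
    Hessian at the current point a_k.
    """
    def m_k(p):
        p = np.asarray(p, dtype=float)
        return f_k + g_k @ p + 0.5 * p @ (B_k @ p)
    return m_k
```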

Solving the minimisation given above is normally done approximately, with different trust region methods varying in how the approximation is achieved. Choosing the trust region radius is fundamental to this class of methods, and is done by comparing the actual reduction in the function value to the reduction predicted by the model

$$\rho_k = \frac{f(a_k) - f(a_k + p_k)}{m_k(0) - m_k(p_k)}.$$

Since $m_k(0) - m_k(p_k)$ is always positive, a negative $\rho_k$ means that the actual function value has increased; the step is rejected and the trust region radius $\Delta_k$ is decreased in order to improve the approximate model $m_k$. If $\rho_k$ is positive but much smaller than one then we do not alter $\Delta_k$. If $\rho_k$ is close to or greater than 1 we can be confident in our model and thus increase $\Delta_k$. The general algorithm for a trust region method (reproduced from the text by Nocedal and Wright cited below) is:

Trust region algorithm

Given $a_0$, $\hat{\Delta} > 0$, $\Delta_0 \in (0, \hat{\Delta})$, and $\nu \in [0, \frac{1}{4})$:

for $k = 0, 1, 2, \ldots$

- Obtain $p_k$ by (approximately) minimising $m_k(p)$ subject to $||p|| < \Delta_k$
- $\rho_k := \frac{f(a_k) - f(a_k + p_k)}{m_k(0) - m_k(p_k)}$
- if $\rho_k < \frac{1}{4}$
  - $\Delta_{k+1} := \frac{1}{4} \Delta_k$
- else
  - if $\rho_k > \frac{3}{4}$ and $||p_k|| = \Delta_k$
    - $\Delta_{k+1} := \min(2 \Delta_k, \hat{\Delta})$
  - else
    - $\Delta_{k+1} := \Delta_k$
- if $\rho_k > \nu$
  - $a_{k+1} := a_k + p_k$
- else
  - $a_{k+1} := a_k$

end for.
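A direct transcription of this loop into Python might look like the sketch below. It is only an outline: the constants follow the algorithm above with illustrative defaults, the names are ours rather than taken from any library, and the subproblem solver (for example the Cauchy point or dogleg steps described next) is passed in as a function.

```python
import numpy as np

def trust_region(f, grad, hess, a0, solve_subproblem,
                 delta0=1.0, delta_max=2.0, nu=0.2, tol=1e-8, max_iter=1000):
    """Generic trust region loop following the template above.

    solve_subproblem(g, B, delta) must return a step p with ||p|| <= delta that
    (approximately) minimises the quadratic model m_k(p).
    """
    a, delta = np.asarray(a0, dtype=float), delta0
    for k in range(max_iter):
        g, B = grad(a), hess(a)
        if np.linalg.norm(g) < tol:              # simple convergence test on the gradient
            break
        p = solve_subproblem(g, B, delta)

        # actual versus predicted reduction in f
        actual = f(a) - f(a + p)
        predicted = -(g @ p + 0.5 * p @ (B @ p))  # m_k(0) - m_k(p)
        rho = actual / predicted

        # update the trust region radius
        if rho < 0.25:
            delta = 0.25 * delta
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max)

        # accept the step only if the model reduction is realised
        if rho > nu:
            a = a + p
    return a
```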

Solving the trust region subproblem

We will describe two algorithms for minimising $m_k(p)$: the Cauchy point and the dogleg methods. The Cauchy point first solves a linear version of $m_k$ defined as

$$p^s_k = \arg\min_{p \in \mathcal{R}^n} f(a_k) + g_k^T p \quad \text{for } ||p|| \le \Delta_k$$

Subsequently, $p^s_k$ is used to find the scalar $\tau_k > 0$ such that

$$\tau_k = \arg\min_{\tau \ge 0} m_k(\tau p_k^s) \quad \text{for } ||\tau p_k^s|| \le \Delta_k$$

Finally, the Cauchy point is given as $p_k^C = \tau_k p_k^s$.

The solution to this problem can be shown to be

$$p_k^C = -\tau_k \frac{\Delta_k}{|| g_k ||} g_k,$$

where

$$\tau_k = \begin{cases} 1 & \text{if } g_k^T B_k g_k \le 0 \\ \min (||g_k||^3 / (\Delta_k g_k^T B_k g_k), 1) & \text{otherwise}. \end{cases}$$
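The closed-form expressions above translate directly into a few lines of code. The sketch below is one possible implementation, with the same signature as the `solve_subproblem` argument used in the loop sketched earlier.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy point step p_k^C for the trust region subproblem."""
    g_norm = np.linalg.norm(g)
    gBg = g @ (B @ g)                  # the curvature term g_k^T B_k g_k
    if gBg <= 0:
        tau = 1.0                      # non-positive curvature: step to the boundary
    else:
        tau = min(g_norm**3 / (delta * gBg), 1.0)
    return -tau * (delta / g_norm) * g
```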

The second method we describe is the dogleg method, which is applicable when $B_k$ is a positive definite matrix. If the original Hessian is positive definite then this method is directly applicable; otherwise one of the quasi-Newton positive definite approximations to the Hessian can be used. The dogleg method is derived by considering the path of the $p$ that minimises $m_k(p)$ as $\Delta_k$ increases, which forms a curved path in parameter space. The method approximates this path with two straight line segments. The first segment follows the steepest descent direction and is given by

$$p_k^U = -\frac{g_k^T g_k}{g_k^T B_k g_k} g_k$$

The second segment follows the path between $p_k^U$ and $p^B_k = -B_k^{-1} g_k$. In the case where $p_k^B$ lies inside the trust region, $||p_k^B|| \le \Delta_k$, then $p_k^B$ can be used without modification. Otherwise the point of intersection with the trust region boundary must be calculated, which can be done by solving the following quadratic equation

$$||p_k^U + (\tau - 1)(p_k^B - p_k^U)||^2 = \Delta_k^2$$

with the full dogleg path, covering both segments, defined by

$$\tilde{p}_k(\tau) = \begin{cases} \tau p_k^U, & 0 \le \tau \le 1, \\ p_k^U + (\tau - 1)(p_k^B - p_k^U), & 1 \le \tau \le 2. \end{cases}$$
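Putting the two segments together, one possible implementation of the dogleg step (assuming, as above, that $B_k$ is positive definite) is sketched below; like the Cauchy point, it can be passed as the subproblem solver of the trust region loop sketched earlier.

```python
import numpy as np

def dogleg(g, B, delta):
    """Dogleg approximation to the trust region subproblem (B assumed positive definite)."""
    p_b = -np.linalg.solve(B, g)                    # p_k^B = -B_k^{-1} g_k
    if np.linalg.norm(p_b) <= delta:
        return p_b                                  # Newton step already inside the region

    p_u = -(g @ g) / (g @ (B @ g)) * g              # unconstrained minimiser along -g_k
    if np.linalg.norm(p_u) >= delta:
        return delta * p_u / np.linalg.norm(p_u)    # first segment leaves the region: truncate it

    # Otherwise find where the second segment crosses the boundary, i.e. solve
    # ||p_u + s (p_b - p_u)||^2 = delta^2 for s = tau - 1 in [0, 1].
    d = p_b - p_u
    a_, b_, c_ = d @ d, 2.0 * (p_u @ d), p_u @ p_u - delta**2
    s = (-b_ + np.sqrt(b_**2 - 4.0 * a_ * c_)) / (2.0 * a_)   # positive root of the quadratic
    return p_u + s * d
```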

Problems

The Dog Leg

Let $f(x) = 10 \left( x_2 - x^2_1 \right)^2 + (1 - x_1)^2$. At $x = (0, -1)$ draw the contour lines of the quadratic model

$$m_k(p) = f(a_k) + g_k^T p + \frac{1}{2} p^T B_k p,$$

assuming that $B_k$ is the Hessian of $f$. Draw the family of solutions of $\min_{p\in \mathcal{R}^n} m_k(p)$ such that $||p|| \le \Delta_k$ as the trust region radius varies from $\Delta_k = 0$ to $\Delta_k = 2$. Repeat this at $x = (0, 0.5)$.

Dogleg method

Write a program that implements the dogleg method. Choose $B_k$ to be the exact Hessian. Apply it to minimise the function from the previous problem, starting from the same two points. If you wish, experiment with the update rule for the trust region by changing the constants in the trust region algorithm given above, or by designing your own rules.
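As a possible starting point (not part of the original exercise), the objective, its gradient and its exact Hessian could be coded as below. `scipy.optimize.minimize` with the `trust-exact` method is used here only as an independent reference to check a hand-written dogleg implementation against, since SciPy's own `dogleg` method requires the Hessian to be positive definite.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    """Objective from the previous problem: f(x) = 10*(x2 - x1^2)^2 + (1 - x1)^2."""
    return 10.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def grad(x):
    return np.array([
        -40.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        20.0 * (x[1] - x[0]**2),
    ])

def hess(x):
    return np.array([
        [-40.0 * (x[1] - x[0]**2) + 80.0 * x[0]**2 + 2.0, -40.0 * x[0]],
        [-40.0 * x[0], 20.0],
    ])

# Reference solutions from SciPy's trust region solver; a hand-written dogleg
# implementation should converge to the same minimiser.
for x0 in ([0.0, -1.0], [0.0, 0.5]):
    res = minimize(f, x0, jac=grad, hess=hess, method='trust-exact')
    print(x0, '->', res.x)
```

Both starting points should converge to the known minimiser $x^\star = (1, 1)$, where $f(x^\star) = 0$.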