近几年优化方法因机器学习变得流行起来,从随机梯度下降,BFGS到adam,大多数机器学习问题能被抽象化为优化问题。所以想要真正理解机器学习算法,熟悉各类优化算法是如何找到最优解,以及面对各类优化问题我们应该如何选择是非常重要的。因此我将在UCI Optimization课程中所学到的东西记录下来,用于加深理论理解,同时也分享给同样感兴趣的人。
Optimization has been a hit recently because of the surge of machine learning and deep learning. From stochastic gradient, BFGS to adam, the genuie problem of most machine learning problem can be melt down to a optimization problem, it’s essential for us to understand how do these methods lead us to the optimal, and which methods to choose when facing different kind of optimization problems. Therefore I took down what I’ve learnt from the optimization course to solidify my theorical foundation and gain a deeper understanding of what happend behind all these optimization process, as well as sharing and discussing with you.
Historical Sketch
优化在研究和工业界中应用十分广泛;早在1960年非线性优化就被Schmit应用于结构设计,1961被Johnson应用于工程优化。如今应用从验证蛋白分子结构到追踪电磁波遍地开花,数十年来飞机机翼的大小都是通过优化方法决定。
Optimizations are wildly used in research and industrial; Nonlinear optimization methods were applied in structural design early back in 1960 by Schmit, engineering optimization in 1961 by Johnson. Today, applications are everywhere, from identifying structures of protein molecules to tracing electromagnetic rays, and optimization has been used for decades in sizing airplane wings.
Gradient-based Method
| When | Who | What |
|---|---|---|
| 1847 | Cauchy | First gradient method |
| 1943 | Courant | Penalty |
| 1951 | Karush, Kuhn, Tucker | KKT optimality conditions for constrained problems |
| 1964 | Fletcher and Reeves | Conjugate gradient (unconstrained method) |
| 1971 | Brent and others | hybrid polynomial-interval methods |
| Interval search using Golden section & Fibonacci numbers | ||
| 1970s | Sequential quadratic programming(SQP) for constrained minimizations | |
| 1984 | Karmarkar | Interior methods for linear programming |
Non-gradient Method
| When | Who | What |
|---|---|---|
| 1952 | Bellmen | Dynamic programming |
| 1953 | Metropolis | Simulated annealing |
| 1960 | Rosenbrock | Orthogonal directions |
| 1961 | Hooke and Jeeves | Pattern search |
| 1964 | Powell | Conjugate directions |
| 1965 | Nelder and Meade | Simplex method, |
| 1967 | Duffin, Peterson, Zener | Geometric programming |
| 1975 | Holland | Genetic algorithms |
Optimization modeling
Most engineering problems can be expressed as minimizing (add a minus if maximizing), which has a general form:
minimize $ f(x)$
subject to $g_i(x)\le0, i=1,…,m$
and $h_j(x) = 0, j=1,….,l$
and $x^L\le x\le x^U$
where $x= (x_1,x_2,….,x_n)^T$
Example
Shortest distance from a point to a line
Given a line $$a_0+a_1x_1 + a_2x_2=0$ and a point $x^0=(x_1^0, x_2^0)$$
minimize $$f=(x_1-x_1^2)^2 + (x_2-x_2^0)^2$$
subject to $$h(x) \equiv a_0+a_1x_1+a_2x_2=0$$
In matrix form:
minimize $$f=(\boldsymbol{x} - \boldsymbol{x}^0)^T(\boldsymbol{x} - \boldsymbol{x}^0) $$
subject to $$h(\boldsymbol{x}) \equiv \boldsymbol{a}^T \boldsymbol{x} - b = 0$$
Using method of Lagrange multipliers, obtained $$\boldsymbol{x}^* = \boldsymbol{x}^0 - \frac{(\boldsymbol{a}^T\boldsymbol{x}^0-b)}{\boldsymbol{a}^T\boldsymbol{a}}\boldsymbol{a}$$
Thus the shortest distance d is $$d=\frac{|\boldsymbol{a}^T \boldsymbol{x}^0-b|}{\sqrt{(\boldsymbol{a}^T \boldsymbol{a})}}$$
This problem can be generalized to a point to any dimension hyperplane.
Theoretical Foundation
Weierstrass Theorem
Weierstrass theorem (Extrem value theorem) states that if a real-valued function $f(x)$ is continuous on the closed interval [a,b], then $f(x)$ must attain a maximum and a minimum, each at least once.
i.e. There must exist real numbers m and M such that: $m<f(x)<M$, for all $x\in [a,b]$
Quadratic Forms
Consider $f=x_1^2 - 6x_1x_2+9x_2^2$, we can rewrite it in Quadratic Form:
$$f= (x_1\ x_2) \left[
\begin{matrix}
1 & -3 \
-3 & 9 \
\end{matrix}\right] \left(\begin{matrix}x_1 \ x_2 \end{matrix}\right)$$
More simple as $f=\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} $, $\boldsymbol{x}$ is an n-vector, $\boldsymbol{A}$ is the coeffiecient matrix, and it’s symmetric, also can be expressed in index notation as $\sum_i\sum_jx_ia_{ij}x_j$.
Positive Definite Matrices
What is Positive Definite?
A quadratic form is positive definite if $\boldsymbol{y}^T \boldsymbol{A} \boldsymbol{y} >0$ for any nonzero vector $\boldsymbol{y}$ , and $\boldsymbol{y}^T \boldsymbol{A} \boldsymbol{y} >0$ if and only if $\boldsymbol{y}=0$ .All eigenvalues of a positive definite matrix are strictly positive.
$\boldsymbol{A}$ is negative definite if ($-\boldsymbol{A}$ ) is positive definite.
A quadratic form is positive semidefinite if $\boldsymbol{y}^T \boldsymbol{A} \boldsymbol{y} \ge 0 $ for any nonzero $\boldsymbol{y}$. All eigenvalues of a positive semidefinite matrix are nonnegative.
Sylvester’s test for a positive definite matrix: Denote $\boldsymbol{A}^i$ the submatrix of up-left square matrix with $i$ length, $\boldsymbol{A}$ is positive definite if and only if det($\boldsymbol{A}^i$) > 0, for $i=1,2,…,n$.
$\boldsymbol{C^n}$Continuity
We say the function $f$ is $C^0$ continuous at a point $\boldsymbol{a}$ if: given any sequence {$\boldsymbol{x}_k$} in $D(f)$ which converges to $\boldsymbol{a}$, then $f(\boldsymbol{x}_k)$ must converge to $ f(\boldsymbol{a})$.
$C^1$ and $C^2$ Continuity is considering the continuity of $d(f)$ and $d^2(f)$.
Gradient Vector and Hessian Matrix
Given a function $f(x) \in C^1$, the gradient vector $\nabla f = [\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, …, \frac{\partial f}{\partial x_n}]^T$
Directional derivative of $f$ is the derivative of $f$ in any unit direction $\boldsymbol{s}$ at a point $\boldsymbol{c}$, obtained as: $$D_sf(\boldsymbol{c}) = lim_{t\to0}\frac{1}{t} {f(\boldsymbol{c}+t\boldsymbol{s})-f(\boldsymbol{c})} = \nabla f(\boldsymbol{c})^T\boldsymbol{s}$$
The Hessian Matrix
$$\nabla^2f \equiv \left[ \begin{matrix} \frac{\partial^2f}{\partial x_1 \partial x_1} & \frac{\partial^2f}{\partial x_1 \partial x_2} & … & \frac{\partial^2f}{\partial x_1 \partial x_n}\ \ & \frac{\partial^2f}{\partial x_2 \partial x_2} & … & \frac{\partial^2f}{\partial x_2 \partial x_n}\ &…\ symmetric & & &\frac{\partial^2f}{\partial x_n \partial x_n}\ \end{matrix}\right] $$
The Hessian is related to the convexity of the function, the Hessian of $f$ at a point $\boldsymbol{c}$ is denoted as $\nabla^2f(\boldsymbol{c})$.
Newton’s forward difference formula
Newton’s forward difference formula is a numerical evaluation of gradient.
Newton’s forward difference formula is often utilized to obtain an approximate expression for $\nabla f$ at a given point. The general formula is $$\frac{\partial f}{\partial x_i} \approx \frac{f(x_1^0 , x_2^0 … x_i^0 + \epsilon … x_n^0) - f(\boldsymbol{x}^0)}{\epsilon}, i=1,…n$$
For greater accuracy, the central difference formula choose the different value between $x_i^0-0.5\epsilon$ and $x_i^0+0.5\epsilon$.
Taylor’s Theorem & Quadratic Approximation
1-Dimension: $$q(x) = f(x_0) + f^\prime(x_0)(x-x_0) + \frac{1}{2}f^{\prime\prime}(x_0)(x-x_0)^2$$
n-Dimension: $$f(\boldsymbol{x}) = f(\boldsymbol{x}^0) + \nabla f(\boldsymbol{x}^0)^T(\boldsymbol{x}- \boldsymbol{x}^0) + \frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}^0)^T[\nabla^2f(\boldsymbol{z})](\boldsymbol{x}-\boldsymbol{x}^0)$$
Linearly Independence
Linearly Independence of Vectors: $\boldsymbol{x}_3$ can not be formed by $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$, then $\boldsymbol{x}_3$ is independent of $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$.
Definition: If linear combination $\sum_{k=1}^m\alpha_k\boldsymbol{x}^k=0$ implies that all $\alpha_k$ are zero.
Convexity
Convex set: If for any two points in the set, every point on the line joining the two points is in the set.
$\forall x_1, x_2, 0\leq \alpha \leq1$, the point $\alpha x_1 + (1-\alpha)x_2$ is in the set.
Convex domain: For domain $\Omega$, $ 0\leq \lambda \leq1, \forall \boldsymbol{x},\boldsymbol{y} \in \Omega, \ \ (1-\lambda)\boldsymbol{x} + \lambda\boldsymbol{y} \in \Omega$
Convex function: $\forall \boldsymbol{x}_1. \boldsymbol{x}_2 \in R^d, 0\leq \lambda \leq1$
$$f(\lambda \boldsymbol{x}_1 + (1-\lambda)\boldsymbol{x}_2)\leq \lambda f(\boldsymbol{x}_1) + (1-\lambda) f(\boldsymbol{x}_2)$$
Properties of convex function:
$f^{\prime \prime}(x) \ge 0$ in $\Omega$
$f\subset C^1$, $f$ is convex over $\Omega$ if and only if $\forall x,y \ f(y) \geq f(x) + f^\prime(x)(y-x)$
$f\subset C^1$, $f^\prime(x^\star)(y-x^\star)\ge0$ for every y in S, then $x^\star$ is the global minimum
$f\subset C^2$, $f$ is convex over $\Omega$ <--> $f(y) \geq f(x) + \nabla f(x)^T(y-x), \forall x,y \in \Omega$ -->
$f$ is convex and $x^\star$ is the global minimum if $\nabla f(x^\star) = 0$ (Can be proved by #2)
Necessary & Sufficient Condition for Optimality
Necessary Condition for $x^\star$ to be a local minimum is: $$\nabla f(\bold{x}^\star)=0$$
It’s call a stationary point(驻点), it can be a minimum or maximum, or a saddle point(鞍点).
Sufficient Condition for $x^\star$ to be a strict local minimum are: $$\nabla f(\bold{x}^\star)=0$$, $$\nabla^2f(\bold{x}^\star)$$ is positive definite.
The Hessian of f is positive definite means it’s convex on its domain, in this condition, the stationary point must the the global minimum.
Optimization Methods
According to problem types, we can roughly separate optimization methods into:
Discrete/ Continuous;
Unconstrained/ Constrained;
Subjected to application level, these methods can be categorized by input information:
$C^0$ input: Only know the $f(x)$ itself;
$C^1$ input: know $f(x)$ and $df(x)$ ;
$C^2$ input: know $f(x), df(x)$ and $d^2f(x)$ ;
Optimization Algorithms Implementaion Links
I’ve implemented several optimization methods mentioned in this series, there are detailed hyperparameter settings, stopping criterion choices, test functions and experiments, all in jupyter notebook on my github, links are as follows:
Golden Section & Coordinate Descent
Steepest Descent & Conjugate Gradient
Simulated Annealing
We will explore and implement several traditional optimization methods to get a deeper view in the following essays.
若你觉得我的文章对你有帮助,欢迎点击上方按钮对我打赏
扫描二维码,分享此文章
