Error Bound for Gradient Descent with Approximate Gradient

When the approximate gradient has bounded error

By Christian | March 15, 2018

Hello! So this semester has been a fairly busy one for me and so I have not made much time to get anything new written.. until now!

Lately I've been having some real fun with optimization, especially convex optimization and some duality, but I recently got posed an interesting problem of bounding the error between an iterate and the local minimum for a quadratic form when using an approximate gradient instead of an exact one. I thought this was a fairly fun problem to investigate and thought I would share my proof and some numerical results! So let's get to it..

alt text

First, let us take a look at the recursive gradient descent approach in question. Consider the steepest descent method with bounded error:$\newcommand{\norm}[1]{\left \lVert #1 \right \rVert}$ $\newcommand{\Real}{\mathbb{R}}$

\begin{align*} x_{k+1} &= x_k - s \left( \nabla f(x_k) + \epsilon_k\right) \end{align*}

where $s$ is a positive constant stepsize, $\epsilon_k$ is an error term satisfying $\norm{\epsilon_k} \leq \delta$ for all non-negative integers $k$, and $f$ is the positive definite quadratic form defined as

\begin{align*} f(x) = \frac{1}{2} (x - x^*)^T Q (x - x^*) \end{align*}

We will also let $q = \max \lbrace|1 - s \lambda_{min}(Q)|,|1 - s \lambda_{max}(Q)| \rbrace$. With these definitions, we will aim to prove the following main result:

Main Theorem
If $q < 1$, then under the gradient descent sequence above with bounded error, we have that \begin{align*} \norm{x_k - x^*} \leq \frac{s \delta}{1 - q} + q^k \norm{x_0 - x^*} \end{align*} where $x_0$ is the starting point for the iterate sequence and $x^*$ is the local optima.

To end up with this main theorem, we will first prove some useful lemmas. To this end, we start with the first lemma.

Lemma 1
Given a value $0 \leq c < 1$ and that \begin{align*} \norm{x_{k+1} - x^*} \leq s \delta + c \norm{x_k - x^*} \end{align*} we can find a bound for $\norm{x_k - x^*}$ to be \begin{align*} \norm{x_{k} - x^*} \leq \frac{s \delta}{1 - c} + c^k \norm{x_0 - x^*} \end{align*}
\begin{align*} \norm{ x_{k} - x^* } &\leq s \delta + c \norm{ x_{k-1} - x^* } \\ &\leq \left(1 + c\right)s \delta + c^2 \norm{ x_{k-2} - x^* } \\ &\;\;\vdots \\ &\leq \left(\sum_{i=0}^{k} c^i\right) s \delta + c^{k} \norm{ x_{0} - x^* } \\ &\leq \left(\sum_{i=0}^{\infty} c^i\right) s \delta + c^{k} \norm{ x_{0} - x^* } \\ &= \frac{s \delta}{1-c} + c^{k} \norm{ x_{0} - x^* } \\ \end{align*}
Lemma 2
For some symmetric, positive definite matrix $A$ and positive scalar $s$, the following is true: \begin{align*} \norm{I - s A} &\leq \max \lbrace |1 - s \lambda_{min}(A)|, |1 - s \lambda_{max}(A)|\rbrace \end{align*}
Recall that some positive definite matrix $A = U^T \Lambda U \in \Real^{n \times n}$ where $U$ is a unitary matrix of eigenvectors and $\Lambda$ is a diagonal matrix with positive eigenvalues. Recall as well that for a matrix 2-norm, $\norm{B} = \sqrt{\lambda_{max}(B^T B)}$ for some matrix $B$. With that, we can proceed with the proof. \begin{align*} \norm{ I - s A } &= \norm{U^T U - s U^T \Lambda U} \\ &= \norm{U^T \left(I - s \Lambda \right)U} \\ &\leq \norm{U^T} \norm{I - s \Lambda} \norm{U} \\ &= \norm{I - s \Lambda} \\ &= \sqrt{\max \lbrace (1 - s \lambda_1)^2, \cdots, (1 - s \lambda_n)^2\rbrace} \\ &= \max \lbrace |1 - s \lambda_1|, |1 - s \lambda_2|\cdots, |1 - s \lambda_n|\rbrace \\ &= \max \lbrace |1 - s \lambda_{min}(A)|, |1 - s \lambda_{max}(A)|\rbrace \end{align*}

Proof of Main Theorem

Our goal in this proof is to arrive at the end result by arriving at statements that allow us to benefit from Lemma 1 and Lemma 2. With that said, we can proceed with this proof as follows:
\begin{align*} \norm{x_{k+1} - x^*} &= \norm{\left(x_k - x^*\right) - s \left(\nabla f(x_k) + \epsilon_k\right)} \\ &= \norm{\left(x_k - x^*\right) - s \left(Q \left(x_k - x^*\right) + \epsilon_k\right)} \\ &= \norm{\left(I - s Q\right)\left(x_k - x^*\right) - s \epsilon_k } \\ &\leq \norm{\left(I - s Q\right)\left(x_k - x^*\right)} + s \norm{\epsilon_k} \\ &\leq \norm{ I - s Q }\norm{ x_k - x^* } + s \delta \\ &\leq \max\lbrace |1 - s \lambda_{max}(Q)|,|1 - s \lambda_{min}(Q)|\rbrace \norm{ x_k - x^* } + s \delta \\ &= q \norm{ x_k - x^* } + s \delta \\ \end{align*} Since we assume we choose $s$ to be small enough such that $q < 1$, we can use Lemma 1 to further show that \begin{align*} \norm{x_{k+1} - x^*} &\leq q \norm{ x_k - x^* } + s \delta \\ &\leq \frac{s \delta}{1-q} + q^{k+1} \norm{ x_{0} - x^* } \end{align*} Thus showing the result we want!

Numerical Methods

I thought it could be cool to do a numerical experiment to see how the bound compares to the convergence in practice. To do this experiment, a noise vector $\epsilon$ was added onto the exact gradient of the quadratic form for a random, positive definite $Q$ such that $\left \lVert \epsilon \right \rVert \leq \delta$ for some $\delta$ that is specified. Multiple sequences were run with different starting random seeds and the plot below is a visualization of the convergence results against the bound.

alt text

Based on the figure above, it looks like the bound works out nicely! Some observations to note are the following. As $q \rightarrow 1$, the bound term independent of $k$ in the inequality becomes quite large and the convergence of the $q^k$ term to $0$ slows down. This implies that the spectrum of $s Q$ are within a unit ball centered around a value of $1$ and that the extreme values of $s \lambda_{min}$ and $s \lambda_{max}$ are near the boundary of this ball. $q \rightarrow 1$ also means we are approaching a situation where we will diverge as the number of iterations approach infinity, so it makes sense that things would be bounded more and converge more slowly when $q$ is close to $1$.


In any event, we have seen that if we approximate our gradient in a way that is bounded (say using Finite Differences or mini-batch estimates in Machine Learning), it is possible to bound our convergence error and get a clearer idea of what to expect in terms of run-time, long term accuracy, and potentially more! Thus, this is a pretty neat result and something to keep in mind!