Convergence of SGD


By Ali Zindari

Stochastic gradient descent (SGD) is widely used in machine learning (ML) and deep learning (DL). The reason is that computing the gradient over the full dataset is computationally expensive for the massive datasets found in practice, while SGD makes progress at a much lower per-step cost than gradient descent (GD). At each step, the algorithm samples one data point uniformly at random and updates the weights using the gradient of the loss on that data point only. In this blog, I will provide convergence proofs for SGD under different settings.

The goal is to minimize an objective function defined over a dataset of N samples:

`f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x)`

Here is the update rule for SGD:

`x_{t+1} = x_t - \eta_t g_t(x_t)`

where `g_t` is the stochastic gradient at the point `x_t`, i.e. `g_t = \nabla f_i(x_t)` for an index `i` sampled uniformly at random, so that `\mathbb{E} [g_t] = \nabla f(x_t)`. In other words, `g_t` is an unbiased estimator of the true gradient.
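To make the update rule concrete, here is a minimal Python sketch of SGD on a toy least-squares objective. The dataset, the helper `stochastic_grad`, and the fixed step size are illustrative assumptions, not something required by the analysis below.

```python
import numpy as np

# Toy least-squares problem: f(x) = (1/N) * sum_i 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
N, d = 100, 5
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def stochastic_grad(x):
    """Gradient of a single f_i with i sampled uniformly -> unbiased estimate of grad f(x)."""
    i = rng.integers(N)
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)      # x_0
eta = 0.01           # fixed step size, chosen only for illustration
for t in range(1000):
    x = x - eta * stochastic_grad(x)
```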
------------------------------------------------------------------------------------------------------------------

Proof 1: I assume that we want to minimize a function `f: \mathbb{R}^d \rightarrow \mathbb{R}` which is convex and differentiable. I also assume that `\mathbb{E}[||g_t(x_t)||^2] < G^2` and `||x_0 - x^\star||^2 < R^2`.

`||x_{t+1} - x^\star||^2 = ||x_t - \eta_t g_t - x^\star||^2 = ||x_t - x^\star||^2 + \eta_t^2||g_t||^2 -2\eta_t g_t^T(x_t - x^\star) `

`\rightarrow \mathbb{E}[||x_{t+1} - x^\star||^2 \,|\, x_t] = ||x_{t} - x^\star||^2 + \eta_t^2 \mathbb{E}[||g_t||^2 \,|\, x_t] - 2\eta_t \mathbb{E}[g_t \,|\, x_t]^T(x_t - x^\star)`

(conditioned on `x_t`, the term `||x_t - x^\star||^2` is deterministic, so it comes out of the expectation).

`\rightarrow \mathbb{E}[||x_{t+1} - x^\star||^2|x_t] < ||x_{t} - x^\star||^2 + \eta_t^2G^2-2\eta_t \nabla f(x_t)^T(x_t - x^\star)`

From the convexity of `f` we can say:
`f(y) \geq f(x) + \nabla f(x)^T(y-x)`. Setting `y = x^\star` and `x = x_t` gives `\nabla f(x_t)^T(x_t-x^\star) \geq f(x_t)-f(x^\star)`.

`\rightarrow \mathbb{E}[||x_{t+1} - x^\star||^2|x_t] < ||x_{t} - x^\star||^2 + \eta_t^2G^2-2\eta_t[f(x_t) - f(x^\star)]`

Taking total expectation on both sides and assuming from now on a fixed step size `\eta_t = \eta`:

`\rightarrow \mathbb{E}[||x_{t+1} - x^\star||^2] - \mathbb{E}[||x_{t} - x^\star||^2] < \eta^2G^2 - 2\eta \, \mathbb{E}[f(x_t) - f(x^\star)]`

Summing over `t = 0, \dots, T-1`, the left-hand side telescopes:

`\rightarrow \mathbb{E}[||x_{T} - x^\star||^2] - ||x_{0} - x^\star||^2 < T \eta^2G^2 - 2\eta \sum_{t=0}^{T-1} \mathbb{E}[f(x_t) - f(x^\star)]`

`\rightarrow 2\eta \sum_{t=0}^{T-1} \mathbb{E}[f(x_t) - f(x^\star)] < T \eta^2G^2 + ||x_0 - x^\star||^2 - \mathbb{E}[||x_T - x^\star||^2] < T \eta^2G^2 + R^2`

`\rightarrow \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[f(x_t)] - f(x^\star) < \frac{\eta G^2}{2} + \frac{R^2}{2\eta T}`

Now we take the derivative of the right-hand side with respect to `\eta` and set it to zero to find the best fixed step size.

`\frac{d}{d \eta} \left( \frac{\eta G^2}{2} + \frac{R^2}{2\eta T} \right) = \frac{G^2}{2} - \frac{R^2}{2\eta^2 T} = 0 \rightarrow \eta = \frac{R}{G \sqrt{T}}`
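As a quick sanity check of this calculus step, the stationary point (and the value of the bound at it) can also be recovered symbolically. The sketch below is just an illustration using SymPy, with all symbols assumed positive.

```python
import sympy as sp

# Symbols appearing in the bound: step size eta, gradient bound G, radius R, horizon T
eta, G, R, T = sp.symbols('eta G R T', positive=True)

bound = eta * G**2 / 2 + R**2 / (2 * eta * T)      # right-hand side of the bound
eta_star = sp.solve(sp.diff(bound, eta), eta)      # stationary point(s) of the bound
print(eta_star)                                    # [R/(G*sqrt(T))]
print(sp.simplify(bound.subs(eta, eta_star[0])))   # G*R/sqrt(T)
```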

Plugging this step size back into the bound, we get:

`\rightarrow \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[f(x_t)] - f(x^\star) < \frac{RG}{\sqrt{T}}`
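For some numerical intuition, the sketch below runs SGD on a toy least-squares problem with the step size `\eta = R/(G\sqrt{T})` and compares the averaged suboptimality against `RG/\sqrt{T}` for a few horizons `T`. The problem instance and the plug-in estimates of `R` and `G` are assumptions made purely for illustration; a single run only hints at the `1/\sqrt{T}` trend and is not a verification of the expectation bound.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 5
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)           # f(x) = (1/N) sum_i 0.5*(a_i^T x - b_i)^2
x_star = np.linalg.lstsq(A, b, rcond=None)[0]           # minimizer of f

R = np.linalg.norm(x_star)                              # ||x_0 - x*|| with x_0 = 0
grads = (A @ x_star - b)[:, None] * A                   # per-sample gradients at x*
G = np.sqrt(np.mean(np.sum(grads ** 2, axis=1))) + 1.0  # rough plug-in gradient bound

for T in [100, 1000, 10000]:
    eta = R / (G * np.sqrt(T))                          # step size from the proof
    x, avg_gap = np.zeros(d), 0.0
    for t in range(T):
        avg_gap += (f(x) - f(x_star)) / T               # running average of f(x_t) - f(x*)
        i = rng.integers(N)
        x = x - eta * (A[i] @ x - b[i]) * A[i]          # SGD update
    print(T, avg_gap, R * G / np.sqrt(T))               # observed average gap vs. RG/sqrt(T)
```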