Gradient Descent
What does Gradient Descent do
It minimizes a cost function using derivatives, proceeding in epochs (an epoch consists of using the entire training set once to update each parameter).
Adaptive methods
The adaptive methods in gradient descent can automatically adjust the learning rate.
- Newton's method
- computes and uses the full Hessian
- works for convex problems but needs adjustments for nonconvex problems
- preconditioning
- avoids computing the Hessian in its entirety by using only its diagonal entries
Gradient Descent in Neural Networks
Gradient computation
- use forward-propagation to calculate the values of each layer
- use back-propagation to calculate the derivatives
- example of a two-layer logistic regression neural network (by Andrew Ng):

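The two steps above can be sketched for a small two-layer network (a minimal NumPy illustration; the layer sizes and the tanh/sigmoid activations are assumptions, not taken from the original figure):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical shapes: 3 features, 4 hidden units, 1 output, 5 examples.
X = rng.normal(size=(3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
W1, b1 = 0.01 * rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = 0.01 * rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward-propagation: compute the values of each layer.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)                      # predicted probabilities

# Back-propagation: derivatives of the cross-entropy loss, layer by layer.
m = X.shape[1]
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = dZ2.sum(axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)  # tanh'(Z1) = 1 - tanh(Z1)^2
dW1 = dZ1 @ X.T / m
db1 = dZ1.sum(axis=1, keepdims=True) / m
```

Each gradient has the same shape as the parameter it updates, which is a quick sanity check after implementing backprop.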
Challenges and Solutions
Vanishing / Exploding gradients
Challenges
- Vanishing gradients
- the gradients of a neural network's loss function become extremely small
- causes the weights to update very slowly, making it difficult for the network to learn effectively
- Exploding gradients
- gradients become excessively large
- leads to unstable updates, causing the model's parameters to oscillate or diverge during training
Solutions: Initialization (Starting Point for Gradient Descent)
- Initialization refers to how the network’s weights are set before training begins.
- Common initialization strategies:
- Xavier/Glorot Initialization: tailored for tanh activations; for balanced signal flow
- He Initialization: tailored for ReLU activations.
- Pretrained weights: from transfer learning or self-supervised learning.
- bias terms are often initialized to zero or small constants.
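The strategies above can be sketched in NumPy (the fan-in/fan-out numbers are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256   # example layer dimensions

# Xavier/Glorot: variance 2 / (fan_in + fan_out), balanced signal flow for tanh.
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)),
                      size=(fan_out, fan_in))

# He: variance 2 / fan_in, compensating for ReLU zeroing half the activations.
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

# Biases: zeros (or small constants).
b = np.zeros((fan_out, 1))
```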
Optimization is too slow
Challenges
- Model convergence can be slow or stall due to:
- Poor gradient flow
- Suboptimal hyperparameters
- Lack of normalization or regularization
Solutions
- Gradient checking
- use only for debugging gradients (e.g., to verify backpropagation implementation)
- remember to include any regularization terms in the check
- doesn't work with dropout (disable dropout while gradient checking)
- Hyperparameter Tuning: Hyperparameter Tuning#Hyperparameter Tuning in Deep Learning
- Batch Normalization: Regularization#batch-normalization
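Gradient checking can be sketched with two-sided finite differences (a toy quadratic loss stands in for a real network; all names here are illustrative):

```python
import numpy as np

def loss(w):
    # Toy loss with a known analytic gradient: L(w) = sum(w^2), dL/dw = 2w.
    return np.sum(w ** 2)

def analytic_grad(w):
    return 2 * w

def grad_check(w, eps=1e-7):
    # Two-sided finite differences, one coordinate at a time.
    num = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e.flat[i] = eps
        num.flat[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    ana = analytic_grad(w)
    # Relative error between numerical and analytic gradients.
    return np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
```

A relative error far above roughly 1e-5 usually points to a bug in the backpropagation code.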
Variants of Gradient Descent
Variants Outline:
- [[#Batch gradient descent]]
- [[#Stochastic gradient descent (SGD)]]
- [[#Mini-batch gradient descent]]
- [[#Momentum]]
- [[#RMSprop (Root Mean Square prop)]]
- [[#Adam (Adaptive Moment Estimation)]]
- [[#Learning rate decay]]
Batch gradient descent
- most basic approach
- the loss function is an average of the losses computed on every single example in the dataset
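In code, one batch-gradient-descent epoch uses every example before a single parameter update (a least-squares sketch; the data and learning rate are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w                      # noise-free targets for illustration

w = np.zeros(2)
lr = 0.1
for epoch in range(300):
    # The loss is the average over ALL examples, so the gradient is too.
    grad = X.T @ (X @ w - y) / len(y)
    w -= lr * grad                  # exactly one update per epoch
```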
Stochastic gradient descent (SGD)
- a variant of batch gradient descent where the weights are updated after computing the gradient of the loss function with respect to a single training example at a time
- while each individual observation will provide a poor estimate of the true gradient, given enough randomness the parameters will converge to a good global estimate
- SGD itself has various "upgrade" variants
- AdaGrad
- a version of SGD that scales the learning rate for each parameter according to the history of gradients: the effective step is reduced for coordinates with very large past gradients, and vice versa
- uses per-coordinate scaling to allow for a computationally efficient preconditioner
- Momentum (Gradient Descent#Momentum): helps accelerate SGD by orienting the gradient descent in the relevant direction and reducing oscillations
- RMSprop (Gradient Descent#RMSprop (Root Mean Square prop)) and Adam (Gradient Descent#Adam (Adaptive Moment Estimation)): frequently used in neural networks
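A minimal AdaGrad sketch (toy quadratic objective; the learning rate and starting point are arbitrary):

```python
import numpy as np

def adagrad_step(w, grad, state, lr=0.5, eps=1e-8):
    # Accumulate squared gradients per coordinate; coordinates with a
    # history of large gradients get a smaller effective learning rate.
    state = state + grad ** 2
    w = w - lr * grad / (np.sqrt(state) + eps)
    return w, state

# Usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
state = np.zeros_like(w)
for _ in range(100):
    w, state = adagrad_step(w, w.copy(), state)
```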
Batch gradient descent vs. Stochastic gradient descent
- batch gradient descent is slow (one update per full pass over the data); SGD can escape local minima but is noisy and gives up the efficiency of vectorization
- different ways to update weights

Mini-batch gradient descent
- Mini-batch gradient descent is a compromise between batch and stochastic gradient descent: the weights are updated after computing the gradient of the loss function with respect to a small batch of training examples (typically between 32 and 512)
- enables efficient multi-machine, multi-GPU, and other parallel processing
- Comparison: SGD vs. mini-batch gradient descent


- when to use
- when the data size is very large (>= 2000)
- what to use (mini-batch size b, training-set size m)
- if b = m: batch gradient descent
- if b = 1: stochastic gradient descent
- usually b: 64, 128, ...
- the mini-batch should fit in CPU/GPU memory
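One run of mini-batch gradient descent can be sketched as follows (linear-regression toy data; the sizes and learning rate are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
target_w = np.array([1.5, -0.5])
y = X @ target_w                  # noise-free targets for illustration

w = np.zeros(2)
lr, batch_size = 0.1, 64          # a typical mini-batch size (32..512)
for epoch in range(30):
    idx = rng.permutation(len(y))              # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]      # one mini-batch
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad                         # update after each batch
```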
Exponentially weighted averages
(Exponentially weighted moving averages)
- describes the trend using local averages
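The recurrence v(t) = β·v(t−1) + (1−β)·x(t) can be sketched as:

```python
def ewma(xs, beta=0.9):
    # Each output is a local average over roughly the last 1/(1-beta) points.
    v, out = 0.0, []
    for x in xs:
        v = beta * v + (1.0 - beta) * x
        out.append(v)
    return out
```

With beta = 0.9 the average effectively spans about the last 10 values; a larger beta gives a smoother, slower-moving trend.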
Momentum
- The idea behind gradient descent with momentum is to add a "momentum" term to the weight update rule (stochastic gradient descent) that encourages the weight updates to continue in the same direction as the previous updates
- it aggregates a history of past gradients to accelerate convergence and reduce oscillations in the optimization process
- The weight update rule for gradient descent with momentum can be written as:
- v(t) = β · v(t−1) + (1 − β) · gradient
- weight(t+1) = weight(t) − learning rate · v(t)
- where
- v(t) is the "velocity" vector at time t
- β is the momentum coefficient (typically set to a value between 0.9 and 0.99)
- learning rate is the step size
- gradient is the gradient of the loss function with respect to the weights
- weight(t) is the weight vector at time t
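The momentum update, sketched on a toy quadratic (the learning rate and starting point are arbitrary):

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.1, beta=0.9):
    v = beta * v + (1.0 - beta) * grad   # v(t) = beta*v(t-1) + (1-beta)*gradient
    w = w - lr * v                       # weight(t+1) = weight(t) - lr*v(t)
    return w, v

# Usage: minimize f(w) = 0.5 * ||w||^2 (gradient = w).
w, v = np.array([4.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, w.copy(), v)
```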
RMSprop (Root Mean Square prop)
- it works by maintaining a moving average of the squared gradients and using that to scale the learning rate
- it decouples per-coordinate scaling from a learning rate adjustment
- a decay coefficient determines how much of the gradient history is used when adjusting the per-coordinate scale
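A minimal RMSprop step (toy objective; the hyperparameters are arbitrary):

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    # Moving average of squared gradients scales each coordinate's step.
    s = beta * s + (1.0 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s

# Usage: minimize f(w) = 0.5 * ||w||^2 (gradient = w).
w, s = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(400):
    w, s = rmsprop_step(w, w.copy(), s)
```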
Adam (Adaptive Moment Estimation)
- Adam is a combination of two other optimization algorithms, namely, stochastic gradient descent with momentum and RMSprop.
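Adam can be sketched as momentum plus RMSprop-style scaling with bias correction (a minimal illustration; the defaults follow the common beta1=0.9, beta2=0.999 convention, while the learning rate and toy objective are arbitrary):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSprop)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the
    v_hat = v / (1 - beta2 ** t)              # zero-initialized moments
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: minimize f(w) = 0.5 * ||w||^2 (gradient = w).
w = np.array([3.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 501):                       # t starts at 1 for correction
    w, m, v = adam_step(w, w.copy(), m, v, t)
```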
Learning rate decay
- reduces the learning rate over time as the optimization process progresses
- the idea behind learning rate decay is that as the algorithm gets closer to the minimum of the loss function, it may benefit from smaller steps to avoid overshooting or oscillations around the minimum.
- can be combined with other algorithms such as SGD and Adam
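One common schedule is 1/t decay (the decay_rate default here is an arbitrary example):

```python
def decayed_lr(lr0, epoch, decay_rate=1.0):
    # lr = lr0 / (1 + decay_rate * epoch): later epochs take smaller steps,
    # reducing overshooting and oscillation near the minimum.
    return lr0 / (1.0 + decay_rate * epoch)
```

Other schedules (exponential decay, step decay) follow the same idea; a schedule plugs into SGD or Adam by recomputing the rate each epoch.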