Adam Optimizer

Overview

SGD (Stochastic Gradient Descent) steps in the direction of steepest descent, scaling the raw gradient by a single fixed learning rate. Each update is simply: \(\theta \leftarrow \theta - \eta \cdot \nabla L\). Simple, but it uses the same learning rate for every parameter and can oscillate or stall.
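As a concrete sketch in NumPy (the parameter and gradient values here are made up purely for illustration), one SGD update looks like:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Made-up parameter vector and gradient, just to show the update rule.
theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.5])
theta = sgd_step(theta, grad)  # -> [0.95, -1.95]
```

Every coordinate is scaled by the same `lr`, which is exactly the limitation Adam addresses.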

Adam (Adaptive Moment Estimation) tracks two running statistics per parameter: the mean of past gradients (momentum) and the mean of past squared gradients (adaptive scale). This lets it take larger effective steps for parameters whose gradients are consistently small, and smaller steps where gradients are large or noisy. It also applies bias correction to both statistics in the early steps, while the running averages are still warming up.

Mathematical Formulation

Here’s the math in plain terms:

SGD update: \[\theta \leftarrow \theta - \eta \cdot \nabla L\]

Adam update (per parameter, at step \(t\)):

\[m \leftarrow \beta_1 \cdot m + (1-\beta_1) \cdot \nabla L \quad \text{(first moment: smoothed gradient)}\]

\[v \leftarrow \beta_2 \cdot v + (1-\beta_2) \cdot (\nabla L)^2 \quad \text{(second moment: tracks gradient magnitude)}\]

\[\hat{m} = \frac{m}{1-\beta_1^t}, \qquad \hat{v} = \frac{v}{1-\beta_2^t} \quad \text{(bias correction)}\]

\[\theta \leftarrow \theta - \eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \varepsilon} \quad \text{(bias-corrected, scaled step)}\]

The \(\beta_1\) slider controls how much Adam remembers past gradients (momentum). \(\beta_2\) controls how quickly it adapts to gradient magnitude changes. Try lowering \(\beta_2\) to see Adam become more reactive, or lowering \(\beta_1\) to reduce momentum smoothing.

Note that \(\hat{m}\) is the bias-corrected first moment (average gradient) and \(\hat{v}\) is the bias-corrected second moment (average squared gradient).
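Putting the equations together, a single Adam step can be sketched in NumPy as follows. The hyperparameter defaults are the commonly used ones (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\varepsilon = 10^{-8}\)); the gradient values at the bottom are made up for illustration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector theta.

    t is the 1-based step count; m and v are the running first- and
    second-moment estimates, carried over between calls.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (scale)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One step from the origin with a made-up gradient.
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
grad = np.array([0.3, -0.1])
theta, m, v = adam_step(theta, grad, m, v, t=1)
# After bias correction, the first step has magnitude ~lr per coordinate,
# regardless of the gradient's scale.
```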

Adam often converges faster than SGD in practice because its per-parameter scaling adapts step sizes to the local shape of the loss surface.

Iris dataset and the figure above

The Iris dataset (Fisher, 1936; widely used as a toy benchmark) contains 150 examples of iris flowers, 50 each from three species:

  • Iris setosa
  • Iris versicolor
  • Iris virginica

Each example has 4 numeric measurements, in centimeters:

  • sepal length
  • sepal width
  • petal length
  • petal width

The task is to classify each flower into one of the three species based on these measurements, so it is a multiclass classification problem.
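A quick way to inspect the dataset (a sketch assuming scikit-learn, which ships a copy of Iris):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # X: (150, 4) measurements, y: species 0..2
print(X.shape)                  # (150, 4)
print(iris.target_names)        # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)       # sepal/petal length and width, in cm
```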

The interactive demo (and the loss landscape in the figure above) uses only two of those species: 100 flowers (50 per class), in the usual dataset ordering—the first two blocks of 50 samples—so the problem is binary classification, not three-way. Inputs are sepal length and petal length only, each standardized to zero mean and unit variance. The trained model is binary logistic regression:

  • logit \(z = w_1 x_1 + w_2 x_2\)
  • class probability \(\sigma(z)\)

and the objective is mean binary cross-entropy plus L2 regularization on \((w_1, w_2)\) (coefficient \(\lambda = 0.1\) in the code). Restricting to two classes keeps the learnable parameters to a single pair \((w_1, w_2)\), so the loss can be shown as a 2D surface; a full three-class softmax would need more weights and a higher-dimensional plot.

The horizontal and vertical axes in the figure above are the coefficients \(w_1\) and \(w_2\)—not raw flower measurements. Contours mark iso-loss levels of that regularized objective. The curves show how SGD and Adam traverse the landscape from the same start point; a full three-class Iris model would use all four measurements, a bias term, and more weights.
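To get a feel for why the two trajectories differ, here is a small self-contained comparison on an ill-conditioned quadratic bowl \(f(w) = 5w_1^2 + \tfrac{1}{2}w_2^2\). This is a stand-in for the figure's landscape, not the actual Iris objective; the learning rates and step counts are arbitrary choices for illustration:

```python
import numpy as np

# Gradient of f(w) = 5*w1^2 + 0.5*w2^2, a bowl that is much steeper
# along w1 than along w2 (condition number 10).
grad = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])

def run_sgd(w0, lr=0.05, steps=200):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * grad(w)                 # same step scale for both axes
    return w

def run_adam(w0, lr=0.05, steps=200, b1=0.9, b2=0.999, eps=1e-8):
    w = np.array(w0, dtype=float)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        w -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return w

start = [2.0, 2.0]
# Both runs end near the minimum at (0, 0), but they trace very
# different paths: SGD shrinks the steep w1 axis much faster than w2,
# while Adam's per-axis scaling moves both coordinates at similar rates.
print(run_sgd(start))
print(run_adam(start))
```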