Gradient Descent Explained: How It Works & Why It’s Key

Gradient Descent

What is Gradient Descent?

Simple Analogy

Why is Gradient Descent Important for Optimization?

Mathematical Formulation
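
For reference, the update rule this formulation describes is the standard one (written here in its textbook form; η denotes the learning rate):

\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)

Each step moves the parameters θ a small distance in the direction of steepest descent of the loss L; the size of that step is controlled by η.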

The Learning Rate & Its Impact

[Figure: Effect of Learning Rate on Gradient Descent — a loss curve with three descent trajectories: green arrows mark a stable descent, blue arrows a slow descent caused by a small learning rate, and red arrows the oscillation and divergence caused by a large learning rate.]
[Figure: Getting stuck in a local minimum — blue arrows trace a slow descent with a small learning rate that settles into a local minimum instead of reaching the global minimum.]
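
To make this concrete, here is a minimal sketch (not from the article) that runs gradient descent on a simple quadratic loss f(x) = x², using three example learning rates chosen to show slow convergence, healthy convergence, and divergence:

# Minimal illustration of how the learning rate changes gradient descent
# behavior on f(x) = x**2. The learning rates below are arbitrary examples.
def descend(lr, steps=20, x0=2.0):
    x = x0
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x = x - lr * grad     # gradient descent update
    return x

for lr in (0.01, 0.1, 1.05):
    print(f"lr={lr}: final x = {descend(lr):.4f}")

# lr=0.01 crawls toward the minimum, lr=0.1 converges close to 0,
# and lr=1.05 overshoots on every step and diverges.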

Momentum in Gradient Descent

What is Momentum?

Momentum Formula
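
As a reference, the classical momentum update (textbook form; β is the momentum coefficient, typically around 0.9, and η the learning rate) is:

v_t = \beta \, v_{t-1} + \nabla_\theta L(\theta_{t-1})
\theta_t = \theta_{t-1} - \eta \, v_t

The velocity term v accumulates past gradients, which smooths the update direction and speeds up progress along shallow, consistent slopes. PyTorch's optim.SGD with momentum=0.9 (used in the code below) follows essentially this convention.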

RMSProp (Root Mean Square Propagation)

Understanding RMSProp
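
For reference, RMSProp's update (standard form; ρ is the decay rate, commonly 0.9 or 0.99, and ε a small constant for numerical stability) keeps a running average of squared gradients and divides each step by its square root:

E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t

Parameters with consistently large gradients receive smaller effective steps, while parameters with small gradients receive relatively larger ones.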

Adam (Adaptive Moment Estimation)

What is Adam Optimizer?

Adam Formula
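
For reference, Adam's update in its standard form (β₁ and β₂ are the decay rates for the first- and second-moment estimates, commonly 0.9 and 0.999; g_t is the gradient at step t):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

Adam combines momentum's running mean of gradients with RMSProp's per-parameter scaling, plus a bias correction for the early steps.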




Implementation in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Generate synthetic data (regression): y = x^3 + noise
N = 200
x = np.linspace(-1, 1, N).reshape(-1, 1)
y = x**3 + 0.1 * np.random.randn(N, 1)

# Convert numpy arrays to torch tensors
x_tensor = torch.tensor(x, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# Define a simple feedforward neural network model
def get_model():
    model = nn.Sequential(
        nn.Linear(1, 10),
        nn.ReLU(),
        nn.Linear(10, 1)
    )
    return model

# Training function: runs the training loop and records loss history
def train_model(model, optimizer, x, y, epochs=200):
    criterion = nn.MSELoss()
    loss_history = []
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
        loss_history.append(loss.item())
    return loss_history

# Prepare a dictionary of optimizers to compare
optimizer_configs = {
    "SGD": lambda model: optim.SGD(model.parameters(), lr=0.01),
    "SGD_momentum": lambda model: optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "RMSprop": lambda model: optim.RMSprop(model.parameters(), lr=0.01),
    "Adam": lambda model: optim.Adam(model.parameters(), lr=0.01)
}

# Save an initial state dict so that all models start from the same weights
base_model = get_model()
initial_state = base_model.state_dict()

# Dictionary to store the loss history for each optimizer
results = {}

# Train a new model with each optimizer using the same initialization
for opt_name, opt_func in optimizer_configs.items():
    model = get_model()
    model.load_state_dict(initial_state)  # Ensure identical starting weights
    optimizer = opt_func(model)
    loss_history = train_model(model, optimizer, x_tensor, y_tensor, epochs=200)
    results[opt_name] = loss_history

# Plot the loss curves for each optimizer
plt.figure(figsize=(8, 6))
for opt_name, loss_history in results.items():
    plt.plot(loss_history, label=opt_name)
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error (MSE) Loss')
plt.title('Comparison of Optimization Algorithms')
plt.legend()
plt.grid(True)
plt.show()
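
As a small optional follow-up to the listing above (assuming it runs in the same script), the final loss reached by each optimizer can be printed from the results dictionary:

# Print the last recorded MSE loss for each optimizer
for opt_name, loss_history in results.items():
    print(f"{opt_name}: final MSE = {loss_history[-1]:.4f}")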

Results of the Different Algorithms

[Figure: Comparison of SGD, SGD with Momentum, RMSprop, and Adam by MSE loss over 200 training epochs — Adam and RMSprop converge fastest, while plain SGD takes longer to reach a low MSE.]

Understanding the Training Process and Results

Why Didn’t We Focus on Standard SGD?
