
What is Gradient Descent in Machine Learning?


What is Gradient Descent?

Gradient descent is a fundamental optimisation algorithm used in machine learning to minimise a cost function by iteratively adjusting a model's parameters. It works by computing the gradient of the cost function with respect to the parameters and updating them in the opposite direction of the gradient, gradually moving towards the minimum of the cost function. Whether the model is a linear regression or a complex neural network, the goal remains the same: to find the parameters that best fit the given data.

The intuition behind gradient descent comes from calculus. The gradient represents the direction of steepest ascent of a function; conversely, the negative gradient points in the direction of steepest descent. By iteratively adjusting the parameters in the opposite direction of the gradient, the algorithm converges towards the minimum of the cost function.
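
As a minimal sketch of this idea in Python, the core update rule looks like the snippet below; the cost function f and its derivative grad_f are illustrative placeholders, not part of any library.

    def f(x):
        return (x - 3) ** 2      # cost function with a minimum at x = 3

    def grad_f(x):
        return 2 * (x - 3)       # derivative of f

    x = 0.0                      # initial guess
    learning_rate = 0.1

    for _ in range(100):
        # Step against the gradient: x <- x - learning_rate * f'(x)
        x = x - learning_rate * grad_f(x)

    print(x)  # approaches 3.0, the minimum of f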

Gradient descent plays a crucial role in training machine learning models by efficiently finding the parameters that best fit the training data. Its simplicity and effectiveness make it a cornerstone technique and a workhorse algorithm across many domains.

Mechanics


1. Initialisation

The process begins with initialising the parameters of the model with random values. These parameters could be weights in a neural network or coefficients in a regression model.

2. Compute Gradient

Next, the gradient of the cost function with respect to each parameter is computed. This gradient indicates the slope of the cost function in the parameter space.

3. Update Parameters

The parameters are then updated in the direction opposite to the gradient. This update is determined by a learning rate hyperparameter, which controls the size of the steps taken towards the minimum.

4. Convergence

Steps 2 and 3 are repeated iteratively until a stopping criterion is met. This criterion could be a predefined number of iterations or a threshold for the change in the cost function.
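
Putting the four steps together, here is a hedged sketch of batch gradient descent fitting a simple linear regression with NumPy. The toy data, variable names, and stopping threshold are illustrative assumptions, not a prescribed recipe.

    import numpy as np

    # Toy data: y = 2x + 1 plus a little noise (illustrative only).
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 2 * x + 1 + rng.normal(0, 0.1, size=100)

    # 1. Initialisation: random weight and bias.
    w, b = rng.normal(), rng.normal()
    learning_rate = 0.1

    for step in range(1000):
        # 2. Compute the gradient of the mean squared error w.r.t. w and b.
        errors = (w * x + b) - y
        grad_w = 2 * np.mean(errors * x)
        grad_b = 2 * np.mean(errors)

        # 3. Update the parameters in the opposite direction of the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

        # 4. Convergence: stop once the gradient is negligibly small.
        if max(abs(grad_w), abs(grad_b)) < 1e-6:
            break

    print(w, b)  # should end up close to 2 and 1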

Variations

Gradient descent comes in several flavours, each with its own characteristics suited to different scenarios.

Let's briefly explore each of them.

  • Batch Gradient Descent
    This is the vanilla version, where the gradient is computed using the entire dataset. While accurate, it can be computationally expensive for large datasets.
  • Stochastic Gradient Descent (SGD)
    In SGD, the gradient is computed using a single randomly chosen data point. Each update is much faster than in the batch variant, but the parameter updates exhibit more variance.
  • Mini-batch Gradient Descent
    A compromise between the batch and stochastic variations, mini-batch gradient descent computes the gradient using a small subset of the data. It combines the advantages of both approaches, offering faster convergence while maintaining stability (see the sketch after this list).
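
As an illustrative sketch (under the same toy linear regression set-up assumed above), the only real difference between these variants is how much data each gradient step sees; the batch_size argument below is a hypothetical knob used to switch between them.

    import numpy as np

    def gradient_step(x, y, w, b, learning_rate, batch_size=None):
        """One gradient descent step for linear regression.

        batch_size=None -> batch gradient descent (full dataset)
        batch_size=1    -> stochastic gradient descent
        batch_size=32   -> mini-batch gradient descent
        """
        n = len(y)
        if batch_size is None or batch_size >= n:
            idx = np.arange(n)                                         # full batch
        else:
            idx = np.random.choice(n, size=batch_size, replace=False)  # random subset

        errors = (w * x[idx] + b) - y[idx]
        grad_w = 2 * np.mean(errors * x[idx])
        grad_b = 2 * np.mean(errors)
        return w - learning_rate * grad_w, b - learning_rate * grad_b

Calling this function in a loop with batch_size=1 reproduces SGD's noisy but cheap updates, while leaving batch_size unset recovers the batch behaviour described above.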

Challenges and Considerations

Learning Rate Tuning: Choosing an appropriate learning rate is crucial for the convergence and stability of the algorithm. A learning rate that is too small leads to slow convergence, while one that is too large can cause the updates to overshoot and diverge.
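
As a quick hedged illustration of this trade-off, the snippet below minimises f(x) = x^2 with three arbitrary learning rates; the exact values are assumptions chosen only to show the contrast.

    def descend(learning_rate, steps=50):
        """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
        x = 1.0
        for _ in range(steps):
            x -= learning_rate * 2 * x
        return x

    print(descend(0.01))  # too small: x is still around 0.36 after 50 steps
    print(descend(0.4))   # well chosen: x is essentially 0
    print(descend(1.1))   # too large: the iterates overshoot and diverge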

Local Minima and Plateaus: Gradient descent is susceptible to getting stuck in local minima or plateaus, regions where the gradient becomes close to zero, hindering further progress towards the global minimum.

Saddle Points: In high-dimensional spaces, saddle points pose a challenge. These are points where the gradient is zero yet which are not minima: the cost function curves upwards in some directions and downwards in others, which can lead to slow convergence.