The mathematician Augustin-Louis Cauchy invented gradient descent in 1847 to perform astronomical calculations and estimate the orbits of stars. Learn about the role it plays today in optimizing machine learning algorithms.
Gradient descent is an algorithm you can use to train both machine learning models and neural networks. It optimizes a model's parameters by minimizing a cost function, which measures how far the model's predictions are from the correct values at each set of parameters. Gradient descent existed as a mathematical concept long before the emergence of machine learning.
In vector calculus, a gradient is similar to the slope but applies when a function has more than one variable. It is the vector of partial derivatives with respect to all independent variables, denoted ∇f for the direction of steepest increase of the function and -∇f for the direction of steepest decrease.
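For example, for the function f(x, y) = x² + 3y², the gradient is ∇f = (2x, 6y); at the point (1, 1), ∇f = (2, 6) points in the direction of steepest increase, and -∇f = (-2, -6) points in the direction of steepest decrease.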
Implementing gradient descent in machine learning requires some knowledge of calculus. Continue reading for a basic understanding of gradient descent, what people use it for, its different types, and how it works in machine learning.
Gradient descent works with convex functions, finding the lowest point of a curve in as few and as accurate steps as possible, optimizing the path it takes. Let’s go over a few terms that inform gradient descent before examining how it works:
Parameters: The coefficients of the function that minimize the cost
Cost function: Also called the “loss function” in machine learning, this is the difference between the actual and predicted values at the current parameters. A model stops learning once this function gets as close as possible to 0.0
Learning rate: Sometimes referred to as the step size or alpha, this is the magnitude of the steps the algorithm takes as it minimizes the cost
The primary purpose of gradient descent is to find the coefficients that best minimize the cost, driving the cost function to 0.0 or as close to it as possible.
The goal is to reach a cost of 0.0 or the closest acceptable minimum. To calculate this, follow the steps below (a short code sketch after the list shows them in action):
Write the cost function as cost = f(x), with x as the coefficient.
Use a starting coefficient of 0.0 or any small number.
Take the derivative (or the partial derivatives if multiple variables are present) to find the gradient, which tells you which direction to move in on the curve.
Once you have the gradient (the derivative of the cost function) and know which way to move, use your learning rate to determine how much the coefficient changes with every update.
Repeat until the cost is zero or as close to zero as it can get.
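As a concrete illustration, here is a minimal Python sketch of these steps (not taken from the article itself). The toy cost function f(x) = (x - 3)², the starting coefficient of 0.0, and the learning rate of 0.1 are all assumptions chosen purely for demonstration.

```python
# Minimal gradient descent sketch for the convex toy cost f(x) = (x - 3)^2.
# The true minimum is at x = 3, where the cost reaches 0.0.

def cost(x):
    return (x - 3) ** 2

def gradient(x):
    # Derivative of the cost function with respect to the coefficient x
    return 2 * (x - 3)

x = 0.0              # start with a coefficient of 0.0
learning_rate = 0.1  # the magnitude of each step

for step in range(100):
    grad = gradient(x)            # which direction to move on the curve
    x = x - learning_rate * grad  # move against the gradient
    if cost(x) < 1e-8:            # stop once the cost is close enough to zero
        break

print(f"coefficient = {x:.4f}, cost = {cost(x):.8f}")
```

Each pass through the loop repeats the last three steps: compute the gradient, step the coefficient in the opposite direction, and check whether the cost is close enough to zero to stop.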
Gradient descent involves knowledge of calculus, but its implementation is always the same series of steps.
Machine learning uses two main types of gradient descent:
Batch gradient descent: Updates the machine learning model once per training epoch by averaging the error between predictions and actual outcomes of the cost function across the entire data set before adjusting the coefficients.
Stochastic gradient descent (SGD): Calculates the error for every sample in the data set, making a prediction and recalculating the coefficients after each individual instance within a training epoch (see the sketch after this list).
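To make the difference concrete, here is a minimal sketch (an illustration written for this explanation, not a prescribed implementation) contrasting the two update schemes on a toy one-parameter model y = w * x. The data, learning rate, and epoch count are arbitrary assumptions chosen for demonstration.

```python
import random

# Toy data for a one-parameter model: predict y = w * x (the true w is 2.0).
data = [(x, 2.0 * x) for x in range(1, 11)]
learning_rate = 0.01

def batch_gradient_descent(epochs=50):
    w = 0.0
    for _ in range(epochs):
        # One update per epoch: average the error gradient over the whole data set.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w

def stochastic_gradient_descent(epochs=50):
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        # One update per sample: make a prediction and adjust w immediately.
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= learning_rate * grad
    return w

print("batch estimate of w:", round(batch_gradient_descent(), 4))
print("SGD estimate of w:  ", round(stochastic_gradient_descent(), 4))
```

Both versions recover a coefficient close to 2.0; the difference is that batch gradient descent makes one averaged update per epoch, while SGD makes an update for every sample.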
Batch gradient descent and stochastic gradient descent each have unique advantages and disadvantages in machine learning. Let’s take a look at each:
| Batch gradient descent | Stochastic gradient descent |
|---|---|
| More computationally efficient, since it processes the full data set in one pass per update | Takes more computing power, since it makes an update for every sample |
| Has a lower update frequency, leading to a more stable error gradient and a steadier convergence toward 0.0 | Has a higher update frequency, leading to faster learning and quicker insights into model performance |
| Because it averages over the entire data set, it can settle on a stable result without finding the best possible coefficients | Because it makes a prediction at every step, its more frequent updates can produce more accurate coefficients before convergence |
| Requires more memory for large data sets because all of the data must fit at once | Handles large data sets more easily since it processes one sample at a time |
Batch gradient descent is a common approach to machine learning, but stochastic gradient descent performs better on larger data sets.
If you need aspects of both batch gradient descent and SGD, consider a method called mini-batch gradient descent that combines them. It splits the data set into small batches and performs an SGD-style update after each batch, which keeps the learning on each batch quick while remaining computationally efficient overall. This method is standard in machine learning, the training of neural networks, and deep learning applications.
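A minimal sketch of mini-batch gradient descent on the same kind of toy one-parameter model might look like the following; the batch size of 4, the learning rate, and the data are illustrative assumptions rather than recommended settings.

```python
import random

# Toy data for a one-parameter model: predict y = w * x (the true w is 2.0).
data = [(x, 2.0 * x) for x in range(1, 11)]
learning_rate = 0.005
batch_size = 4

w = 0.0
for epoch in range(50):
    random.shuffle(data)
    # Split the data set into small batches and update once per batch,
    # averaging the gradient over the samples in that batch.
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad

print("mini-batch estimate of w:", round(w, 4))
```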
While gradient descent is an efficient way to optimize machine learning algorithms, the algorithm runs into some common problems that may leave you with models that aren’t fully optimized. On curves that are not entirely convex, points other than the global minimum can make the slope of the cost function equal to 0.0, causing the algorithm to stop too early (a short sketch after the list below illustrates this). These two kinds of points are:
Local minima: Points with a slope of 0.0 that look like the global minimum to the algorithm but are only local dips; the cost function rises again after them before eventually descending to the global minimum
Saddle points: Points or flat regions with a slope of 0.0 where the cost function temporarily stops decreasing before it continues its descent toward the global minimum
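As a toy illustration of the local-minimum problem (the cost function below is made up purely for demonstration), gradient descent started on one side of the curve settles in a local minimum, while a different starting point reaches the lower global minimum:

```python
# Toy non-convex cost with a local minimum near x = 0.96 and a lower
# global minimum near x = -1.03.
def cost(x):
    return (x**2 - 1) ** 2 + 0.3 * x

def gradient(x):
    return 4 * x * (x**2 - 1) + 0.3

def descend(x, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Both runs stop where the slope is 0.0, but only one finds the global minimum.
for start in (2.0, -2.0):
    x = descend(start)
    print(f"start {start:+.1f} -> x = {x:+.3f}, cost = {cost(x):+.3f}")
```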
In applying gradient descent to deep learning neural networks, two issues arise:
Vanishing gradients: Occur during the backpropagation of a neural network when the gradient becomes too small; the coefficient updates shrink toward zero and the earlier layers of the network effectively stop learning
Exploding gradients: Occur when the gradient grows too large, making the model unstable and pushing the coefficients toward values so extreme they can no longer be represented as usable numbers (a toy numerical illustration of both effects follows this list)
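As a rough numerical illustration (a deliberately simplified toy, not how backpropagation actually computes gradients), a gradient passed backward through many layers behaves roughly like a product of per-layer factors; factors below 1 shrink it toward zero, while factors above 1 blow it up:

```python
# Toy illustration: treat the backpropagated gradient as a product of
# 50 per-layer factors.
layers = 50

for factor, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    grad = 1.0
    for _ in range(layers):
        grad *= factor
    print(f"{label}: per-layer factor {factor} -> gradient after {layers} layers = {grad:.3e}")
```

With a factor of 0.5 the gradient collapses to roughly 1e-15, so the early layers barely update; with a factor of 1.5 it grows to roughly 6e8 and keeps growing as the network deepens.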
Learn more about optimizing your machine learning models using gradient descent by taking online courses. For example, you can explore the Mathematics for Machine Learning and Data Science Specialization from DeepLearning.AI on Coursera to learn the fundamental mathematics you will need. The Specialization helps you cultivate the calculus knowledge and skills necessary to perform gradient descent for machine learning applications.