Stochastic gradient descent, or SGD, is a very popular and powerful optimization algorithm used in machine learning. SGD is an iterative algorithm that compares candidate solutions until an optimum (or good-enough) solution is found. It is widely used in training neural networks.
Before looking at Stochastic Gradient Descent itself, let’s first understand what plain Gradient Descent is.
Gradient means a slope, either upward or downward, and descent means stepping down. Hence, gradient descent simply means stepping down a slope to reach its lowest point. In machine learning, the objective of gradient descent is to find the minimum value of the objective function, so that the final result is optimal or at least satisfactory.
Let’s take an example: imagine you are blindfolded and want to climb to the top of a hill in as few steps as possible. Initially you take big steps in the steepest direction, but as you get close to the top your steps become smaller and smaller to avoid overshooting it. Now, instead of climbing up, think of gradient descent as hiking down to the valley floor. This is the better way to visualize the algorithm, since it is a minimizing algorithm.
Gradient descent initially takes bigger steps, and as it gets closer to the minimum of the objective function its steps get smaller and smaller. Gradient descent tries to find the point where the slope is close to zero, but not exactly zero: once the gradient reaches zero the parameters stop changing, and the model stops learning.
- Gradient descent starts by picking a random initial value for the parameters.
- The gradient of the objective function is then evaluated at these parameter values. (The gradient is the vector of partial derivatives of the objective with respect to each parameter.)
- The gradient is used to calculate the step size: step size = gradient * learning rate
Here, choosing the value of the learning rate is very important: if it is too large, the algorithm may overshoot the optimum point, and if it is too small, it will take much longer to find the minimum.
- The new parameters are then calculated as: new params = old params - step size
- This process is repeated until an optimum or satisfactory value of the objective function is found.
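The steps above can be sketched in a few lines of plain Python. The one-dimensional objective f(x) = (x - 3)², its gradient, the learning rate, and the starting point are all illustrative assumptions, not part of the original text:

```python
# Gradient descent on a simple one-dimensional objective f(x) = (x - 3)**2,
# whose minimum is at x = 3. Its derivative is f'(x) = 2 * (x - 3).
def gradient(x):
    return 2 * (x - 3)

learning_rate = 0.1   # small enough not to overshoot the minimum
params = 10.0         # arbitrary initial value for the parameter

for _ in range(100):
    step_size = gradient(params) * learning_rate  # step size = gradient * learning rate
    params = params - step_size                   # new params = old params - step size

print(round(params, 4))  # 3.0 — converges to the minimum at x = 3
```

Note how the step size shrinks automatically as the gradient approaches zero near the minimum, even though the learning rate stays fixed.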
The downside of the gradient descent algorithm is the amount of computation it performs in each iteration. Suppose we have 50,000 data points and 50 features. We calculate a derivative of the function with respect to each feature for every point, so that is 50,000 x 50 = 2,500,000 computations per iteration. It is common to run at least 1,000 iterations, giving 2,500,000 x 1,000 = 2,500,000,000 computations to complete the algorithm. This is why gradient descent is very slow on huge datasets.
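The back-of-the-envelope count above can be checked in a couple of lines:

```python
# Cost estimate for full-batch gradient descent from the text.
n_points, n_features, n_iterations = 50_000, 50, 1_000

per_iteration = n_points * n_features  # one partial derivative per point per feature
total = per_iteration * n_iterations

print(per_iteration)  # 2500000
print(total)          # 2500000000
```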
Stochastic Gradient Descent (SGD)
‘Stochastic’ means involving random variables. Instead of using the whole dataset to calculate derivatives in each iteration, SGD selects a small sample of the data points, called a ‘batch’, and computes the derivatives on it. This reduces the computation enormously. These batches are drawn from randomly shuffled data on each iteration of the algorithm.
The path taken by SGD to the optimum is usually noisier than that of gradient descent, since only a small sample of the dataset is used on each iteration. SGD usually takes more iterations to find the optimum than gradient descent, but the computational cost per iteration is much lower, because it works on samples instead of the whole dataset.
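A minimal sketch of this idea, assuming NumPy; the objective (estimating the data mean by minimizing squared error), the gradient function, the batch size, and the learning rate are all illustrative choices:

```python
import numpy as np

def sgd(grad_fn, data, params, learning_rate=0.01, batch_size=32, epochs=5):
    """Mini-batch SGD: shuffle the data each epoch and update the
    parameters using the gradient of one small batch at a time."""
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)              # random shuffle each pass
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            step = learning_rate * grad_fn(params, batch)
            params = params - step              # same update rule as gradient descent
    return params

# Toy use: the minimizer of mean((p - x)^2) is the mean of the data.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grad = lambda p, batch: 2 * (p - batch).mean()  # d/dp of mean((p - x)^2)
est = sgd(grad, data, params=0.0, learning_rate=0.1, batch_size=2, epochs=200)
```

Here `est` lands near the true mean of 3.0 but jitters around it, because each update follows the gradient of a random batch rather than of the whole dataset — exactly the noisy path described above.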
Now let’s look at an example of how to implement SGD.
First, let’s import the necessary libraries. Here we will use SGDClassifier, a linear model that is trained with stochastic gradient descent (SGD) learning.
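The original import cell is not shown; a plausible minimal set for this walkthrough would be:

```python
from sklearn.datasets import make_blobs          # to generate a toy dataset
from sklearn.linear_model import SGDClassifier   # linear classifier trained with SGD
```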
Now let’s get our dataset. Here we use make_blobs to create a dataset that contains 500,000 data points with 2 features, grouped around 2 centers.
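The dataset-creation code is likewise not shown; a sketch matching the description (the `random_state` value is an arbitrary choice for reproducibility):

```python
from sklearn.datasets import make_blobs

# 500,000 points, 2 features each, drawn around 2 cluster centers.
X, y = make_blobs(n_samples=500_000, n_features=2, centers=2, random_state=42)

print(X.shape)  # (500000, 2): one row per data point, one column per feature
```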
Let us now apply SGDClassifier to this dataset:
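The fitting code is not included in the text; a hedged sketch, assuming the make_blobs dataset above and scikit-learn's default hinge loss (a linear SVM objective). The train/test split and hyperparameter values are illustrative choices:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=500_000, n_features=2, centers=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# loss="hinge" trains a linear SVM with SGD updates; each update looks at a
# small portion of the training data, exactly as described above.
clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))  # accuracy should be high if the two blobs are well separated
```

Despite the 500,000 data points, training finishes quickly, which is the point of the section: the per-iteration cost of SGD does not scale with the full dataset size.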