Making Machine Learning Less Fancy: Gradient Descent

Oluwanifemi Bamgbose
6 min read · Feb 15, 2021

Machine learning is primarily concerned with minimizing the error of a model, or in other words, making the most accurate predictions possible.

Gradient Descent is one of the most important machine learning algorithms, and it finds use in everything from Linear Regression to Neural Networks.

Gradient Descent is used to minimize a cost (or loss) function. It does this by iteratively tweaking the parameters (or coefficients) of a given function until the function reaches a local minimum.

To properly explain gradient descent, I’ll first give a brief explanation of some fundamental concepts, and I’ll explain gradient descent in the context of Linear Regression.

Linear Regression

Linear Regression models the relationship between an independent variable (the CAUSE, x, such as the size of a house in square metres) and a dependent variable (the EFFECT, y, such as the price of the house in dollars). For every value of x, there is an observed value of y. These are plotted as shown below.

Figure: the independent variable plotted against the dependent variable, with data points shown.

Linear regression models this relationship by fitting a linear equation (or a straight line) to the available (observed) data.

This linear equation is represented as:

y’ = m*x + c

Where y’ represents a predicted value of the dependent variable, m represents the slope (or in more descriptive terms, “how the line tilts”) and c represents the intercept (where the line cuts the y axis, i.e. the value of y when x = 0).

Linear regression finds these two coefficients, which correspond to a line. This line is then used to predict new values of the dependent variable for a given value of the independent variable (by reading off the value on the y axis for the corresponding value on the x axis along the line).
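To make this concrete, here is a minimal sketch in Python. The coefficient values and the 120 square metre house are made-up numbers, purely for illustration.

```python
# Hypothetical coefficients of an already-fitted line, for illustration only.
m = 1500.0   # slope: dollars per square metre
c = 20000.0  # intercept: base price in dollars (the value of y when x = 0)

def predict(x):
    """Return the predicted price y' for a house of size x square metres."""
    return m * x + c

print(predict(120))  # predicted price for a 120 square metre house -> 200000.0
```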

But how is the line drawn in the first place?

The line is first drawn with random values of those coefficients (m and c). It is then updated based on how correct or wrong the line is at predicting the current (available/observed) data, i.e. how the values of y’ produced by plugging each value of x into the linear equation above (with the current coefficients) compare to the actual (observed) values of y for those same values of x.

How does linear regression know how correct or wrong the line is?

This is where the error/cost (or loss) function comes in. The cost function is a representation of how well (or badly) the model is predicting ‘y’ given its current coefficients. It measures the difference between the actual output “y” and the predicted output “y’” from the model. There are many different types of cost functions used in Machine Learning; one of the most common is Mean Squared Error.


Mean squared error is calculated as the name implies: by taking the mean of the squared difference between the actual and predicted values over all training examples (a training example can be something like a single record of a house’s size in square metres (x) and its price (y)). The formula is shown below:

MSE = (1/n) * Σ (y − y’)², summed over all n training examples

Where n represents the number of training examples, y represents the actual values and y’ represents the predicted values (which are obtained from the linear equation y’ = m*x + c, keeping in mind that ‘m’ and ‘c’ are the coefficients of the particular line that the loss function is being calculated for). The cost function can be represented as J(m,c), where m and c are the parameters of the linear equation.
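As a quick sketch, this is what that calculation looks like in Python; the actual and predicted prices below are made-up numbers.

```python
# Mean Squared Error over a handful of made-up training examples.
def mean_squared_error(y_actual, y_predicted):
    n = len(y_actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted)) / n

y_actual = [200000, 250000, 310000]     # observed prices y (hypothetical)
y_predicted = [195000, 260000, 300000]  # prices y' predicted by the current line
print(mean_squared_error(y_actual, y_predicted))  # 75000000.0
```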

We know how badly the model is performing, now what?

Now THIS is where gradient descent comes in. As defined earlier, gradient descent minimizes this cost function. Why? Because when the value of the cost function is as low as possible (at its minimum), the model (i.e. the line) is predicting values as accurately as possible.

How then does Gradient Descent Work?

The graph below shows the Cost function (J) against the parameter ‘m’, which is the slope of the line (using only the slope in this case to keep things as simple as possible), represented as w (weight).

Figure: the cost function J plotted against the weight w.

Gradient descent is an iterative process (done repeatedly, with each step learning a small change from the previous one, towards an end goal). It starts by randomly initializing the parameters (i.e. the initial slope and intercept that correspond to the random initial line), then takes steps in the direction of the minimum. Each step is taken in proportion to the derivative of the cost function at the current point. The derivative is the most important part here: it is the slope of the cost function at a given point. Calculating this slope tells us in which direction (sign) and by how much to change the coefficients (slope and intercept) to lower the cost function. By this means, gradient descent provides us with new values of the coefficients (slope and intercept) that reduce the cost function.

The gradient at the point in question is the partial derivative of the cost function with respect to the parameter being updated.

This process (for this case of linear regression with one variable) is represented by the update rules shown below:

m := m − 𝛼 * ∂/∂m J(m,c)
c := c − 𝛼 * ∂/∂c J(m,c)

Where ∂/∂m J(m,c) and ∂/∂c J(m,c) are the partial derivatives of the cost function with respect to the coefficients of the linear equation (which could also be called the weight and bias), ‘:=’ is the assignment or “update” operator and “alpha 𝛼” is the learning rate.

The updates are to be applied simultaneously (i.e. both new values should be calculated before either parameter is changed) and repeated until convergence (the point where Gradient Descent makes VERY small changes to the parameters of the function, meaning it is very near the minimum, if not on it).
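Putting the pieces together, here is a minimal sketch of the whole procedure in Python. The tiny dataset, the learning rate of 0.01 and the 5,000 iterations are assumptions chosen purely for illustration.

```python
# Gradient descent for one-variable linear regression with an MSE cost.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # independent variable (e.g. size)
ys = [3.1, 4.9, 7.2, 8.8, 11.1]    # dependent variable (e.g. price)

m, c = 0.0, 0.0   # start from arbitrary initial coefficients
alpha = 0.01      # learning rate
n = len(xs)

for step in range(5000):
    # Predictions of the current line for every training example.
    preds = [m * x + c for x in xs]

    # Partial derivatives of the MSE cost with respect to m and c.
    dm = (-2.0 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
    dc = (-2.0 / n) * sum(y - p for y, p in zip(ys, preds))

    # Simultaneous update: both gradients were computed before either
    # coefficient was changed.
    m = m - alpha * dm
    c = c - alpha * dc

print(m, c)
```

For this data (which was roughly generated from the line y = 2x + 1), the loop should settle close to a slope of 2 and an intercept of 1.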

The learning rate is one of the most important parts of gradient descent, because it determines how large the “learning steps” taken in the direction of the minimum are.


It is therefore important to choose an appropriate learning rate, because a very high learning rate can result in gradient descent overshooting the minimum and never converging, while a very low learning rate can result in taking far too long to reach convergence, as sketched below.

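Here is a rough way to see both failure modes on the same tiny dataset used above; the specific alpha values and step counts are arbitrary choices made only to show the contrast.

```python
# Fit the line with gradient descent for a given learning rate and step count.
def fit(xs, ys, alpha, steps):
    m, c, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        preds = [m * x + c for x in xs]
        dm = (-2.0 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
        dc = (-2.0 / n) * sum(y - p for y, p in zip(ys, preds))
        m, c = m - alpha * dm, c - alpha * dc
    return m, c

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]

print(fit(xs, ys, alpha=0.01, steps=1000))   # sensible rate: settles near the minimum
print(fit(xs, ys, alpha=0.001, steps=1000))  # very low rate: same step budget, still noticeably off
print(fit(xs, ys, alpha=0.2, steps=50))      # very high rate: the values oscillate and blow up
```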

In Summary,

Gradient Descent is an iterative algorithm that is used to minimize a cost function. This process finds use in many Machine Learning problems and is very much worth further study.

Some Notes to look into:

There are different types of Gradient Descent, based on the amount of data used when computing the gradients for each learning step (a sketch of the difference follows the list below);

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini-Batch Gradient Descent
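To give a feel for the difference, here is a rough sketch in Python of how the three variants differ only in how much data feeds each gradient step; the dataset and batch size are made up for illustration.

```python
import random

# The three variants differ only in how much data is used to compute the
# gradient at each step. The data and batch size below are illustrative.
xs = [float(x) for x in range(1, 101)]
ys = [2.0 * x + 1.0 for x in xs]

def sample_batch(xs, ys, batch_size):
    # Batch GD:      batch_size == len(xs)  (every example, every step)
    # Stochastic GD: batch_size == 1        (one random example per step)
    # Mini-batch GD: something in between, e.g. 16 or 32
    idx = random.sample(range(len(xs)), batch_size)
    return [xs[i] for i in idx], [ys[i] for i in idx]

batch_x, batch_y = sample_batch(xs, ys, batch_size=16)
# The gradients dm and dc would then be computed on batch_x and batch_y
# exactly as in the earlier loop, just with fewer points per step.
```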

Also, note that to keep this post in its most basic and understandable form, I used only one feature; Gradient Descent can also be applied to models with multiple features.

Another thing to look into is Normalisation: how large ranges of values for features (such as the size of the house in hundreds of square metres and the price of the house in millions of dollars) can affect gradient descent, and how Normalisation helps solve this problem.
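As a starting point, here is a minimal sketch of one common form of normalisation (min-max scaling); the house sizes are made-up numbers.

```python
# Min-max scaling: rescale a feature so every value lies between 0 and 1,
# which stops features with huge raw ranges from dominating the gradient steps.
sizes = [120.0, 250.0, 90.0, 400.0]  # house sizes in square metres (hypothetical)

lo, hi = min(sizes), max(sizes)
normalised = [(s - lo) / (hi - lo) for s in sizes]
print(normalised)  # every value now lies in [0, 1]
```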

And also look into plotting the cost against time (i.e. the number of iterations) to observe whether the learning rate is working well.
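A minimal sketch of that kind of plot, reusing the tiny dataset from earlier (matplotlib is assumed to be available):

```python
import matplotlib.pyplot as plt

# Track the cost at every iteration and plot it against the iteration number;
# a steadily falling curve suggests the learning rate is reasonable.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]
m, c, alpha, n = 0.0, 0.0, 0.01, len(xs)

costs = []
for _ in range(2000):
    preds = [m * x + c for x in xs]
    costs.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / n)  # current MSE
    dm = (-2.0 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
    dc = (-2.0 / n) * sum(y - p for y, p in zip(ys, preds))
    m, c = m - alpha * dm, c - alpha * dc

plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost (MSE)")
plt.show()
```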

References

  1. Andrew Ng Machine Learning Course on Coursera
  2. Machine Learning Medium — Gradient Descent
  3. KDnuggets Gradient Descent Concepts
  4. KDnuggets - Simple Gradient Descent
  5. YouTube — Gradient Descent
  6. Ashren Medium — Gradient Descent
  7. Machine Learning Mastery — Linear Regression for Machine Learning
