Machine learning: an introduction to mean squared error and regression lines

Machine learning is a rapidly growing field focused on enabling computers to learn patterns and relationships from data without being explicitly programmed. One of the most important and widely used types of machine learning is supervised learning, where the algorithm learns to map inputs to outputs based on labeled training data.

A key concept in supervised learning is regression, which involves predicting a continuous numeric value. Some common applications of regression include predicting housing prices, stock values, sales figures, or any other important number based on relevant input data.

At the core of regression is the concept of the line of best fit – the line that follows the general trend of the data and minimizes the prediction errors. This article will introduce the concept of mean squared error, a way to measure how well a line fits a set of data points, and show how to find the line that minimizes this error. Understanding these fundamental concepts is crucial for working with linear regression models in machine learning.

Mean Squared Error Definition

When fitting a line to a set of data points, we need some way to measure how well the line models the data. A common metric used is mean squared error (MSE). As the name implies, MSE measures the average squared difference between the actual data points and the predictions from the line:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

where:

  • $n$ is the number of data points
  • $y_i$ is the actual value of data point $i$
  • $\hat{y}_i$ is the predicted value of data point $i$ based on the line

So for each data point, we calculate the difference between the actual $y$ value and the $y$ value predicted by the line, square this difference, and then take the average of all these squared differences.
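
To make the formula concrete, here is a minimal sketch of MSE in Python (the function name and example values are purely illustrative):

```python
def mean_squared_error(y_actual, y_predicted):
    """Average squared difference between actual values and predictions."""
    n = len(y_actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted)) / n

# Actual y values and the predictions a candidate line makes for them
print(mean_squared_error([3, 5, 6, 8, 11], [3, 5, 7, 9, 11]))  # 0.4
```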

Squaring the errors has some important implications:

  1. It makes all errors positive, so negative and positive errors can't cancel each other out
  2. It weighs larger errors more heavily than smaller ones (e.g. an error of 2 is four times worse than an error of 1)

The goal is to find the line that minimizes MSE – in other words, the line that results in the smallest average squared error across all the data points. Let's visualize this concept.

Visualizing MSE

Consider the following data points: (1, 3), (2, 5), (3, 6), (4, 8), (5, 11)

We can plot these points on a 2D graph and draw a line of best fit:

The red line is our predicted line of best fit. For each data point, the light gray vertical line represents the error between the actual value and the value predicted by the red line.

MSE is calculated by measuring the length of each of those gray lines, squaring it, and averaging the results. The line that minimizes this average squared error is the line of best fit.

So how do we find that optimal line? A line is determined by two things: its slope $m$ and its y-intercept $b$. Our goal is to find the values of $m$ and $b$ that minimize MSE.
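
As a rough sketch of that search, the snippet below evaluates MSE on the five example points for a few arbitrary candidate values of $m$ and $b$ (chosen only for illustration):

```python
points = [(1, 3), (2, 5), (3, 6), (4, 8), (5, 11)]

def mse_for_line(m, b, data):
    """MSE of the line y = m*x + b over a list of (x, y) points."""
    return sum((y - (m * x + b)) ** 2 for x, y in data) / len(data)

# Lower MSE means the line fits the data better
for m, b in [(1.0, 2.0), (2.5, 0.0), (2.0, 0.5)]:
    print(f"m={m}, b={b}: MSE={mse_for_line(m, b, points):.2f}")
```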

Finding the Optimal Slope and Y-Intercept

Consider the general equation for a line:

$$ y = mx + b $$

where:

  • $y$ is the predicted value
  • $m$ is the slope of the line
  • $x$ is the input value
  • $b$ is the y-intercept

Given this, we can rewrite the equation for MSE:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - (mx_i + b))^2 $$

Our goal is to find the values of $m$ and $b$ that minimize this quadratic function. We can visualize MSE as a 3D surface where the horizontal axes represent possible values for $m$ and $b$, and the vertical axis is the resulting MSE value:

The lowest point on this surface corresponds to the line with the minimum MSE. To find that point, we can use calculus to determine where the partial derivatives of MSE with respect to $m$ and $b$ are both equal to zero.
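
For reference, setting those two partial derivatives to zero yields the so-called normal equations:

$$ \frac{\partial \, MSE}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i \left(y_i - (mx_i + b)\right) = 0 $$

$$ \frac{\partial \, MSE}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n} \left(y_i - (mx_i + b)\right) = 0 $$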

After solving those equations, we get the following results for the optimal values of the slope and y-intercept:

$$ m = r \frac{s_y}{s_x} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

$$ b = \bar{y} - m\bar{x} $$

where:

  • $\bar{x}$ is the mean of the $x$ values
  • $\bar{y}$ is the mean of the $y$ values
  • $s_x$ and $s_y$ are the standard deviations of $x$ and $y$
  • $r$ is the correlation between $x$ and $y$

Essentially, the optimal slope of the line depends on the correlation between $x$ and $y$ as well as the ratio of their standard deviations. The optimal y-intercept can then be calculated based on the slope and the means of $x$ and $y$.
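
The formulas above translate directly into code. Here is a minimal sketch in Python (the `fit_line` helper is illustrative, not taken from any library):

```python
def fit_line(data):
    """Return the slope m and intercept b that minimize MSE over (x, y) points."""
    n = len(data)
    x_mean = sum(x for x, _ in data) / n
    y_mean = sum(y for _, y in data) / n
    # Slope: covariance of x and y divided by the variance of x
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in data)
    denominator = sum((x - x_mean) ** 2 for x, _ in data)
    m = numerator / denominator
    b = y_mean - m * x_mean
    return m, b
```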

Numerical Example

Let's walk through the calculations with the example data from earlier: (1, 3), (2, 5), (3, 6), (4, 8), (5, 11)

First we calculate the relevant means and variances:

$$ \bar{x} = \frac{1+2+3+4+5}{5} = 3 $$
$$ \bar{y} = \frac{3+5+6+8+11}{5} = 6.6 $$
$$ s_x^2 = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5} = 2 $$
$$ s_y^2 = \frac{(3-6.6)^2 + (5-6.6)^2 + (6-6.6)^2 + (8-6.6)^2 + (11-6.6)^2}{5} = 7.44 $$

Then we can calculate the correlation:

$$ r = \frac{(1-3)(3-6.6) + (2-3)(5-6.6) + (3-3)(6-6.6) + (4-3)(8-6.6) + (5-3)(11-6.6)}{5 \sqrt{2} \sqrt{7.44}} = \frac{19}{5 \sqrt{2} \sqrt{7.44}} \approx 0.9851 $$

Plugging these values into the equations for slope and y-intercept:

$$ m = 0.9851 \cdot \frac{\sqrt{7.44}}{\sqrt{2}} \approx 1.9 $$
$$ b = 6.6 - 1.9 \cdot 3 = 0.9 $$

So the line of best fit for this data is:

$$ y = 1.9x + 0.9 $$

We can plot this line to verify visually that it fits the data well:
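
The same result can be reproduced with the illustrative `fit_line` helper sketched earlier:

```python
points = [(1, 3), (2, 5), (3, 6), (4, 8), (5, 11)]
m, b = fit_line(points)
print(m, b)  # roughly 1.9 and 0.9 (up to floating-point rounding)
```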

Applications and Limitations

Finding the line of best fit using mean squared error is the foundation of linear regression, one of the most widely used machine learning techniques. Minimizing MSE results in the line that best captures the linear trend in the data.

Linear regression is often used as a baseline model and can be surprisingly effective for many problems such as predicting housing prices based on square footage or estimating sales based on advertising spend. It's also very interpretable since the slope and y-intercept have clear meanings.

However, linear regression also has some key limitations to be aware of:

  1. It assumes the relationship between the input and output is linear. If the true relationship is nonlinear, linear regression may not capture it well.

  2. It is sensitive to outliers since squaring the errors weighs large errors much more heavily. A single outlier can significantly change the line, as illustrated in the sketch after this list.

  3. It assumes the input variables are independent. If inputs are highly correlated (multicollinear), the results can be unreliable.

  4. In high dimensions (many input variables), it can overfit the training data. Regularization techniques are often needed.
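
To illustrate the outlier sensitivity mentioned in point 2, the sketch below (again using the illustrative `fit_line` helper) refits the example data after replacing one point with an extreme value; the fitted slope and intercept change dramatically:

```python
clean = [(1, 3), (2, 5), (3, 6), (4, 8), (5, 11)]
with_outlier = [(1, 3), (2, 5), (3, 6), (4, 8), (5, 30)]  # one extreme y value

print(fit_line(clean))         # roughly (1.9, 0.9)
print(fit_line(with_outlier))  # slope roughly triples and the intercept becomes negative
```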

For these reasons, having a strong understanding of the data and the model assumptions is critical for successful application of linear regression. Knowing its limitations also motivates the use of more advanced machine learning techniques that can model complex nonlinear relationships.

Conclusion

In this article, we introduced the concept of mean squared error as a way to measure how well a line fits a set of data points. We saw how to derive the equations for the slope and y-intercept of the line that minimizes MSE, resulting in the line of best fit.

This line is the key result of simple linear regression, a foundational machine learning technique for modeling linear relationships between variables. While a powerful tool, linear regression makes strong assumptions that the data scientist must be aware of.

Gaining an intuition for these fundamental concepts is invaluable for further study of machine learning. Building on this foundation, we can explore more advanced regression techniques, regularization methods for preventing overfitting, and the wide world of nonlinear models. An understanding of MSE and regression provides the perfect starting point for this journey.
