Building and Training Linear and Logistic Regression Models in Python: A Comprehensive Guide

Linear and logistic regression are two foundational algorithms in the world of machine learning. Despite their simplicity, these techniques remain powerful tools in a data scientist's toolkit and are widely used to model relationships between variables and make predictions.

In this in-depth tutorial, we'll walk through how to implement linear and logistic regression models in Python. We'll cover everything from preparing your data to training and evaluating your models, with plenty of code examples along the way. By the end, you'll be equipped with the knowledge and skills to apply these techniques to your own datasets and prediction problems. Let's dive in!

Linear Regression 101

Linear regression models the relationship between a dependent variable y and one or more explanatory variables X. It finds the linear function that best fits the data by minimizing the sum of squared residuals between predicted and actual y values.
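
Concretely, ordinary least squares chooses the coefficients that minimize the residual sum of squares:

RSS = Σ (yi − ŷi)²

where ŷi is the model's prediction for observation i.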

Some common use cases of linear regression include:

  • Predicting sales revenue based on advertising spend
  • Modeling the relationship between years of experience and salary
  • Forecasting housing prices based on square footage

The simple yet powerful idea behind linear regression is that the dependent variable can be estimated using a linear combination of the independent variables:

y = β0 + β1X1 + β2X2 + … + βpXp

Where:

  • y is the dependent variable
  • X1, X2, …, Xp are the independent variables
  • β0, β1, …, βp are the regression coefficients that we want to estimate

So how do we actually build a linear regression model in Python? We'll use the popular scikit-learn library, which provides simple yet efficient tools for predictive data analysis. But first, we need some data!

Generating Sample Data

For this example, we'll generate a simple synthetic dataset with a single explanatory variable. Let's imagine we work for an ecommerce company and want to predict monthly sales (y) based on the number of website visitors (X).

Here's how we can generate this data in Python:

import numpy as np

n = 100
X = np.random.normal(100, 20, n).reshape(-1, 1)
y = 10 + 5*X + np.random.normal(0, 10, (n, 1))

In this code:

  • We generate 100 observations (n = 100)
  • For the explanatory variable X, we sample from a normal distribution with mean 100 and standard deviation 20
  • For y, we use the linear function y = 10 + 5*X and add some random noise

We now have a simple dataset with 100 observations. Let's plot it to visualize the relationship between X and y:

import matplotlib.pyplot as plt

plt.scatter(X, y)
plt.xlabel('Website Visitors')
plt.ylabel('Sales ($)')
plt.title('Sales vs Website Visitors')
plt.show()

[Scatter plot: Sales vs Website Visitors]

We can see there is a clear positive linear relationship – as the number of visitors increases, so do the sales. Our goal is to find the linear regression equation that best captures this relationship.

Building the Model

With our data ready, we can now build the linear regression model in a few lines of code using scikit-learn. First, let's split our data into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This splits our data into 80% training and 20% test sets. The training set will be used to fit the model, while the test set will be used for evaluation.

Next, we import the LinearRegression class and create an instance of the model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

Finally, we train (fit) the model on the training data:

model.fit(X_train, y_train)

That's it! We've built our linear regression model. Let's take a look at the model's learned coefficients:

print(model.coef_)       # Output: [[4.95]]
print(model.intercept_)  # Output: [11.79]

The model has learned that the regression equation is roughly: sales = 11.79 + 4.95 × website_visitors. This aligns well with the true equation we used to generate the data (y = 10 + 5X).

Making Predictions and Evaluating Performance

With our trained model in hand, we can make predictions on new data. Let's use our test set:

y_pred = model.predict(X_test)

To evaluate how well our model is performing, we can calculate metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE) and R-squared:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('R-squared:', r2_score(y_test, y_pred))

On our test set, we get:

MAE: 10.03
MSE: 143.34
R-squared: 0.91

The R-squared value of 0.91 indicates that our simple model is able to explain 91% of the variance in the sales data. Not bad for a few lines of code!

We can also plot the actual vs predicted values to visually assess the model's performance:

plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # reference line where predicted = actual
plt.xlabel('Actual Sales ($)')
plt.ylabel('Predicted Sales ($)')
plt.title('Actual vs Predicted Sales')
plt.show()

[Scatter plot: Actual vs Predicted Sales]

The data points fall close to the diagonal line, indicating the predictions are tracking well with the actual values.

Extending to Multiple Linear Regression

In the previous example, we only used a single explanatory variable (website visitors) to predict sales. But what if we have multiple variables that could impact sales? This is where multiple linear regression comes in.

The process of building a multiple linear regression model is very similar – you simply need to provide a matrix of explanatory variables (X) instead of a single vector.

Let's say that in addition to website visitors, we also had data on the amount spent on online advertising and the average time spent on the website. Our input data matrix X would look like:

X = [[150, 500, 60],
     [200, 800, 90],
     [100, 200, 30],
     ...]

Where each row represents an observation and the columns are visitors, ad spend, and average time on site respectively.

The code to build and train the model using this data would be exactly the same! The only difference is that the learned model coefficients would be a vector instead of a single value.
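
As a quick sketch, here is what fitting such a model could look like. The feature rows are the hypothetical visitors, ad spend, and time-on-site values from above, and the sales figures are invented purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: website visitors, ad spend ($), average time on site (seconds)
X_multi = np.array([[150, 500, 60],
                    [200, 800, 90],
                    [100, 200, 30],
                    [180, 650, 75],
                    [120, 300, 45]])
y_multi = np.array([900, 1400, 550, 1200, 700])  # hypothetical monthly sales

model = LinearRegression()
model.fit(X_multi, y_multi)
print(model.coef_)       # now a vector with one coefficient per feature
print(model.intercept_)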

Feature Selection and Regularization

In a real-world setting, you'll likely have many potential explanatory variables to choose from. Including irrelevant features can lead to overfitting, reduced model interpretability, and increased computational cost.

Feature selection techniques aim to identify the most relevant variables to include in your model. Some common approaches include:

  • Backward elimination: Start with all features and iteratively remove the least significant ones
  • Forward selection: Start with no features and iteratively add the most significant ones
  • Recursive Feature Elimination (RFE): Recursively remove features and rebuild the model on the remaining attributes (a minimal sketch follows below)
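
Scikit-learn provides Recursive Feature Elimination out of the box. A minimal sketch, assuming X_train and y_train come from a dataset with several features:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the weakest features until only two remain
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_train, y_train)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected; larger numbers were eliminated earlier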

Another approach to prevent overfitting is regularization. Techniques like Lasso (L1) and Ridge (L2) regression introduce a penalty term to the loss function that shrinks model coefficients towards zero. This can help create simpler, more generalizable models.

To use L2 regularization with scikit-learn, you can use the Ridge class instead of LinearRegression:

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

The alpha parameter controls the strength of regularization. Higher values lead to more shrinkage of coefficients.
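
L1 regularization works the same way through the Lasso class; the alpha value of 0.1 below is just an illustrative starting point:

from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
print(model.coef_)  # L1 regularization can shrink some coefficients exactly to zero

Because L1 regularization can zero out coefficients entirely, Lasso also doubles as a rough form of feature selection.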

Logistic Regression for Classification

While linear regression is used to predict continuous values, logistic regression is used for classification tasks where we want to predict a binary outcome (yes/no, true/false, 1/0).

Some examples of problems well suited for logistic regression are:

  • Predicting if an email is spam or not
  • Classifying a tumor as malignant or benign
  • Determining if a customer will churn or not

Under the hood, logistic regression uses the logistic (sigmoid) function to squash the output of a linear equation between 0 and 1. This output can be interpreted as the probability of the positive class.
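
In other words, the model computes the same kind of linear combination as before (z = β0 + β1X1 + … + βpXp) and then passes it through the sigmoid σ(z) = 1 / (1 + e^(−z)). A tiny hand-rolled illustration (scikit-learn does this internally and exposes the resulting probabilities through predict_proba):

import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

print(sigmoid(-3))  # ~0.05 -> strong vote for the negative class
print(sigmoid(0))   # 0.5  -> right on the decision boundary
print(sigmoid(3))   # ~0.95 -> strong vote for the positive class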

Titanic Survival Prediction

To illustrate logistic regression in action, let's work with the well-known Titanic dataset. This dataset contains information about passengers on the Titanic and whether they survived. Our goal is to build a model that predicts a passenger's survival based on features like age, sex, and passenger class.

First, let's load the data and take a peek:

import pandas as pd

df = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
df.head()

[Output: first few rows of the Titanic DataFrame]

The 'Survived' column is our target variable – 1 indicates the passenger survived, 0 means they did not.

Data Preparation

Before we can build our model, we need to prepare the data. This involves:

  1. Handling missing values
  2. Converting categorical variables to numeric
  3. Splitting into train and test sets

Let's start by checking for missing values:

df.isnull().sum()

[Output: count of missing values per column]

There are quite a few missing values in the Age, Cabin and Embarked columns. For simplicity, we'll impute the missing ages with the median and drop those two columns, along with identifier-style text columns such as Name that the model can't use directly as numeric features.

df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop sparse and non-numeric columns; errors='ignore' skips any that aren't present in this copy of the dataset
df.drop(columns=['Cabin', 'Embarked', 'Name', 'Ticket', 'PassengerId'], errors='ignore', inplace=True)

Next, we need to convert the categorical Sex column to numeric. We can do this using pandas' get_dummies function:

df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df.head()

[Output: DataFrame with the Sex column one-hot encoded]

With drop_first=True, the single remaining dummy column (Sex_male) encodes 'male' as 1 and 'female' as 0.

Finally, let's split our data into training and test sets and separate the features (X) from the target variable (y):

from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Building and Evaluating the Model

With our data prepared, we can now train our logistic regression model:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

Let's use our model to make predictions on the test set and evaluate performance:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))

Output:

Accuracy: 0.79
Confusion Matrix:
 [[97 12]
 [26 44]]
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.89      0.84       109
           1       0.79      0.63      0.70        70

    accuracy                           0.79       179
   macro avg       0.79      0.76      0.77       179
weighted avg       0.79      0.79      0.78       179

Our simple logistic regression model achieves an accuracy of 79% on the test set. The confusion matrix provides more detail on the model's performance:

  • 97 passengers correctly predicted as not survived (true negatives)
  • 44 passengers correctly predicted as survived (true positives)
  • 12 passengers predicted as survived but did not (false positives)
  • 26 passengers predicted as not survived but did (false negatives)

The classification report provides additional metrics like precision, recall and F1-score for each class.
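
In terms of the confusion matrix counts above:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)

For the survived class, for example, precision = 44 / (44 + 12) ≈ 0.79 and recall = 44 / (44 + 26) ≈ 0.63, which matches the report.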

There is certainly room for improvement in this model. Some next steps could be:

  • Feature engineering: Creating new features like family size, title, etc.
  • Hyperparameter tuning: Trying different values for the regularization strength C (a quick sketch follows after this list)
  • Trying other algorithms: Decision trees, random forests, etc.
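
As a starting point for the tuning idea, here is a minimal sketch using GridSearchCV; the candidate values for C are illustrative, and smaller values of C mean stronger regularization:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)   # best regularization strength found by cross-validation
print(grid.best_score_)    # mean cross-validated accuracy for that setting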

Wrapping Up

In this tutorial, we've covered a lot of ground! We've seen how to implement both linear and logistic regression models in Python using scikit-learn. We've generated our own data for linear regression and worked with the classic Titanic dataset for logistic regression.

Along the way, we've covered important concepts like:

  • Splitting data into training and test sets
  • Handling missing values and categorical variables
  • Evaluating model performance using metrics like R-squared, confusion matrix, etc.
  • Regularization techniques to prevent overfitting

I encourage you to apply what you've learned here to your own datasets and problems. Experiment with different features, tweak hyperparameters, and see how you can improve your model's performance. Happy coding!

Resources for Further Learning

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurelien Geron
  • An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
  • Coursera's Machine Learning course by Andrew Ng
  • scikit-learn documentation: https://scikit-learn.org/
