How to Use Linear Models and Decision Trees in Julia

As a data scientist or machine learning engineer, one of the most critical choices you face is selecting the appropriate model for a given problem. Should you try a complex deep learning architecture? Or will a simpler model like linear regression or a decision tree suffice and be more interpretable? While experimentation is often necessary to find the best approach, understanding the strengths and weaknesses of different model families can guide your decisions.

In this article, we'll take a deep dive into two foundational supervised learning techniques: linear models and decision trees. We'll explore when to use each one, review their mathematical formulations, and walk through examples of training and evaluating these models in Julia. By the end, you'll have a solid grasp of these core ML concepts and be well-equipped to apply them in your own projects.

Linear Models: Predicting Outcomes with Straight Lines

Linear models are a class of models that assume a linear relationship between the input features and outcome variable. The most well-known type is linear regression, where the goal is to predict a continuous numerical value. For example, you might use linear regression to model the relationship between a house's square footage and its sale price, or a student's number of hours studied and their exam score.

Simple Linear Regression

In simple linear regression, there is a single input feature (independent variable) used to predict the outcome (dependent variable). Mathematically, the model can be expressed as:

y = β₀ + β₁x + ε

where:

  • y is the outcome variable
  • x is the input feature
  • β₀ is the y-intercept (value of y when x = 0)
  • β₁ is the slope coefficient (change in y for a one unit increase in x)
  • ε is the error term (difference between predicted and actual y)

The model's goal is to find the values of β₀ and β₁ that minimize the sum of squared errors (SSE) over the training data:

SSE = Σ(yᵢ – ŷᵢ)²

where yᵢ is the actual outcome value and ŷᵢ is the predicted value for the i-th observation.

This approach is known as ordinary least squares (OLS) regression. The closed-form solution for the OLS coefficients is:

β̂ = (XᵀX)⁻¹Xᵀy

where X is the matrix of input features (with a column of 1s prepended for the intercept term) and y is the vector of outcome values.
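
To make the closed-form solution concrete, here is a minimal sketch that computes the OLS coefficients via the normal equations on synthetic data (the data and coefficient values below are made up for illustration):

using Random

Random.seed!(42)
n = 100
x = 10 .* rand(n)                    # a single input feature
y = 3.0 .+ 2.5 .* x .+ randn(n)      # true intercept 3.0, slope 2.5, plus noise

X = [ones(n) x]                      # design matrix with a column of 1s for the intercept
β̂ = (X'X) \ (X'y)                    # normal equations; should recover ≈ [3.0, 2.5]

In practice, the numerically stabler X \ y (a QR-based least-squares solve) is preferred over forming XᵀX explicitly, and packages like GLM.jl handle this for you.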

Multiple Linear Regression

In multiple linear regression, there are two or more input features. The model takes the form:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

where xᵢ is the i-th input feature and βᵢ is its corresponding coefficient.

The OLS solution remains the same as in simple regression, with X now containing p columns (plus the intercept).

Assumptions of Linear Regression

For the OLS estimates to be unbiased and efficient, and for the usual t-tests and confidence intervals to be valid, several assumptions must hold:

  1. Linearity: The relationship between features and outcome is linear.
  2. Independence: The observations are independently sampled.
  3. Homoscedasticity: The variance of the errors is constant across all levels of the features.
  4. Normality: The errors are normally distributed with mean zero.

Violations of these assumptions can lead to biased coefficient estimates, misleading inference, and poor model performance. It's important to check diagnostics like residual plots and Q-Q plots to assess if the assumptions are reasonable.

Fitting Linear Models in Julia

Let's see how to fit a multiple regression model in Julia using the GLM.jl package. We'll use the mtcars dataset and predict miles per gallon (MPG) from engine displacement (Disp), horsepower (HP), and weight (WT).

First, load the necessary packages and data:

using GLM, DataFrames, RDatasets

mtcars = dataset("datasets", "mtcars")

Next, fit the multiple regression model:

ols = lm(@formula(MPG ~ Disp + HP + WT), mtcars)   # RDatasets capitalizes column names (WT = weight)

We can extract the coefficient estimates and their standard errors:

coef(ols)
stderror(ols)

To assess the model's overall fit, we can look at the R² value and the F-statistic against the intercept-only model:

r2(ols)
ftest(ols.model)   # depending on your GLM.jl version, ftest(ols) may work directly

The R² indicates the proportion of variance in the outcome explained by the features, while the F-statistic tests the null hypothesis that all coefficients (except the intercept) are zero.

We can also compute predictions on new data:

new_data = DataFrame(Disp=[160, 200], HP=[110, 130], WT=[2.5, 3.0])
predict(ols, new_data)
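
To check the assumptions discussed earlier, a residual plot is a good first diagnostic. Here is a minimal sketch using Plots.jl and the fitted ols model from above:

using Plots

resid = residuals(ols)      # observed minus fitted values
fit_vals = fitted(ols)      # model predictions on the training data

scatter(fit_vals, resid, xlabel="Fitted values", ylabel="Residuals", legend=false)
hline!([0])                 # residuals should scatter evenly around zero

A funnel shape in this plot suggests heteroscedasticity, while a curved pattern suggests the linearity assumption is violated.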

Regularized Linear Models

One issue with OLS regression is that it can overfit the training data, especially when there are many features relative to the number of observations. Regularization techniques like ridge regression and lasso add a penalty term to the OLS objective to constrain the coefficient estimates.

Ridge regression adds an L2 penalty, which shrinks the coefficients towards zero:

minimize: SSE + λ Σ βⱼ²

where λ controls the strength of regularization.

Lasso (least absolute shrinkage and selection operator) uses an L1 penalty:

minimize: SSE + λ Σ |βⱼ|

The L1 penalty has the effect of setting some coefficients exactly to zero, performing automatic feature selection.

These regularized models can be fit in Julia using the glmnet function from GLMNet.jl:

using GLMNet

X = Matrix(select(mtcars, Not([:Model, :MPG])))   # drop the outcome and the non-numeric Model column
y = mtcars.MPG

ridge_model = glmnet(X, y, alpha=0)
lasso_model = glmnet(X, y, alpha=1)

The alpha parameter controls the type of regularization, with 0 corresponding to ridge and 1 to lasso.
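
In practice, λ is chosen by cross-validation. GLMNet.jl provides glmnetcv for this; the sketch below follows the field names used in the package's README (meanloss, lambda, path.betas), which may differ across versions:

cv = glmnetcv(X, y, alpha=1)        # 10-fold cross-validation over the lasso path
best = argmin(cv.meanloss)          # index of the λ with the lowest CV error
cv.lambda[best]                     # selected λ
cv.path.betas[:, best]              # coefficients at the selected λ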

Decision Trees: Modeling Complex Relationships

Decision trees are a non-parametric supervised learning method that can be used for both regression and classification. They work by recursively partitioning the feature space into distinct regions and making a prediction for each region.

The tree is built through a greedy process called recursive binary splitting. At each step, the algorithm selects the feature and split point that result in the greatest reduction in impurity, a measure of how mixed the outcomes are within each resulting subset. This process continues until a stopping criterion is met, such as reaching a maximum depth or a minimum number of observations per leaf.

For regression trees, the prediction for each leaf is the mean of the outcome values in that region. For classification trees, it is the majority class.
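
To build intuition, a fitted tree is just a set of nested if-else rules. The sketch below is a hypothetical depth-2 regression tree for house prices; the split points and leaf values are made up for illustration:

# A hypothetical depth-2 regression tree written as nested if-else rules.
# Each leaf returns the mean sale price of the training observations in that region.
function toy_tree_predict(sqft, bedrooms)
    if sqft < 1500
        return bedrooms < 3 ? 180_000.0 : 210_000.0
    else
        return bedrooms < 4 ? 320_000.0 : 410_000.0
    end
end

toy_tree_predict(1200, 2)   # 180000.0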

Strengths of Decision Trees

Decision trees have several advantages compared to linear models:

  1. Non-linearity: They can capture complex non-linear relationships between features and the outcome. Each split can be thought of as an if-else condition.

  2. Interpretability: The learned tree structure can be easily visualized and understood. Each prediction can be traced back to a series of split decisions.

  3. Automatic feature selection: The recursive splitting process naturally selects the most informative features.

  4. Minimal preprocessing: Features don't need to be scaled or transformed, and many implementations can handle missing values without imputation.

Impurity Measures

The choice of impurity measure determines how the tree is built. For classification, common measures are Gini impurity and entropy.

Gini impurity for a node t is defined as:

G(t) = 1 – Σ p(j|t)²

where p(j|t) is the proportion of observations in class j at node t.

Entropy is defined as:

H(t) = – Σ p(j|t) log₂ p(j|t)

For regression, the impurity measure is typically mean squared error (MSE):

MSE(t) = (1 / N) Σ (yᵢ – ȳ)²

where N is the number of observations in node t, yᵢ are their outcome values, and ȳ is their mean.
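
These measures are easy to compute by hand. Here is a minimal sketch of Gini impurity and entropy for a vector of class labels (the function names are my own, not part of DecisionTree.jl):

# Class proportions p(j|t) for the labels reaching a node
proportions(labels) = [count(==(c), labels) / length(labels) for c in unique(labels)]

gini_impurity(labels) = 1 - sum(p^2 for p in proportions(labels))
node_entropy(labels)  = -sum(p * log2(p) for p in proportions(labels) if p > 0)

node = ["setosa", "setosa", "versicolor", "virginica"]
gini_impurity(node)   # 0.625
node_entropy(node)    # 1.5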

Building Decision Trees in Julia

Let's build a decision tree classifier for the iris dataset using the DecisionTree.jl package:

using DecisionTree, RDatasets

iris = dataset("datasets", "iris")
features = Matrix(iris[:, 1:4])
labels = string.(iris.Species) # convert levels to strings

model = build_tree(labels, features)

To inspect the learned tree structure, we can print it to the console:

print_tree(model, 5)   # show splits and leaves down to a depth of 5

Each internal node is printed as a feature/threshold split, and each leaf shows the predicted class along with how many of the training observations reaching it belong to that class.

We can assess the model's accuracy on a held-out test set. DecisionTree.jl doesn't ship a train/test split helper, so we shuffle the rows and slice them ourselves:

using Random

Random.seed!(1)                               # reproducible split
idx = shuffle(1:length(labels))
cut = round(Int, 0.7 * length(labels))        # 70% train / 30% test
train_features, train_labels = features[idx[1:cut], :], labels[idx[1:cut]]
test_features, test_labels = features[idx[cut+1:end], :], labels[idx[cut+1:end]]

model = build_tree(train_labels, train_features)
preds = apply_tree(model, test_features)
accuracy = sum(preds .== test_labels) / length(test_labels)
println("Accuracy: $(accuracy)")

Feature Importance

Decision trees provide a natural way to assess feature importance based on how much each feature contributes to reducing impurity across all splits. In recent versions of DecisionTree.jl, these importances are available through the impurity_importance function:

importances = impurity_importance(model)
println(importances)

This returns a vector with an importance score for each feature, reflecting the total impurity reduction attributable to splits on that feature.

We can also visualize the importances using a bar plot:

using Plots

bar(importances, 
    xticks=(1:4, ["Sepal length", "Sepal width", "Petal length", "Petal width"]),
    xlabel="Feature", ylabel="Importance", legend=false)


Limitations and Extensions

While powerful, a single decision tree is prone to overfitting and high variance. Two popular extensions that address these issues are random forests and gradient boosted trees.

Random forests build an ensemble of trees, each fit on a bootstrap sample of the data and a random subset of features. The final prediction is the majority vote (classification) or average (regression) of the individual trees. This reduces overfitting and improves generalization.
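
DecisionTree.jl implements random forests through build_forest and apply_forest. Here is a minimal sketch reusing the train/test split from above; using 2 candidate features per split and 100 trees is an illustrative choice, not a tuned setting:

# build_forest(labels, features, n_subfeatures, n_trees, partial_sampling, max_depth)
forest = build_forest(train_labels, train_features, 2, 100, 0.7, -1)

forest_preds = apply_forest(forest, test_features)
forest_accuracy = sum(forest_preds .== test_labels) / length(test_labels)
println("Forest accuracy: $(forest_accuracy)")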

Gradient boosting iteratively fits a sequence of shallow trees, each attempting to correct the errors of the previous trees. The final prediction is a weighted sum of all the tree outputs. XGBoost is a highly optimized implementation of gradient boosting that has been widely successful in data science competitions; in Julia it is available through the XGBoost.jl wrapper package.

Conclusion

In this article, we explored two fundamental machine learning techniques: linear models and decision trees. We saw that linear models are effective when the relationship between features and outcome is approximately linear, while decision trees can capture more complex non-linear patterns.

Some key takeaways:

  • Always visualize your data to check for linearity before using a linear model
  • Consider regularized linear models like ridge or lasso when there are many features
  • Decision trees are interpretable and require minimal preprocessing
  • Random forests and gradient boosting can improve the performance of individual trees

Julia's GLM.jl, GLMNet.jl, and DecisionTree.jl packages provide a straightforward interface for applying these models to real datasets.

Of course, we've only scratched the surface of these rich topics. To dive deeper, check out these additional resources:

  • An Introduction to Statistical Learning (book)
  • The Elements of Statistical Learning (book)
  • Decision Trees (scikit-learn documentation)
  • Random Forests (Berkeley CS 189 lecture)

Happy modeling!
