Machine Learning Directly in SQL – How to Use ML in Databases

As a full-stack developer, you're likely very familiar with using SQL databases like PostgreSQL, MySQL, and SQL Server to store and query data. But did you know that you can also use these databases to build, train, and make predictions with machine learning models?

That's right – many popular SQL databases now have built-in machine learning capabilities that allow you to do ML directly where your data resides, without having to move it to a separate tool. In this article, I'll show you how it works. We'll cover the advantages of this approach, which databases support it, and walk through hands-on examples of building ML models in SQL. Let's dive in!

Why Machine Learning in SQL Databases?

Traditionally, the process of building machine learning models has involved extracting data from databases, loading it into dedicated ML tools, training models there, and then deploying the models separately. While this approach works, it introduces a lot of complexity in terms of data movement, transformations, deployment, and management of separate ML infrastructure.

More recently, SQL databases have begun offering built-in machine learning capabilities. This has several major advantages:

  • Simplicity: Your data is already in the database, so you can build and train models directly on it without moving or transforming it. This simplifies data workflows.

  • Performance: Since you're not moving data out of the database, you avoid the time and overhead of data transfer. The database can also optimize execution of ML training and inference queries.

  • Management: ML models can be stored as database objects, inheriting the database's management capabilities around access control, backup/restore, high availability, etc. There's no need for separate ML infrastructure.

  • Real-time predictions: With ML models living directly in the database, applications can easily use them to make real-time predictions on the latest data with simple SQL queries.
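
As a concrete example, with PostgreSQL and the MADlib extension (which we use later in this article), an application could score a single record at request time with one ordinary parameterized query. This is only a sketch: customers and churn_model are hypothetical tables, and churn_model is assumed to already hold a trained logistic regression model.

SELECT madlib.logregr_predict_prob(
  m.coef,                                -- Trained model coefficients
  ARRAY[1, c.age, c.income, c.activity]  -- Features for this customer
) AS churn_probability
FROM customers c, churn_model m
WHERE c.id = $1;  -- Parameter bound by the application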

Machine Learning Capabilities by Database

Here's an overview of the machine learning capabilities offered by some of the most popular SQL databases:

  • PostgreSQL: Offers an extension called MADlib that provides a suite of ML algorithms including linear and logistic regression, k-means, decision trees, random forests, and more.

  • MySQL: The HeatWave AutoML offering for MySQL HeatWave provides in-database training and inference through stored procedures such as ML_TRAIN and ML_PREDICT_TABLE, with automatic algorithm selection and tuning. Predictions can be made with SQL queries.

  • SQL Server: Provides ML Services that allow training models using popular languages like R and Python, with both open-source and Microsoft algorithms. Includes real-time scoring and native T-SQL prediction functions.

  • Oracle: The Oracle Machine Learning product offers in-database algorithms for regression, classification, clustering, anomaly detection, and more – using SQL, R, and Python APIs.

Other systems, including IBM Db2, SAP HANA, Google BigQuery (BigQuery ML), and Amazon Redshift (Redshift ML), offer similar ML capabilities. The key is that all of these allow building and using ML models with SQL skills, without separate ML tools.
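
To give a flavor of the syntax outside PostgreSQL, BigQuery ML trains a model with a CREATE MODEL statement and scores with ML.PREDICT. A minimal sketch, where mydataset.housing and mydataset.new_listings are hypothetical tables:

-- Train a linear regression model on existing data
CREATE OR REPLACE MODEL mydataset.housing_model
OPTIONS (model_type = 'linear_reg', input_label_cols = ['price']) AS
SELECT sqft, price
FROM mydataset.housing;

-- Score new rows with the trained model
SELECT *
FROM ML.PREDICT(
  MODEL mydataset.housing_model,
  (SELECT sqft FROM mydataset.new_listings)
);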

Hands-On Examples

Now let's walk through a few examples of building machine learning models in SQL databases. We'll use PostgreSQL and its MADlib extension here, but the general concepts apply to other databases as well. You'll just need to look up the specific syntax.

Linear Regression

Let's start with a simple linear regression model to predict housing prices based on square footage.

First, assume we have a table called housing with the following schema:

CREATE TABLE housing (
  id SERIAL PRIMARY KEY,
  sqft INTEGER,
  price INTEGER
);
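
To make the example concrete, here are a few illustrative rows (the values are made up). Rows with a NULL price are the ones we will later ask the model to predict; MADlib ignores rows with NULL values when training:

INSERT INTO housing (sqft, price) VALUES
  (850,  190000),
  (1100, 248000),
  (1400, 305000),
  (1800, 379000),
  (2100, NULL),  -- price unknown; to be predicted
  (1250, NULL);  -- price unknown; to be predicted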

We can train a linear regression model on this data and store it in the database using MADlib's linregr_train function. MADlib creates the accompanying summary table automatically, naming it after the model table with an _summary suffix:

DROP TABLE IF EXISTS housing_model, housing_model_summary;

SELECT madlib.linregr_train(
  'housing',        -- Source table
  'housing_model',  -- Output model table
  'price',          -- Target (dependent) column
  'ARRAY[1, sqft]'  -- Predictors; the leading 1 fits an intercept
);

This fits an ordinary least squares model using sqft as the predictor variable and price as the target; since linear regression has a closed-form solution, no iteration count is needed. The resulting coefficients are stored in the housing_model table, with training metadata in housing_model_summary.
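
Since the model is just a table, you can inspect it with ordinary SQL. For example, assuming MADlib's standard linear regression output columns, the coefficients and goodness of fit are one query away:

-- coef holds the [intercept, sqft] coefficients; r2 is the R-squared of the fit
SELECT coef, r2, p_values
FROM housing_model;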

We can then use the model to make predictions on new data using the linregr_predict function:

SELECT
  id,
  sqft,
  madlib.linregr_predict(m.coef, ARRAY[1, sqft]) AS predicted_price
FROM
  housing h,
  housing_model m
WHERE
  price IS NULL;

This applies the model to rows where price is currently NULL, returning a predicted_price column with the model's estimate. Note that linregr_predict takes the coefficient array first, followed by the predictor array.
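
Before relying on those estimates, it is worth checking the fit on rows where the true price is known. A quick residual check might look like this:

-- Compare predicted to actual prices on labeled rows
SELECT
  sqft,
  price,
  madlib.linregr_predict(m.coef, ARRAY[1, sqft]) AS predicted_price,
  price - madlib.linregr_predict(m.coef, ARRAY[1, sqft]) AS residual
FROM
  housing h,
  housing_model m
WHERE
  price IS NOT NULL;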

Logistic Regression

For a classification example, let's build a logistic regression model to predict customer churn based on their demographics and activity.

Assume we have a table called customers with the following schema:

CREATE TABLE customers (
  id SERIAL PRIMARY KEY,  
  age INTEGER,
  income INTEGER,
  activity INTEGER,
  churned INTEGER
);

The churned column is 1 if the customer churned (left the company) or 0 if they did not.

We can train a logistic regression churn prediction model using MADlib like this. Note that logregr_train expects the target to be a boolean expression, so we pass churned = 1:

DROP TABLE IF EXISTS churn_model, churn_model_summary;

SELECT madlib.logregr_train(
  'customers',                       -- Source table
  'churn_model',                     -- Output model table
  'churned = 1',                     -- Target; a boolean expression
  'ARRAY[1, age, income, activity]', -- Predictors; the leading 1 fits an intercept
  NULL,                              -- Grouping columns
  20                                 -- Max number of iterations
);

This trains the model using age, income, and activity as predictor variables to estimate the probability of churn.
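
The fitted model is queryable in the same way. Assuming MADlib's standard logistic regression output columns, this pairs each coefficient with a hand-written label plus its odds ratio and p-value for interpretation:

-- The labels are written by hand to match the predictor array order
SELECT
  unnest(ARRAY['intercept', 'age', 'income', 'activity']) AS feature,
  unnest(coef) AS coefficient,
  unnest(odds_ratios) AS odds_ratio,
  unnest(p_values) AS p_value
FROM churn_model;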

We can then apply the model to new data to predict churn probability per customer:

SELECT
  id,
  age,
  income,
  activity,
  madlib.logregr_predict_prob(m.coef, ARRAY[1, age, income, activity]) AS churn_probability
FROM
  customers c,
  churn_model m
WHERE
  churned IS NULL;

The churn_probability column will contain the model's estimated probability that each customer will churn, based on their attributes.
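
In practice you usually want a shortlist rather than raw scores. A simple follow-up filters and ranks customers above a chosen cutoff; the 0.8 threshold below is an arbitrary assumption:

-- Surface the highest-risk customers; 0.8 is an arbitrary cutoff
SELECT
  id,
  madlib.logregr_predict_prob(m.coef, ARRAY[1, age, income, activity]) AS churn_probability
FROM
  customers c,
  churn_model m
WHERE
  churned IS NULL
  AND madlib.logregr_predict_prob(m.coef, ARRAY[1, age, income, activity]) > 0.8
ORDER BY
  churn_probability DESC;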

K-Means Clustering

As a final example, let's look at an unsupervised ML technique – k-means clustering. We can use this to segment customers based on their attributes.

Using the same customers table as before, we can build a k-means model with MADlib's kmeanspp function (k-means with k-means++ seeding), saving the returned centroids into a table:

DROP TABLE IF EXISTS cust_cluster_model;

CREATE TABLE cust_cluster_model AS
SELECT * FROM madlib.kmeanspp(
  'customers',                                        -- Source table
  'ARRAY[age, income, activity]::double precision[]', -- Point expression
  4,                                                  -- Number of clusters (k)
  'madlib.squared_dist_norm2',                        -- Distance function
  'madlib.avg',                                       -- Centroid aggregate
  20,                                                 -- Max number of iterations
  0.001                                               -- Convergence: min fraction of points reassigned
);

This partitions the customers into 4 clusters based on their age, income, and activity level, storing the cluster centroids (along with convergence statistics) in cust_cluster_model.

We can then assign each customer to the nearest centroid using MADlib's closest_column helper and characterize the average attributes per cluster:

-- Assign customers to clusters
SELECT
  c.id,
  c.age,
  c.income,
  c.activity,
  (madlib.closest_column(
    m.centroids,
    ARRAY[c.age, c.income, c.activity]::double precision[]
  )).column_id AS cluster
FROM
  customers c,
  cust_cluster_model m
ORDER BY
  cluster;

-- Characterize clusters
SELECT
  cluster,
  COUNT(*) AS size,
  AVG(age) AS avg_age,
  AVG(income) AS avg_income,
  AVG(activity) AS avg_activity
FROM (
  SELECT
    c.*,
    (madlib.closest_column(
      m.centroids,
      ARRAY[c.age, c.income, c.activity]::double precision[]
    )).column_id AS cluster
  FROM
    customers c,
    cust_cluster_model m
) AS assigned
GROUP BY
  cluster;

This gives us the cluster assignment per customer as well as the typical profile of each customer segment, which we could use for targeted marketing, product recommendations, etc.
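
To sanity-check how cleanly the segments separate, MADlib also ships a simple_silhouette helper (scores closer to 1 indicate tighter clusters). A sketch, assuming the same point expression and distance function used in training:

SELECT * FROM madlib.simple_silhouette(
  'customers',                                        -- Source table
  'ARRAY[age, income, activity]::double precision[]', -- Point expression
  (SELECT centroids FROM cust_cluster_model),         -- Trained centroids
  'madlib.squared_dist_norm2'                         -- Distance function
);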

Best Practices

When building machine learning models in SQL databases, here are a few best practices to keep in mind:

  • Store models as first-class database objects for easy management and integration with applications
  • Validate and clean your data before using it to train models
  • Normalize or standardize numeric variables
  • One-hot encode categorical variables
  • Use feature selection techniques to choose the most relevant variables
  • Split data into training, validation, and test sets (a split sketch follows this list)
  • Use cross-validation and hyperparameter tuning to optimize model performance
  • Monitor model performance over time and retrain periodically on new data
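
For the split in particular, keying on a stable column rather than calling random() keeps the split deterministic across runs. A minimal sketch, assuming id is a stable primary key:

-- Deterministic ~80/20 train/test split keyed on id
CREATE TABLE customers_train AS
SELECT * FROM customers WHERE id % 5 <> 0;

CREATE TABLE customers_test AS
SELECT * FROM customers WHERE id % 5 = 0;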

Limitations

While machine learning in SQL databases is powerful, it's not suitable for every use case. Some limitations to be aware of:

  • The types of algorithms available may be limited compared to dedicated ML tools
  • Deep learning capabilities are usually not included
  • Scaling to massive datasets can be challenging
  • Preprocessing and feature engineering are often easier in languages like Python or R
  • More advanced techniques, such as streaming or graph-based ML, are often not supported

For these reasons, a combined approach is often best – use built-in ML capabilities in the database when you can, but don't be afraid to leverage dedicated ML platforms when needed for more complex use cases. The key is using the right tool for the job.

Conclusion

In this article, we've seen how SQL databases are becoming powerful machine learning platforms. By providing built-in capabilities for building and using ML models with simple SQL syntax, databases allow developers to apply ML directly where their data lives – simplifying data workflows, improving performance, and enabling real-time predictions.

While not suitable for every ML use case, these capabilities make machine learning much more accessible to developers with SQL skills. This opens up opportunities to embed ML in applications more easily than ever before.

So give it a try! Check out the machine learning features in your database of choice, and see how you can start applying ML to your data directly in SQL. You may be surprised how far you can get without ever leaving the comfort of your database.
