How to Build Powerful Classification Models the Easy Way with Automated Machine Learning

Machine learning is one of the hottest areas in technology today, with applications ranging from facial recognition to fraud detection to medical diagnosis. At its core, machine learning is all about using data to train models that can learn patterns and make predictions.

One of the most common types of machine learning is classification, which involves training a model to categorize data points into two or more discrete classes or categories. For example, you could train a classification model to:

  • Determine if an email is spam or not spam
  • Diagnose whether a medical image shows a tumor or not
  • Categorize a news article as politics, sports, entertainment, etc.
  • Predict whether an online transaction is fraudulent

Traditionally, building a classification model required manually going through a number of time-consuming and complex steps:

  1. Collect and preprocess a labeled training dataset
  2. Select an appropriate classification algorithm (e.g. logistic regression, decision trees, neural networks)
  3. Perform feature engineering to extract relevant features from the data
  4. Optimize the hyperparameters of the chosen algorithm via experimentation
  5. Train the model on the processed data
  6. Evaluate the model's performance on a test set
  7. Repeat steps 2-6 until performance is satisfactory
  8. Deploy the final trained model

For experienced data scientists and machine learning engineers, this process is doable but tedious and time-consuming. For those new to machine learning, knowing where to start and how to tune each step can feel overwhelming.
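
To make the contrast concrete, here is a rough sketch of what steps 2 through 6 can look like when done by hand with scikit-learn. The specific choices (a random forest, standard scaling, a tiny hyperparameter grid) are illustrative assumptions, and it presumes you already have a labeled train/test split (X_train, y_train, X_test, y_test) from step 1:

# Illustrative manual workflow: pick an algorithm, preprocess the
# features, and search a small hyperparameter grid by hand.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("scale", StandardScaler()),            # step 3: feature preprocessing
    ("model", RandomForestClassifier()),    # step 2: chosen algorithm
])

param_grid = {                              # step 4: hyperparameters to try
    "model__n_estimators": [100, 300],
    "model__max_depth": [None, 10, 30],
}

search = GridSearchCV(pipeline, param_grid, cv=5)   # steps 5 and 7: train, repeat
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))  # step 6: evaluate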

Fortunately, a new paradigm called automated machine learning (AutoML) has emerged to automate and simplify many of the manual, repetitive steps involved in the traditional machine learning workflow.

AutoML uses intelligent algorithms and heuristics to automatically:

  • Preprocess the input data
  • Perform feature engineering
  • Select the optimal model type and architecture
  • Optimize the model's hyperparameters
  • Train and tune the model
  • Provide an evaluation of the final model's performance

By automating these steps, AutoML makes it possible to build high-quality machine learning models quickly without extensive knowledge of algorithms or hyperparameter tuning. This makes the power of machine learning accessible to developers and domain experts, not just those with advanced degrees in data science and statistics.

To illustrate how AutoML works in practice, let's walk through an example of using a popular Python AutoML library called auto-sklearn to build a classification model.

We'll use a synthetic dataset simulating medical test results for diagnosing a rare disease. The dataset has 10,000 rows, 10 feature columns representing different blood tests and other biomarkers, and a final target column indicating whether the patient tested positive or negative for the disease.
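
This dataset is synthetic and isn't distributed with the post, so if you want to follow along you can generate a comparable stand-in yourself. The sketch below assumes 10 numeric biomarker columns and an imbalanced binary 'disease' label; the column names are made up:

# Generate a stand-in for medical_tests.csv: 10,000 rows, 10 numeric
# features, and an imbalanced binary target named "disease".
import pandas as pd
from sklearn.datasets import make_classification

features, target = make_classification(
    n_samples=10_000, n_features=10, n_informative=6,
    weights=[0.9, 0.1],   # make positive cases relatively rare
    random_state=42,
)

df = pd.DataFrame(features, columns=[f"biomarker_{i}" for i in range(10)])
df["disease"] = target
df.to_csv("medical_tests.csv", index=False)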

First, let's install the auto-sklearn package:

!pip install auto-sklearn

Next, we'll load the dataset into a pandas DataFrame and split it into training and test sets:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset and separate the features from the target column
data = pd.read_csv('medical_tests.csv')

X = data.drop('disease', axis=1)
y = data['disease']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now we'll import the AutoSklearnClassifier and train an optimized classification model using the fit() method:

import autosklearn.classification

# 120-second total budget, at most 30 seconds per candidate configuration
clf = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
)

clf.fit(X_train, y_train)

Here we set a total time limit of 120 seconds and a maximum of 30 seconds per model configuration. auto-sklearn will use this time budget to automatically search for the best-performing model type, hyperparameter values, and preprocessing steps.

Under the hood, auto-sklearn uses Bayesian optimization to intelligently explore many different model architectures and configurations, refining its search with each iteration to home in on the best-performing setup.

The power of AutoML is that it can test out hundreds of models in a short timeframe, far more than a human could manually. It also removes the guesswork and trial-and-error of the traditional model tuning process.

Once the fit() method finishes, we can evaluate the best model found by auto-sklearn on our test set:

from sklearn.metrics import accuracy_score, f1_score

# Predict on the held-out test set and score the predictions
y_pred = clf.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))

In this case, auto-sklearn was able to find a model with 87% accuracy and a 0.84 F1 score, which is a strong result. We can inspect the best-performing model:

print(clf.show_models())

This outputs:

[(1.0, SimpleClassificationPipeline({'balancing:strategy': 'none',
                                     'classifier:__choice__': 'random_forest',
                                     'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'no_encoding',
                                     'data_preprocessing:numerical_transformer:imputation:strategy': 'mean',
                                     'data_preprocessing:numerical_transformer:rescaling:__choice__': 'standardize',
                                     'feature_preprocessor:__choice__': 'select_rates_classification',
                                     'classifier:random_forest:bootstrap': 'True',
                                     'classifier:random_forest:criterion': 'gini',
                                     'classifier:random_forest:max_depth': 'None',
                                     'classifier:random_forest:max_features': 8.68295234155134,
                                     'classifier:random_forest:max_leaf_nodes': 'None',
                                     'classifier:random_forest:min_impurity_decrease': 0.0,
                                     'classifier:random_forest:min_samples_leaf': 2,
                                     'classifier:random_forest:min_samples_split': 5,
                                     'classifier:random_forest:min_weight_fraction_leaf': 0.0,
                                     'classifier:random_forest:n_estimators': 100,
                                     'feature_preprocessor:select_rates_classification:alpha': 0.24630541871921287,
                                     'feature_preprocessor:select_rates_classification:mode': 'fwe',
                                     'feature_preprocessor:select_rates_classification:score_func': 'f_classif'},
                                    dataset_properties={'task': 2,
                                                        'sparse': False,
                                                        'multilabel': False,
                                                        'multiclass': False,
                                                        'target_type': 'classification'})),
]

So auto-sklearn found that the random forest algorithm with the specified hyperparameters was the best model after exploring many different options. It also automatically selected relevant features and preprocessed the data in an optimized way.
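
If you're curious how extensive that exploration was, auto-sklearn can summarize the search after fitting. sprint_statistics() reports how many configurations ran, succeeded, or timed out, and leaderboard() (available in more recent releases, so check your installed version) ranks the models kept in the final ensemble:

# How many configurations were evaluated, and how they fared
print(clf.sprint_statistics())

# Ranked view of the models in the final ensemble (newer releases)
print(clf.leaderboard())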

To deploy this model, we can save it to disk and load it in a production environment to make predictions on new data points:

import joblib

# Save the trained classifier to disk
joblib.dump(clf, 'disease_classifier.joblib')

# Later, in the production environment, load it back
loaded_clf = joblib.load('disease_classifier.joblib')

# A new, unseen set of test results; values must be in the same
# order as the training feature columns
new_data = [[ 2.83076677, -0.69418854, 1.42458177, 0.3730851, 0.29178823, -0.68824723, 0.39213657, -0.22027125, 1.93652029, 0.92137593]]

print(loaded_clf.predict(new_data))
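
If you would rather work with a probability than a hard positive/negative label, for example to tune the decision threshold for a rare disease, the classifier also exposes the usual predict_proba() method:

# Probability of each class for the new data point; the second
# column is the estimated probability of a positive diagnosis
print(loaded_clf.predict_proba(new_data))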

While AutoML is a powerful tool for accelerating and improving the model building process, it's not a complete replacement for human expertise and intuition. Some limitations and things to keep in mind:

  • AutoML still requires a high-quality, labeled training dataset. It can't create signal where there is none.

  • For mission-critical applications where model interpretability is important (e.g. medical diagnosis, financial decisions), you may want a human to construct the model manually so you understand exactly how it works under the hood. Some AutoML-generated models can be complex and hard to interpret (see the sketch after this list for one partial workaround).

  • AutoML libraries make a lot of smart default choices, but manual feature engineering and model tweaking may still be required to squeeze out maximum performance. AutoML provides a strong baseline to build on.

  • Results can vary based on the specific AutoML library used. It's good to experiment with a few different ones.
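
On the interpretability point above, model-agnostic tools can recover at least some insight from an AutoML-built model. As a sketch, scikit-learn's permutation importance should work against the fitted classifier, since auto-sklearn estimators expose the usual predict/score interface (worth verifying against your installed versions):

# Model-agnostic importance: shuffle each feature and measure how
# much the test-set score drops when its values are scrambled
from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=42)
for name, score in sorted(zip(X_test.columns, result.importances_mean),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")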

Despite these caveats, AutoML is proving to be a revolutionary technology, making machine learning more accessible than ever before. By automating the hardest, most tedious parts of the model building process, it enables rapid iteration and lowers the barrier to extracting value from data.

As AutoML continues to evolve and improve, it may not be long before developing a powerful classification model is as simple as passing a raw dataset to an AutoML API and getting a highly optimized model back in seconds. Exciting times ahead!
