If you're a data scientist or machine learning engineer working in Python, chances are you rely heavily on the scikit-learn library. As one of the most popular open source machine learning libraries, scikit-learn provides a wide range of tools for tasks like classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

The recent 0.24 release of scikit-learn introduces several major new features that expand its capabilities and improve ease of use. In this article, we'll take an in-depth look at five of the most significant additions and show how you can start leveraging them in your own projects. Whether you're a scikit-learn power user or relatively new to the library, these latest updates are worth checking out.

1. Mean Absolute Percentage Error (MAPE) Metric

Evaluation metrics are a key part of assessing machine learning model performance. Scikit-learn 0.24 adds a commonly used metric that was previously missing from the library—mean absolute percentage error (MAPE). MAPE expresses the average prediction error as a percentage of the actual values, making it very interpretable.

Prior to this release, calculating MAPE required manually implementing the formula:

import numpy as np

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

Now, you can simply import the mean_absolute_percentage_error function from sklearn.metrics:

from sklearn.metrics import mean_absolute_percentage_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mape_score = mean_absolute_percentage_error(y_true, y_pred)
print(mape_score)

This outputs:

0.3273809523809524

Note that the function returns an error value between 0 and 1, not a percentage. The optimal value is 0.0, indicating no error.
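
If you'd rather report the familiar percentage form, you can simply scale the returned score by 100:

# Convert the fractional MAPE into a percentage
mape_percent = mape_score * 100
print(f"MAPE: {mape_percent:.2f}%")  # roughly 32.74%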

Having MAPE readily available will save you from having to reimplement it each time and provides a standardized method for calculating this metric. It's a useful option to consider alongside other built-in regression metrics like mean squared error and mean absolute error.

2. Handling Missing Values in OneHotEncoder

The OneHotEncoder in scikit-learn is widely used to convert categorical variables into a format suitable for machine learning algorithms. However, it would raise an error if the input data contained any missing values. The updated OneHotEncoder in version 0.24 now handles missing values by default, treating them as an additional category.

Here's an example of how it works. First, let's create a simple DataFrame with a categorical feature that has some missing values:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = {'color': ['red', 'green', 'blue', np.nan, 'red', np.nan]}
df = pd.DataFrame(data)

Next, we'll instantiate the OneHotEncoder and fit it to the data:

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(df)

Now we can transform the data, which will create one additional column for the missing values:

ohe.transform(df).toarray()

This returns:

array([[0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

As you can see, the missing values have been assigned to a new category in the last column. This makes it easy to handle missing data without extra processing steps. You can still specify handle_unknown='error' if you want the encoder to raise an error for unknown categories like before.
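
If you want to confirm how the output columns are ordered, you can inspect the fitted encoder's categories_ attribute, which lists the categories learned during fit (alphabetical, with the missing-value category last):

# Inspect the learned categories; NaN shows up as its own category
print(ohe.categories_)
# [array(['blue', 'green', 'red', nan], dtype=object)]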

3. SequentialFeatureSelector for Feature Selection

Feature selection is the process of narrowing down an initial set of features to a subset that is most relevant to the problem. The new SequentialFeatureSelector class in scikit-learn 0.24 provides an efficient way to automate this task and identify the most informative features.

There are two main types of feature selection the SequentialFeatureSelector can perform:

  • Forward selection: This approach starts with zero features and iteratively adds the most useful features one at a time until the desired number is reached.
  • Backward selection: This approach starts with all features and iteratively removes the least useful ones until the desired number remains.

Here's an example of how to use the SequentialFeatureSelector for backward selection on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction='backward')
sfs.fit(X, y)

print(sfs.get_support())

This outputs:

[ True False  True False]  

The get_support method returns a boolean mask indicating which features were selected. In this case, the first and third features were chosen as the most informative for the KNN classifier.
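
Since the selector follows the standard transformer API, you can also call transform on the fitted selector to obtain the reduced feature matrix directly:

# Keep only the selected columns of X
X_selected = sfs.transform(X)
print(X_selected.shape)  # (150, 2) -- two of the four iris features remain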

One thing to keep in mind is that the SequentialFeatureSelector can be computationally expensive, since it evaluates many models during the selection process. However, it's a powerful tool to have at your disposal when you need to reduce the dimensionality of your feature space.

4. Hyperparameter Tuning with Successive Halving

Hyperparameter tuning is an essential step in machine learning to find the optimal settings for a given model. Scikit-learn already includes GridSearchCV and RandomizedSearchCV for this purpose, but version 0.24 introduces two new experimental classes called HalvingGridSearchCV and HalvingRandomSearchCV that implement a novel approach known as successive halving.

The idea behind successive halving is simple but effective. Rather than allocating the full dataset to each hyperparameter configuration, the algorithm starts by training on a small subset. The top-performing configurations are then selected to advance to the next round, where they are trained on a larger subset. This process continues until the best overall configuration is found.

Successive halving can lead to substantial computational savings compared to traditional search methods, while still identifying high-quality hyperparameters. Here's a quick example of how to use HalvingRandomSearchCV with a random forest classifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Importing this module enables the experimental halving search estimators
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=1000)
rf = RandomForestClassifier()

param_dist = {'n_estimators': [50, 100, 200],
              'max_depth': [3, 5, 7, 9],
              'min_samples_split': [2, 5, 10]}

search = HalvingRandomSearchCV(estimator=rf, param_distributions=param_dist,
                               factor=2, min_resources=50)
search.fit(X, y)

print(search.best_params_)

In this case, we define a parameter grid for the random forest and initialize the HalvingRandomSearchCV with a few key settings. The factor parameter controls how aggressively the search space is reduced each iteration, while min_resources specifies the minimum amount of data to use in the first round.
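
With min_resources=50 and factor=2, for example, the first round trains every candidate on 50 samples, the survivors then get 100, then 200, and so on, while the candidate pool shrinks by roughly the same factor each round. To see how a particular search actually progressed, a quick sketch is to inspect the fitted object's n_resources_ and n_candidates_ attributes:

# How many samples were used and how many configurations survived each round
print(search.n_resources_)   # e.g. [50, 100, 200, ...] samples per round
print(search.n_candidates_)  # number of configurations evaluated in each round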

After fitting, the best_params_ attribute reveals the selected hyperparameters:

{'max_depth': 9, 'min_samples_split': 2, 'n_estimators': 200}

The successive halving classes are still considered experimental in scikit-learn, but they offer a promising new direction for efficient hyperparameter optimization. It will be exciting to see how they evolve in future releases.

5. Semi-Supervised Learning with SelfTrainingClassifier

Semi-supervised learning is a powerful paradigm that combines a small amount of labeled data with a larger set of unlabeled data during training. The SelfTrainingClassifier introduced in scikit-learn 0.24 provides an easy way to leverage unlabeled data to improve model performance.

Here's a code snippet illustrating how it works:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

y_train_mixed = y_train.copy()
y_train_mixed[::2] = -1  # mask 50% of the labels; -1 marks an example as unlabeled

svm = SVC(probability=True)

self_training = SelfTrainingClassifier(base_estimator=svm)
self_training.fit(X_train, y_train_mixed)

In this example, we first split the iris dataset into train and test sets. Then we mask half of the training labels by setting them to -1, indicating they are unlabeled.

Next, we initialize the SelfTrainingClassifier with an SVM as the base estimator. When we fit the classifier, it uses the labeled examples to generate predictions for the unlabeled ones. The most confident predictions are then added to the labeled set, and the process repeats until no new predictions meet the confidence criterion or a maximum number of iterations is reached.

Under the hood, SelfTrainingClassifier iteratively calls the fit and predict_proba methods of the base estimator. This means you can use it with any classifier that supports probability estimates.
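
To check whether the unlabeled examples actually helped, a minimal sketch (reusing the variables from the snippet above) is to score the self-trained model on the held-out test set and compare it against the same SVM fit only on the examples that kept their labels:

from sklearn.metrics import accuracy_score

# Accuracy of the self-trained model, which also saw the unlabeled examples
acc_self = accuracy_score(y_test, self_training.predict(X_test))

# Accuracy of a plain SVM trained only on the labeled half of the data
labeled = y_train_mixed != -1
plain_svm = SVC(probability=True).fit(X_train[labeled], y_train[labeled])
acc_plain = accuracy_score(y_test, plain_svm.predict(X_test))

print(acc_self, acc_plain)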

Adding semi-supervised learning capabilities to scikit-learn is an exciting development that will make it easier to build models when labeled data is scarce. The SelfTrainingClassifier is a great addition to the library's ever-expanding toolbox.

Conclusion

Scikit-learn continues to evolve with each new release, and version 0.24 brings a host of useful features for data scientists and machine learning practitioners. From new evaluation metrics and data preprocessing options to novel approaches for feature selection, hyperparameter tuning, and semi-supervised learning, there's something for everyone in this release.

While we've only scratched the surface of what's new, the features covered in this article demonstrate the library's ongoing commitment to providing a comprehensive, easy-to-use platform for machine learning in Python. Whether you're a seasoned pro or just getting started with scikit-learn, it's worth taking some time to explore the latest additions and see how they can help streamline your workflow.

As always, the scikit-learn documentation is the best place to dive deeper into the details and find even more examples of the concepts discussed here. And if you're feeling inspired, consider contributing to the project: open source libraries like scikit-learn rely on the support of the community to keep moving forward.

What are your thoughts on the scikit-learn 0.24 release? Which new features are you most excited to try out in your own projects? Let me know in the comments below!
