If you're a data scientist or machine learning engineer working in Python, chances are you rely heavily on the scikit-learn library. As one of the most popular open source machine learning libraries, scikit-learn provides a wide range of tools for tasks like classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
The recent 0.24 release of scikit-learn introduces several major new features that expand its capabilities and improve ease of use. In this article, we'll take an in-depth look at five of the most significant additions and show how you can start leveraging them in your own projects. Whether you're a scikit-learn power user or relatively new to the library, these latest updates are worth checking out.
1. Mean Absolute Percentage Error (MAPE) Metric
Evaluation metrics are a key part of assessing machine learning model performance. Scikit-learn 0.24 adds a commonly used metric that was previously missing from the library—mean absolute percentage error (MAPE). MAPE expresses the average prediction error as a percentage of the actual values, making it very interpretable.
Prior to this release, calculating MAPE required manually implementing the formula:
    import numpy as np
    def mape(y_true, y_pred):
        return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
Now, you can simply import the mean_absolute_percentage_error function from sklearn.metrics:
    from sklearn.metrics import mean_absolute_percentage_error
    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    mape_score = mean_absolute_percentage_error(y_true, y_pred)
    print(mape_score)
This outputs:
0.3273809523809524
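Note that the function returns a fraction, not a percentage: 0.327 here means an average error of about 32.7%. The optimal value is 0.0, indicating no error, but there is no upper bound of 1. When a true value is close to zero, the result can become arbitrarily large.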
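To see the scale of the returned value, here is a minimal pure-Python sketch of the same formula. It is not scikit-learn's implementation: the helper name mape_fraction is hypothetical, and the small epsilon guard in the denominator is an assumption modeled on the behavior described in the scikit-learn documentation.

```python
import sys

def mape_fraction(y_true, y_pred):
    """Mean absolute percentage error as a fraction (multiply by 100 for a percentage)."""
    eps = sys.float_info.epsilon  # assumption: avoids dividing by exactly zero
    ratios = [abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)

# Same inputs as the example in this section
article_example = mape_fraction([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])

# A true value close to zero pushes the result far above 1
near_zero_example = mape_fraction([0.01, 2.0], [1.0, 2.0])

print(f"{article_example:.4f}")
print(f"{near_zero_example:.1f}")
```

The second call is a reminder to be careful with MAPE whenever the target can be zero or very small.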
Having MAPE readily available will save you from having to reimplement it each time and provides a standardized method for calculating this metric. It's a useful option to consider alongside other built-in regression metrics like mean squared error and mean absolute error.
2. Handling Missing Values in OneHotEncoder
The OneHotEncoder in scikit-learn is widely used to convert categorical variables into a format suitable for machine learning algorithms. Until now, it raised an error if the input data contained any missing values. The updated OneHotEncoder in version 0.24 handles missing values by default, treating them as an additional category.
Here's an example of how it works. First, let's create a simple DataFrame with a categorical feature that has some missing values:
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    data = {'color': ['red', 'green', 'blue', np.nan, 'red', np.nan]}
    df = pd.DataFrame(data)
Next, we'll instantiate the OneHotEncoder and fit it to the data:
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(df)
Now we can transform the data, which will create one additional column for the missing values:
ohe.transform(df).toarray()
This returns:
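    array([[0., 0., 1., 0.],
           [0., 1., 0., 0.],
           [1., 0., 0., 0.],
           [0., 0., 0., 1.],
           [0., 0., 1., 0.],
           [0., 0., 0., 1.]])
The first three columns follow the sorted categories (blue, green, red), and the missing values have been assigned to a new category in the last column. This makes it easy to handle missing data without extra processing steps. Note that handle_unknown is a separate setting: it controls what happens when transform meets a category that was not seen during fit. The default, handle_unknown='error', raises an error in that case, while 'ignore' (used above) encodes such a value as all zeros.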
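The difference between a missing value and an unknown category is easy to mix up, so here is a short sketch that shows both. It assumes scikit-learn 0.24 or later and uses a NumPy object array instead of a DataFrame to keep the example small. The value 'purple' is just an illustrative unseen category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Object dtype keeps np.nan as a real missing value instead of the string "nan"
colors = np.array([["red"], ["green"], ["blue"], [np.nan]], dtype=object)

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(colors)

# Learned categories for the single input column (the missing value is one of them)
print(ohe.categories_[0])

# "purple" was not seen during fit; with handle_unknown="ignore" its row is all zeros
unseen = np.array([["purple"]], dtype=object)
encoded_unseen = ohe.transform(unseen).toarray()
print(encoded_unseen)
```

With the default handle_unknown='error', the last transform call would raise instead of returning a row of zeros.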
3. SequentialFeatureSelector for Feature Selection
Feature selection is the process of narrowing down an initial set of features to a subset that is most relevant to the problem. The new SequentialFeatureSelector class in scikit-learn 0.24 provides an efficient way to automate this task and identify the most informative features.
There are two main types of feature selection the SequentialFeatureSelector can perform:
- Forward selection: This approach starts with zero features and iteratively adds the most useful features one at a time until the desired number is reached.
- Backward selection: This approach starts with all features and iteratively removes the least useful ones until the desired number remains.
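The forward procedure described above can be sketched as a plain greedy loop. This is an illustration of the idea only, not scikit-learn's code: forward_select, toy_score, and the usefulness table are hypothetical names and made-up numbers, and the scoring function stands in for the cross-validated model score a real selector would compute.

```python
def forward_select(n_features, n_to_select, score):
    """Greedy forward selection; `score` maps a tuple of feature indices to a number (higher is better)."""
    selected = []
    n_evaluations = 0
    while len(selected) < n_to_select:
        best_feature, best_score = None, float("-inf")
        for f in range(n_features):
            if f in selected:
                continue
            candidate_score = score(tuple(selected + [f]))
            n_evaluations += 1
            if candidate_score > best_score:
                best_feature, best_score = f, candidate_score
        selected.append(best_feature)
    return selected, n_evaluations

# Hypothetical scores: features 0 and 2 carry signal, features 1 and 3 add nothing
usefulness = {0: 0.3, 1: 0.0, 2: 0.5, 3: 0.0}

def toy_score(subset):
    return sum(usefulness[f] for f in subset)

selected, n_evaluations = forward_select(4, 2, toy_score)
print(selected, n_evaluations)
```

Picking 2 of 4 features takes 4 + 3 = 7 subset evaluations here. In SequentialFeatureSelector each evaluation is a cross-validated score, so the number of model fits is that count multiplied by the number of folds. Backward selection works the same way in reverse, dropping the feature whose removal hurts the score least.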
Here's an example of how to use the SequentialFeatureSelector for backward selection on the iris dataset:
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier
    X, y = load_iris(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=3)
    sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction='backward')
    sfs.fit(X, y)
    print(sfs.get_support())
This outputs:
[ True False True False]
The get_support method returns a boolean mask indicating which features were selected. In this case, the first and third features were chosen as the most informative for the KNN classifier.
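Once the selector is fitted, the mask is usually not applied by hand. A minimal sketch, assuming the same iris setup as above: the transform method keeps only the selected columns.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction="backward")
sfs.fit(X, y)

# transform() drops the columns that were not selected
X_selected = sfs.transform(X)
print(X.shape, "->", X_selected.shape)
```

Because the selector follows the usual fit/transform interface, it can also be placed as a step in a Pipeline ahead of the final estimator.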
One thing to keep in mind is that the SequentialFeatureSelector can be computationally expensive, since it evaluates many models during the selection process. However, it's a powerful tool to have at your disposal when you need to reduce the dimensionality of your feature space.
4. Hyperparameter Tuning with Successive Halving
Hyperparameter tuning is an essential step in machine learning to find the optimal settings for a given model. Scikit-learn already includes GridSearchCV and RandomizedSearchCV for this purpose, but version 0.24 introduces two new experimental classes called HalvingGridSearchCV and HalvingRandomSearchCV that implement a novel approach known as successive halving.
The idea behind successive halving is simple but effective. Rather than allocating the full dataset to each hyperparameter configuration, the algorithm starts by training on a small subset. The top-performing configurations are then selected to advance to the next round, where they are trained on a larger subset. This process continues until the best overall configuration is found.
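To make the round structure concrete, here is a small sketch of how the number of candidates and the amount of data could evolve. It is a simplified model under stated assumptions, not scikit-learn's exact bookkeeping: halving_schedule and its arguments are hypothetical, each round keeps roughly 1/factor of the candidates, and the sample budget is multiplied by factor.

```python
import math

def halving_schedule(n_candidates, min_resources, factor, max_resources):
    """Return (n_candidates, n_samples) per round for a simplified successive-halving run."""
    schedule = []
    resources = min_resources
    while n_candidates > 1 and resources <= max_resources:
        schedule.append((n_candidates, resources))
        n_candidates = math.ceil(n_candidates / factor)  # keep the top 1/factor
        resources *= factor                              # survivors get more data
    return schedule

rounds = halving_schedule(n_candidates=8, min_resources=50, factor=2, max_resources=1000)
for i, (n, r) in enumerate(rounds):
    print(f"round {i}: {n} candidates, {r} samples each")

# Rough cost comparison in "samples trained on", ignoring cross-validation folds
total_halving = sum(n * r for n, r in rounds)
total_full = 8 * 1000
print(total_halving, "vs", total_full)
```

In this toy setting the halving run touches 1,200 samples in total, compared with 8,000 if all eight candidates were trained on the full 1,000 samples.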
Successive halving can lead to substantial computational savings compared to traditional search methods, while still identifying high-quality hyperparameters. Here's a quick example of how to use HalvingRandomSearchCV with a random forest classifier:
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    # The halving searches are experimental and must be enabled explicitly
    from sklearn.experimental import enable_halving_search_cv
    from sklearn.model_selection import HalvingRandomSearchCV
    X, y = make_classification(n_samples=1000)
    rf = RandomForestClassifier()
    param_dist = {'n_estimators': [50, 100, 200],
                  'max_depth': [3, 5, 7, 9],
                  'min_samples_split': [2, 5, 10]}
    search = HalvingRandomSearchCV(estimator=rf, param_distributions=param_dist, factor=2, min_resources=50)
    search.fit(X, y)
    print(search.best_params_)
In this case, we define a parameter grid for the random forest and initialize the HalvingRandomSearchCV with a few key settings. The factor parameter controls how aggressively the search space is reduced each iteration: with factor=2, roughly half of the candidates survive each round and the amount of data they are trained on doubles. The min_resources parameter specifies the amount of data (here, 50 samples) to use in the first round.
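After fitting, the best_params_ attribute reveals the selected hyperparameters. Because neither the dataset nor the search fixes a random seed in this example, your values may differ:
    {'max_depth': 9, 'min_samples_split': 2, 'n_estimators': 200}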
The successive halving classes are still considered experimental in scikit-learn, but they offer a promising new direction for efficient hyperparameter optimization. It will be exciting to see how they evolve in future releases.
5. Semi-Supervised Learning with SelfTrainingClassifier
Semi-supervised learning is a powerful paradigm that combines a small amount of labeled data with a larger set of unlabeled data during training. The SelfTrainingClassifier introduced in scikit-learn 0.24 provides an easy way to leverage unlabeled data to improve model performance.
Here's a code snippet illustrating how it works:
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    # Mask 50% of the labels
    y_train_mixed = y_train.copy()
    y_train_mixed[::2] = -1
    svm = SVC(probability=True)
    self_training = SelfTrainingClassifier(base_estimator=svm)
    self_training.fit(X_train, y_train_mixed)
In this example, we first split the iris dataset into train and test sets. Then we mask half of the training labels by setting them to -1, indicating they are unlabeled.
Next, we initialize the SelfTrainingClassifier with an SVM as the base estimator. When we fit the classifier, it uses the labeled examples to generate predictions for the unlabeled ones. Predictions whose confidence exceeds a threshold (0.75 by default) are then added to the labeled set, and the process repeats until all examples have been labeled, no remaining prediction is confident enough, or the maximum number of iterations is reached.
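You can check how a particular run ended by inspecting the fitted model. The sketch below assumes the termination_condition_ and labeled_iter_ attributes described in the scikit-learn documentation. It fits on the whole iris dataset to stay short and passes the base estimator positionally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hide every other label; -1 marks a sample as unlabeled
y_mixed = y.copy()
y_mixed[::2] = -1

# threshold=0.75 is the documented default, written out here for clarity
model = SelfTrainingClassifier(SVC(probability=True, random_state=0), threshold=0.75)
model.fit(X, y_mixed)

# Why the loop stopped: 'all_labeled', 'no_change', or 'max_iter'
print(model.termination_condition_)

# labeled_iter_: 0 = labeled from the start, -1 = never labeled,
# k > 0 = pseudo-labeled during iteration k
n_given = int(np.sum(model.labeled_iter_ == 0))
n_never = int(np.sum(model.labeled_iter_ == -1))
print(n_given, "labeled up front,", n_never, "never labeled")
```

Raising the threshold makes the classifier more conservative about which pseudo-labels it accepts, at the cost of leaving more samples unlabeled.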
Under the hood, SelfTrainingClassifier iteratively calls the fit and predict_proba methods of the base estimator. This means you can use it with any classifier that supports probability estimates.
Adding self-training to scikit-learn's semi-supervised tools, alongside the existing LabelPropagation and LabelSpreading, is an exciting development that will make it easier to build models when labeled data is scarce. The SelfTrainingClassifier is a great addition to the library's ever-expanding toolbox.
Conclusion
Scikit-learn continues to evolve with each new release, and version 0.24 brings a host of useful features for data scientists and machine learning practitioners. From new evaluation metrics and data preprocessing options to novel approaches for feature selection, hyperparameter tuning, and semi-supervised learning, there's something for everyone in this release.
While we've only scratched the surface of what's new, the features covered in this article demonstrate the library's ongoing commitment to providing a comprehensive, easy-to-use platform for machine learning in Python. Whether you're a seasoned pro or just getting started with scikit-learn, it's worth taking some time to explore the latest additions and see how they can help streamline your workflow.
As always, the scikit-learn documentation is the best place to dive deeper into the details and find even more examples of the concepts discussed here. And if you're feeling inspired, consider contributing to the project: open source libraries like scikit-learn rely on the support of the community to keep moving forward.
What are your thoughts on the scikit-learn 0.24 release? Which new features are you most excited to try out in your own projects? Let me know in the comments below!