Uncorking the Data Science of Wine Quality

Wine is a complex beverage with thousands of years of history and a devoted following around the world. Connoisseurs pride themselves on being able to distinguish the subtleties that separate an exquisite vintage from a merely average one. But what exactly makes a wine taste good? Is it possible to quantify the factors that determine wine quality?

Enter data science. By analyzing large datasets of wine chemical and sensory properties, we can start to uncover patterns and build predictive models to understand the science behind the art of winemaking. In this post, we'll walk through how to use Python and machine learning to tackle this fascinating challenge.

The Chemistry of Wine Flavor

Before diving into the data, it's helpful to understand some of the key chemical compounds that shape a wine's flavor profile:

  • Alcohols: Ethanol is the primary alcohol in wine, typically ranging from 10-15% ABV. It contributes to the body, perceived sweetness, and "warmth" of a wine. Higher alcohols like isobutanol and isoamyl alcohol are also present in small quantities and can add complexity.

  • Organic acids: The balance of tartaric, malic, lactic, and other acids gives wine its tart, crisp taste. Too much acidity can be harsh, while too little can make a wine taste flabby. The ideal level depends on the wine style.

  • Sugars: Grapes contain a mix of glucose and fructose sugars that are mostly fermented into alcohol. Leftover sugars contribute to a wine's level of sweetness. Dry wines have little to no residual sugar.

  • Phenolic compounds: These include non-flavonoids like hydroxycinnamates and stilbenes, and flavonoids like anthocyanins and tannins. They contribute to color, astringency, bitterness, and mouthfeel, and are more prominent in red wines.

  • Volatiles: Aroma compounds like monoterpenes, norisoprenoids, thiols, and esters that contribute to a wine's nose. They can be derived from the grapes themselves as well as from fermentation and aging processes.

Measuring these chemical properties in a lab is relatively straightforward. The challenge lies in figuring out how they interact to create the elusive perception of quality in a taster's mind.

Defining and Measuring Wine Quality

One issue with predicting wine quality is that there is no universal, objective definition. Ultimately it comes down to the subjective preferences of individual drinkers. However, most wine experts look for the following characteristics in a high-quality wine:

  • Balance: No one component (acidity, tannin, alcohol, etc.) should dominate. The elements should be harmonious.

  • Complexity: Wines with more intense, layered flavors and aromas are prized over those that are simple or one-note.

  • Typicity: How well a wine expresses the typical characteristics of its region, style, and grape varietal.

  • Aging potential: High-end wines should improve over time in the bottle, developing tertiary aromas and flavors.

  • Clarity and Stability: Lack of obvious faults like excessive volatile acidity, oxidation, or cloudiness.

Sensory scientists have developed standardized rubrics like the UC Davis 20-point scale and 100-point scale to rate wines more consistently. However, these are still inherently subjective and can suffer from bias, individual variability, and interaction effects.

Dr. Hildegarde Heymann of UC Davis, a leading sensory scientist in the wine industry, notes that quality ratings are influenced by many external factors beyond just a wine's chemistry:

"In addition to the intrinsic flavor of the wine, the perception of quality is impacted by things like the context in which the wine is tasted, the taster‘s level of experience and knowledge, and even the price and packaging of the wine. It‘s an incredibly complex interaction between sensory, cognitive, and affective processes."

Collecting quality ratings from experts, as well as regular consumers, is just as important as the chemical data in building robust predictive models. Some in the industry are even exploring using AI to analyze wine reviews and tasting notes to extract more nuanced information.
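
As a small illustration of the idea, free-text tasting notes can be turned into numerical features with a bag-of-words approach. The sketch below uses scikit-learn's TfidfVectorizer on a few made-up notes (not real review data), just to show the shape of the technique:

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tasting notes purely for illustration, not real review data
notes = [
    'bright cherry and raspberry, firm tannins, long finish',
    'flabby and oxidized, with harsh volatile acidity on the nose',
    'balanced acidity, subtle oak, silky mouthfeel, dark plum fruit',
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(notes)

# Each note is now a weighted vector over the vocabulary of descriptors
print(vectorizer.get_feature_names_out())
print(tfidf.shape)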

Wrangling Wine Data

Now let's get our hands dirty with some real data. We'll be using a dataset of 1599 red wines from the Vinho Verde region of Portugal, graciously provided by Paulo Cortez of the University of Minho. It includes 11 physicochemical variables and a sensory quality score from 0-10:

import pandas as pd

df = pd.read_csv('winequality-red.csv', sep=';')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
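
All 1599 rows are complete, so there are no missing values to handle. It's still worth a quick glance at the summary statistics, since the features span very different ranges (a point that matters when we scale the data later):

# Summary statistics for each physicochemical feature
print(df.describe().T[['mean', 'std', 'min', 'max']])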

Let's take a look at the distribution of the quality variable:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
sns.histplot(df.quality, kde=False, bins=6, edgecolor='black')
plt.xlabel('Quality Score', size=14)
plt.ylabel('Count', size=14)
plt.title('Distribution of Wine Quality Scores', size=16)
plt.show()

Wine Quality Distribution

The majority of wines fall in the average 5-6 quality range, with very few scoring below 4 or above 7. To simplify modeling, we can bucket the quality scores into discrete categories:

df['quality_label'] = pd.cut(df.quality, bins=[0, 5, 7, 10], labels=['poor', 'average', 'good'])
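
Because scores above 7 are rare, the resulting classes are far from balanced, which is worth keeping in mind when evaluating the models later:

# Check how many wines fall into each quality bucket
print(df['quality_label'].value_counts())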

Next we'll transform the data to make it suitable for machine learning algorithms:

  • Split the data into training and test sets. We'll hold out 20% of the data to evaluate model performance on unseen examples.
  • Scale the feature columns to have zero mean and unit variance, fitting the scaler on the training set only so that no information from the test set leaks into preprocessing. This prevents features with larger magnitudes from dominating the objective function.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = df.drop(['quality', 'quality_label'], axis=1)
y = df.quality_label

# Split first, then scale, fitting the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

We now have our data cleaned, scaled, and split into training and test sets. Time to start building some models!
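
As a quick sanity check, we can confirm that the class proportions look similar in the two splits; if the rare 'good' class ended up badly under-represented in either one, passing stratify=y to train_test_split would be one way to address it:

# Compare class proportions across the training and test sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))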

Modeling Wine Quality

Since we're trying to predict a categorical outcome (poor, average, or good quality), this is a classification problem. We'll evaluate a few different popular algorithms:

  • Logistic Regression: A linear model that estimates the probability of an example belonging to each class.
  • Support Vector Machines: Tries to find a hyperplane that maximally separates the classes in high-dimensional space.
  • Random Forest: An ensemble of decision trees that collectively vote on the predicted class.
  • Multi-layer Perceptron: A simple neural network with one hidden layer.

Techniques like cross-validation and grid search can be used to tune the hyperparameters of each model for optimal performance:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC 
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'SVC': SVC(),
    'RandomForest': RandomForestClassifier(),
    'MultiLayerPerceptron': MLPClassifier()
}

for name, model in models.items():
    print(f'Training {name}...')

    params = {}
    if name == 'SVC':
        params = {'C': [0.1, 1.0, 10.0], 'kernel': ['linear', 'poly', 'rbf']}
    elif name == 'RandomForest':
        params = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30]}
    elif name == 'MultiLayerPerceptron':
        params = {'hidden_layer_sizes': [(50,), (100,)], 'alpha': [0.0001, 0.001, 0.01]}

    grid = GridSearchCV(estimator=model, param_grid=params, cv=5)
    grid.fit(X_train, y_train)
    model = grid.best_estimator_
    models[name] = model  # keep the fitted best estimator for later inspection

    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')

    print(f'Best model hyperparameters: {grid.best_params_}')
    print(f'Test Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}\n')

The results show that the Random Forest model performs best, achieving 81% accuracy, 0.82 precision, and 0.77 recall on the test set. The grid search found that using 200 trees with a maximum depth of 20 gave optimal results.
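
Accuracy alone can mask how the model handles the small 'good' class, so a per-class breakdown is also worth a look. A short check using the fitted Random Forest stored in the models dictionary:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = models['RandomForest'].predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))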

We can inspect which features the Random Forest model considers most important for distinguishing wine quality:

rf = models['RandomForest']
importances = pd.Series(rf.feature_importances_, index=X.columns)

top_features = importances.nlargest(5)
print('Top 5 Features for Predicting Wine Quality:')
print(top_features)
Top 5 Features for Predicting Wine Quality:
alcohol                 0.171052
volatile acidity        0.124322
sulphates               0.119693
total sulfur dioxide    0.096843
chlorides               0.081143
dtype: float64

Alcohol content, volatile acidity, and sulphate levels are considered the most predictive variables by the Random Forest, which aligns with the domain knowledge of wine experts.

Of course, even the best-performing model is far from perfect. An accuracy of 81% means the model is still wrong roughly one time in five. Wine quality is determined by complex interactions between many variables, not all of which may be captured in this dataset. Incorporating more data like grape variety, vintage year, and wine region could help build more accurate models.
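
If such metadata were available, one simple way to fold it in would be to one-hot encode the categorical columns before scaling. The columns and values below are hypothetical, since the Vinho Verde dataset doesn't include them:

# Hypothetical metadata columns (not part of winequality-red.csv)
extra = pd.DataFrame({
    'grape_variety': ['variety_a', 'variety_b', 'variety_a'],
    'vintage_year': [2018, 2019, 2018],
})

# One-hot encode the grape variety; the vintage year can stay numeric
encoded = pd.get_dummies(extra, columns=['grape_variety'])
print(encoded)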

Industry Applications and Future Research

The wine industry is increasingly adopting data science and machine learning techniques to help improve wine production and quality control. Some potential use cases include:

  • Precision Viticulture: Using sensors and remote imaging to monitor vine health, soil moisture, and microclimate conditions in real time. ML models can recommend optimal irrigation, fertilization, and pest control strategies.

  • Fermentation Monitoring: Analyzing data from fermentation tanks to predict and prevent stuck fermentations or detect faults early. Could enable winemakers to correct issues before they ruin a whole batch.

  • Blending and Sensory Analysis: Using ML to suggest optimal grape variety and proportion blends to achieve a target flavor profile. Models trained on wine reviews and tasting notes can identify sensory attributes correlated with higher quality scores.

  • Recommender Systems: Building personalized wine recommendation engines based on a customer's previous purchases, ratings, and tasting preferences. These can improve sales and customer loyalty for wineries and retailers.

Beyond industry applications, many researchers are also working to advance the state of the art in wine quality modeling.

As we collect more and higher-quality data at every stage of the winemaking process, from vineyard to bottle, the predictive power of our models will only improve. Winemakers may never be completely replaced by algorithms, but they can certainly use data science as a tool to help consistently craft higher quality wines and deliver a better experience for customers.

Concluding Thoughts

Wine has been a part of human civilization for millennia, and there's a reason it continues to captivate us. At its best, wine is a marriage between artistry and chemistry, tradition and innovation, individual expression and regional typicity.

Data science techniques like machine learning can help us better understand the complex factors that determine wine quality. By analyzing the chemical and sensory properties of wine, we can build models to predict how different variables impact the final product. While far from perfect, these models can be a useful tool for winemakers looking to optimize production and create higher quality wines.

However, it's important to remember that data is only one part of the equation. The beauty and mystery of wine also lies in the human stories behind it – the passion and intuition of the winemakers, the culture and history of different wine regions, the subjective experiences of everyone who enjoys it.

Numbers alone can never fully capture what makes a truly great wine sing. It takes an artist's palate and craftsmanship to create something that can inspire contemplation and conversation, evoke a treasured memory, or elevate a simple meal into an unforgettable experience.

So let us raise a glass to the promise and potential of wine data science, and all those who dedicate themselves to unlocking the secrets of this endlessly fascinating beverage. May they continue to blend cutting-edge technology with time-honored tradition, and bring us ever closer to the holy grail of the perfect pour. Cheers!
