How I Developed a CNN That Recognizes Emotions and Broke Into the Kaggle Top 10

As humans, we have an innate ability to read emotions from facial expressions. Just by looking at someone's face, we can usually tell if they are happy, sad, angry, surprised, disgusted, or afraid. This non-verbal communication is a key part of how we interact with each other.

But what about computers? Can we teach machines to recognize human emotions the same way we do? It's a fascinating challenge and one that researchers have been tackling for decades. Thanks to advances in deep learning and the availability of large facial expression datasets, emotion recognition systems are becoming increasingly accurate and finding applications in areas like driver monitoring, retail analytics, and mental health.

In this post, I'll share how I used a convolutional neural network (CNN) to build an emotion recognition model that broke into the top 10 on Kaggle's Facial Expression Recognition Challenge. I'll walk through my process step-by-step, from exploring the dataset to training the model and evaluating results. Let's dive in!

The FER2013 Dataset

The dataset used in this competition is FER2013, which contains 35,887 grayscale 48×48 pixel images of faces. It was collected by Pierre-Luc Carrier and Aaron Courville using the Google image search API. Faces were automatically registered so that each face is roughly centered and occupies about the same amount of space in its image; the images were then resized to 48×48 pixels and converted to grayscale.

Each image is labeled with one of seven emotion categories:
0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral

Here are some example images from the dataset:

FER2013 example images

The dataset is provided as a single CSV file with three columns:

  • emotion: The numeric label for the emotion (0-6)
  • pixels: A string containing the pixel values for the 48×48 grayscale image
  • Usage: Indicates which dataset this row belongs to – Training, PublicTest, or PrivateTest
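
Before writing any training code, it helps to see how the rows break down. Here is a minimal sketch using pandas (the file path matches the one used later in this post; the split and class counts are properties of the dataset itself):

import pandas as pd

# Load the raw CSV and inspect how rows are distributed
df = pd.read_csv('data/fer2013/fer2013.csv')

print(df['Usage'].value_counts())    # Training / PublicTest / PrivateTest splits
print(df['emotion'].value_counts())  # per-class counts; Disgust is heavily underrepresented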

To make it easier to work with, I wrote a custom generator function that reads the CSV file and yields batches of preprocessed images and labels for a requested Usage split:

import csv

import numpy as np
from tensorflow.keras.utils import to_categorical

def fer2013_generator(path, batch_size, usage, num_classes=7):
    while True:  # loop forever so Keras can keep drawing batches across epochs
        with open(path) as f:
            csv_reader = csv.reader(f)
            next(csv_reader)  # skip the header row

            images, labels = [], []
            for row in csv_reader:
                emotion, pixels, row_usage = row
                if row_usage != usage:  # keep only rows from the requested split
                    continue

                pixels = np.array(pixels.split(), 'float32')
                image = pixels.reshape(48, 48, 1) / 255.  # normalize to [0, 1]

                images.append(image)
                labels.append(int(emotion))

                if len(images) >= batch_size:
                    yield np.array(images), to_categorical(np.array(labels), num_classes)
                    images, labels = [], []

            if images:  # flush the final, possibly smaller batch
                yield np.array(images), to_categorical(np.array(labels), num_classes)

This generator reads the CSV file line by line, skipping the header row and keeping only rows whose Usage column matches the requested split. Each remaining row is split into its emotion, pixel, and usage columns. The pixel values are converted to a float32 numpy array, reshaped to 48×48×1, and normalized to the range [0, 1]. The emotion labels are one-hot encoded using Keras's to_categorical function.

Once the number of processed images reaches the specified batch size, the generator yields the batch of images and labels as numpy arrays. Because the outer while loop restarts the file once all rows are consumed, Keras can keep requesting batches across epochs, and we never have to load the entire dataset into memory at once.
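As a quick sanity check (a usage sketch, not part of the training pipeline itself), we can pull a single batch and confirm the shapes:

# Pull one batch from the training split and verify shapes
batch_images, batch_labels = next(fer2013_generator('data/fer2013/fer2013.csv', 128, 'Training'))

print(batch_images.shape)  # (128, 48, 48, 1)
print(batch_labels.shape)  # (128, 7)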

Building the CNN Model

With the data ready to go, the next step was designing the convolutional neural network to learn the emotion classification task. After some experimentation, I settled on the following architecture:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)

model = Sequential()

model.add(Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(48, 48, 1)))
model.add(BatchNormalization())
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(7, activation='softmax'))

This network consists of three convolutional blocks, followed by flatten and dense layers for classification. Some key aspects:

  • Each conv block contains two Conv2D layers with 'same' padding, ReLU activation, and batch normalization, followed by MaxPooling2D for downsampling and Dropout for regularization. The number of filters doubles at each block (32, 64, 128).

  • Batch normalization helps accelerate training by normalizing the activations of the previous layer at each batch. This allows using higher learning rates.

  • Dropout randomly sets a fraction of input units to 0 at each update during training, which helps prevent overfitting. I used dropout rates of 0.25 after each conv block and 0.5 after each dense layer.

  • After the final conv block, the feature maps are flattened and passed through two dense layers of size 512 and 256 with ReLU activation, batch norm, and dropout.

  • The output layer is a dense layer with 7 units and softmax activation, corresponding to the 7 emotion classes.

I used the Adam optimizer with a learning rate of 0.0005 and categorical cross-entropy loss. Here's the code to compile the model:

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.0005),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

The model summary shows roughly 2.8 million trainable parameters. This may seem like a lot, but it is actually quite small compared to many modern CNN architectures.
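Most of those parameters live in the first dense layer. A quick back-of-the-envelope check (my own arithmetic, not output from the code above):

# Three 2x2 max-pools shrink 48x48 inputs to 6x6 feature maps with 128 channels
flatten_size = 6 * 6 * 128              # 4,608 values per image
dense1_params = flatten_size * 512 + 512
print(dense1_params)                     # 2,359,808 -- the bulk of the model's weights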

Training the Model

To train the model, I used the dataset's predefined split: the Usage column divides FER2013 into training, validation, and test sets in roughly an 80/10/10 ratio:

batch_size = 128

train_gen = fer2013_generator('data/fer2013/fer2013.csv', batch_size, 'Training')
val_gen = fer2013_generator('data/fer2013/fer2013.csv', batch_size, 'PublicTest')
test_gen = fer2013_generator('data/fer2013/fer2013.csv', batch_size, 'PrivateTest')

I used a batch size of 128 and trained for up to 200 epochs (early stopping typically ends the run sooner), which took around 2 hours on a Google Colab GPU. To help prevent overfitting and allow training to progress further, I used a few Keras callback functions:

from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

checkpoint = ModelCheckpoint('emotion_model.h5',
                             monitor='val_accuracy',
                             save_best_only=True,
                             mode='max',
                             verbose=1)

reduce_lr = ReduceLROnPlateau(monitor='val_loss',
                              factor=0.5,
                              patience=7,
                              verbose=1,
                              min_lr=0.00001)

early_stop = EarlyStopping(monitor='val_loss',
                           patience=15)

callbacks = [checkpoint, reduce_lr, early_stop]

train_steps = int(np.ceil(28709 / batch_size))  # rows marked Training in FER2013
val_steps = int(np.ceil(3589 / batch_size))     # rows marked PublicTest

history = model.fit(train_gen,
                    steps_per_epoch=train_steps,
                    epochs=200,
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    callbacks=callbacks)

  • ModelCheckpoint saves the model weights to a file after every epoch where validation accuracy improves. This ensures we keep the best model over the course of training.

  • ReduceLROnPlateau reduces the learning rate by a factor of 0.5 if the validation loss doesn't improve for 7 epochs. This can help the model converge better in the later stages of training.

  • EarlyStopping stops training if the validation loss doesn't improve for 15 epochs. This is a useful safeguard against wasting computation.
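
To see how training progressed, the returned history object can be plotted. A simple sketch using matplotlib (the exact curves from my run are not shown here):

import matplotlib.pyplot as plt

# Plot training vs. validation accuracy over epochs
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()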

Evaluating Performance

After training finished, I loaded the saved model weights that gave the best validation accuracy and evaluated the model on the held-out PrivateTest set:

model.load_weights('emotion_model.h5')

test_len = 3589  # rows marked PrivateTest in FER2013
test_steps = int(np.ceil(test_len / batch_size))
test_loss, test_acc = model.evaluate(test_gen, steps=test_steps)

print('Test accuracy:', test_acc)

The model achieved 66.7% accuracy on the test set, which would have placed 7th on the original Kaggle leaderboard! Here's the full classification report:

              precision    recall  f1-score   support

       Angry       0.61      0.51      0.56       491
     Disgust       0.65      0.52      0.58        55
        Fear       0.64      0.46      0.53       528
       Happy       0.73      0.82      0.77       879
     Neutral       0.62      0.61      0.61       594
         Sad       0.65      0.63      0.64       594
    Surprise       0.81      0.79      0.80       416

    accuracy                           0.67      3557
   macro avg       0.67      0.62      0.64      3557
weighted avg       0.67      0.67      0.67      3557

The model performs best on the Happy and Surprise classes, with F1 scores of 0.77 and 0.80, respectively. It has the most trouble with Fear and Angry (F1 scores of 0.53 and 0.56); Disgust is also shaky, in part because it has only 55 test examples. These are emotions that humans often find hard to distinguish as well.

To get a better sense of where the model makes mistakes, I plotted a confusion matrix:

FER2013 confusion matrix

We can see that the most common confusions are between Sad and Neutral, Angry and Sad, and Angry and Disgust. Many of the off-diagonal values are non-zero, indicating that the model has room for improvement in learning to separate the emotion classes.
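
For reference, here is a sketch of how such a report and matrix can be produced with scikit-learn, assuming the test images and integer labels have been collected into arrays x_test and y_true (those variable names are illustrative, not from the code above):

from sklearn.metrics import classification_report, confusion_matrix

# Class names in FER2013 label order (0-6)
class_names = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

# Predicted class = index of the highest softmax probability
y_pred = model.predict(x_test).argmax(axis=1)

print(classification_report(y_true, y_pred, target_names=class_names))
print(confusion_matrix(y_true, y_pred))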

Future Work

While breaking into the Kaggle leaderboard top 10 is a great result, there are still many ways this emotion recognition model could be improved:

  • Experiment with different CNN architectures and hyperparameters. The current architecture was found through manual trial-and-error, but techniques like grid search could identify better configurations.

  • Use transfer learning to leverage models pre-trained on larger face datasets like VGGFace. The FER2013 dataset is relatively small, so initializing with pre-trained weights could boost performance.

  • Perform additional data augmentations like random rotations, shifts, shears, and zooms to increase training set diversity (see the sketch after this list). The validation and test sets should still use original images only.

  • Try other optimization algorithms like SGD with momentum or RMSprop and learning rate schedules like exponential decay.

  • Ensemble multiple models trained with different architectures and initializations. Ensembling is a common technique used by top Kaggle competitors.
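
As one concrete example of the augmentation idea, here is a sketch using Keras's ImageDataGenerator (the parameter values are illustrative, not tuned):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random geometric transforms applied on the fly to training images only
augmenter = ImageDataGenerator(rotation_range=10,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               shear_range=0.1,
                               zoom_range=0.1,
                               horizontal_flip=True)

# augmenter.flow(x_train, y_train, batch_size=128) then yields augmented batches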

Beyond improving accuracy, it would also be valuable to analyze and try to explain what features the CNN is learning to detect different emotions. Techniques like class activation mapping and saliency maps could provide insight into how the model makes its predictions.
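
As a minimal example of the saliency idea, one can take the gradient of the predicted class score with respect to the input pixels. A sketch using TensorFlow's GradientTape (not something I ran for this post; batch_images comes from the earlier sanity-check snippet):

import tensorflow as tf

# Saliency map: how much each input pixel affects the top predicted class
image = tf.convert_to_tensor(batch_images[:1])
with tf.GradientTape() as tape:
    tape.watch(image)
    preds = model(image)
    top_class = preds[0, tf.argmax(preds[0])]
grads = tape.gradient(top_class, image)
saliency = tf.reduce_max(tf.abs(grads), axis=-1)[0]  # 48x48 importance map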

Conclusion

In this post, I demonstrated how to build a convolutional neural network that can recognize emotions from facial expressions. Using the FER2013 dataset and an architecture with three conv blocks and two dense layers, the model achieved 66.7% test accuracy and a top 10 position on the Kaggle leaderboard.

While these results are promising, it's important to keep in mind the limitations and potential concerns around emotion recognition technology. These systems are not perfectly accurate and can reflect biases in the training data. Facial expressions also don't always reflect a person's true emotional state. Emotion recognition should be used carefully and transparently, with consideration of privacy and user consent.

That said, when developed responsibly, I believe emotion recognition has the potential to enable valuable applications and improve human-computer interaction. I'm excited to see what advances the coming years will bring. By sharing our approaches and participating in open competitions like those on Kaggle, we can work together as a community to drive this technology forward.

I've made the code for this project available on GitHub [link to repo]. Please check it out and let me know if you have any questions or feedback. Thanks for reading!
