Object Detection in Google Colab with Fizyr RetinaNet

Object detection is a fundamental task in computer vision that involves identifying and localizing objects of interest within an image. It has wide-ranging applications, from self-driving cars to medical image analysis to robotics and beyond. In recent years, deep learning models have achieved state-of-the-art performance on benchmarks like COCO and PASCAL VOC.

One of the top object detection models is RetinaNet, introduced by Facebook AI Research in the 2017 paper "Focal Loss for Dense Object Detection". RetinaNet builds on the popular Feature Pyramid Network (FPN) architecture used in other models like Faster R-CNN. However, it introduces several key innovations:

  1. The Focal Loss function (see the formula below) to address the large class imbalance between background and foreground classes in one-stage detectors
  2. Anchor boxes at multiple scales and aspect ratios to cover objects of different shapes and sizes
  3. Separate branches for object classification and bounding box regression
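
For reference, the focal loss from the paper down-weights easy, well-classified examples so that training focuses on the hard ones:

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)

where p_t is the model's estimated probability for the true class, the focusing parameter γ (typically 2) controls how strongly easy examples are down-weighted, and α_t balances the foreground and background classes.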

These improvements allow RetinaNet to match the speed of previous one-stage detectors like SSD and YOLO while surpassing the accuracy of two-stage detectors like Faster R-CNN. It achieved top results on the challenging COCO dataset.

Fortunately, you don't need an expensive GPU machine to start applying RetinaNet to your own object detection problems. In this tutorial, we'll walk through how to train a RetinaNet model on a custom dataset entirely for free using Google Colaboratory. Colab provides a Jupyter notebook environment that runs entirely in the cloud, with access to powerful GPUs. It's an excellent tool for machine learning projects.

We'll be using an open-source Keras implementation of RetinaNet developed by Fizyr. This provides a solid foundation so we can focus on preparing the dataset and running the training pipeline. Let's get started!

Preparing a Custom Dataset

The first step is to create an annotated dataset in the format expected by the Fizyr implementation. We'll need:

  1. A collection of images containing the objects you want to detect
  2. Bounding box annotations for each image in VOC XML format

For this example, let's consider the problem of detecting different fruits in images. I've put together a small dataset of 100 images with apples, bananas, and oranges. We'll use 80 images for training and 20 for validation.

To annotate the images, you can use a tool like LabelImg. This allows you to draw bounding boxes around each object and assign them class labels. It outputs annotations in the standard PASCAL VOC XML format.

Here's an example of what the annotations look like for one image:

<annotation>
    <folder>train</folder>
    <filename>IMG_1056.jpg</filename>
    <size>
        <width>1032</width>
        <height>581</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>apple</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>227</xmin>
            <ymin>139</ymin>
            <xmax>323</xmax>
            <ymax>216</ymax>
        </bndbox>
    </object>
    <object>
        <name>banana</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>473</xmin>
            <ymin>173</ymin>
            <xmax>638</xmax>
            <ymax>439</ymax>
        </bndbox>
    </object>
</annotation>

Once you have annotations for all images, arrange the dataset in the following structure:

dataset/
    train/
        image1.jpg
        image1.xml
        image2.jpg
        image2.xml
        ...
    validation/
        image81.jpg
        image81.xml
        ...
Zip up each of the train and validation folders so you have train.zip and validation.zip. Now your dataset is ready to use.

Training RetinaNet in Colab

Create a new notebook in Colab (with a GPU runtime enabled), upload your train.zip and validation.zip files, and unzip them with !unzip. Then clone and install Fizyr's keras-retinanet repository:

!git clone https://github.com/fizyr/keras-retinanet.git
!pip install ./keras-retinanet

The CSV generator in keras-retinanet expects one annotation per line in the form image_path,x1,y1,x2,y2,class_name, so we need to convert the VOC XML files into that format.
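
The repository does not ship a VOC-to-CSV converter, so here is a minimal sketch of one; the folder names (train, validation) and the output file names (train_annotations.csv, validation_annotations.csv) are assumptions that simply match the layout used earlier in this tutorial:

import csv
import glob
import os
import xml.etree.ElementTree as ET

def voc_to_csv(image_dir, output_csv):
    """Convert PASCAL VOC XML annotations in image_dir to a keras-retinanet CSV file."""
    with open(output_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        for xml_path in sorted(glob.glob(os.path.join(image_dir, '*.xml'))):
            root = ET.parse(xml_path).getroot()
            image_path = os.path.join(image_dir, root.findtext('filename'))
            for obj in root.iter('object'):
                box = obj.find('bndbox')
                # one row per bounding box: path,x1,y1,x2,y2,class_name
                writer.writerow([
                    image_path,
                    int(float(box.findtext('xmin'))),
                    int(float(box.findtext('ymin'))),
                    int(float(box.findtext('xmax'))),
                    int(float(box.findtext('ymax'))),
                    obj.findtext('name'),
                ])

voc_to_csv('train', 'train_annotations.csv')
voc_to_csv('validation', 'validation_annotations.csv')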

Next, define the classes for your detection problem in a classes.csv file like so:

apple,0
banana,1
orange,2
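
In Colab you can create this small file directly from a notebook cell with the %%writefile magic:

%%writefile classes.csv
apple,0
banana,1
orange,2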

We're now ready to start training! Load a pre-trained model to use as the starting point. Fizyr provides a RetinaNet model with a ResNet-50 backbone pre-trained on COCO, saved in the Keras .h5 format:

!wget https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet50_coco_best_v2.1.0.h5
PRETRAINED_MODEL = 'resnet50_coco_best_v2.1.0.h5'

Set an output directory to store the snapshots from training:

!mkdir snapshots

Now launch training with a command like this:

!python keras-retinanet/keras_retinanet/bin/train.py --freeze-backbone --random-transform --weights {PRETRAINED_MODEL} --batch-size 8 --steps 500 --epochs 20 csv train_annotations.csv classes.csv --val-annotations validation_annotations.csv

This trains for 20 epochs, starting from the pre-trained COCO model, using random image augmentation, a batch size of 8, and frozen backbone layers. The --steps argument controls the number of batches per epoch; with 80 training images and a batch size of 8, 500 steps means each epoch cycles through the (augmented) training set roughly 50 times. Generally, more steps and a larger batch size improve results but take longer.

As training progresses, you'll see the loss metrics decreasing:

Epoch 1/20
500/500 [==============================] - 495s 990ms/step - loss: 1.3250 - regression_loss: 0.7895 - classification_loss: 0.5355
Epoch 2/20
500/500 [==============================] - 389s 778ms/step - loss: 0.6808 - regression_loss: 0.5112 - classification_loss: 0.1696
...
Epoch 20/20
500/500 [==============================] - 389s 778ms/step - loss: 0.3012 - regression_loss: 0.2591 - classification_loss: 0.0422

The training script saves a model snapshot after each epoch in the snapshots directory. After 20 epochs, the loss should be quite low, indicating the model has learned to detect objects in our dataset.

Running Inference on Test Images

Let's test out our trained model on some new images!

First, pick the snapshot with the lowest validation loss from the snapshots directory and point model_path at it. Then load it with the keras-retinanet helpers and convert it to an inference model:

from keras_retinanet import models

model = models.load_model(model_path, backbone_name='resnet50')
model = models.convert_model(model)  # adds box decoding and non-maximum suppression for inference
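
Alternatively, keras-retinanet ships a convert_model.py script that performs the same conversion on disk. Assuming the default snapshot naming (resnet50_csv_{epoch}.h5), a command along these lines should work:

!python keras-retinanet/keras_retinanet/bin/convert_model.py snapshots/resnet50_csv_20.h5 inference_model.h5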

Now let's run the model on a test image:

import cv2
import numpy as np
from keras_retinanet.utils.image import read_image_bgr, preprocess_image, resize_image
from keras_retinanet.utils.visualization import draw_box, draw_caption
from keras_retinanet.utils.colors import label_color

# map class ids back to names (matches classes.csv)
labels_to_names = {0: 'apple', 1: 'banana', 2: 'orange'}

# load the image and keep an RGB copy for drawing
image = read_image_bgr('test_image.jpg')
draw = image.copy()
draw = cv2.cvtColor(draw, cv2.COLOR_BGR2RGB)

# preprocess and resize the image the same way as during training
image = preprocess_image(image)
image, scale = resize_image(image)

boxes, scores, labels = model.predict_on_batch(np.expand_dims(image, axis=0))
boxes /= scale  # map boxes back to the original image coordinates

for box, score, label in zip(boxes[0], scores[0], labels[0]):
    # detections are sorted by score, so stop at the first one below the threshold
    if score < 0.5:
        break

    color = label_color(label)

    b = box.astype(int)
    draw_box(draw, b, color=color)

    caption = "{} {:.3f}".format(labels_to_names[label], score)
    draw_caption(draw, b, caption)

Here are the resulting detections (drawing code omitted for brevity):

The model successfully finds the apples, oranges, and bananas in the test image! You can adjust the confidence threshold for displaying detections by changing the score < 0.5 comparison.
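
To display the annotated image inline in the notebook, a small matplotlib snippet like this works (the draw array comes from the loop above):

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.axis('off')
plt.imshow(draw)
plt.show()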

Conclusion and Next Steps

In this tutorial, you learned how to train your own object detector using Fizyr's implementation of RetinaNet in Keras. Some key takeaways:

  • Gather and annotate a custom dataset in VOC XML format with a tool like LabelImg
  • Convert the annotations to the CSV format expected by keras-retinanet and define a classes.csv file
  • Train a model starting from a pre-trained checkpoint
  • Tune hyperparameters like batch size, steps per epoch, and image transforms
  • Run inference on test images with filtering by confidence score
  • Visualize detections by drawing bounding boxes and labels

Overall I found Fizyr's code quite easy to use and was able to get strong results on a custom dataset without much hyperparameter tuning. The ability to run inference on CPU is also very handy for deploying trained models in production.

There are many ways you could extend this work:

  • Experiment with different ResNet backbone architectures
  • Gather more training data to further improve performance
  • Optimize anchor configurations and image scales for your use case
  • Export the trained model to a mobile-friendly format like TFLite

I hope this gives you a practical starting point for applying RetinaNet to your own object detection projects. The full Colab notebook for this tutorial is available on GitHub [here]. Thanks for reading!
