How to Classify Photos Into 600 Categories Using 9 Million Open Images

Machine learning has made incredible progress in computer vision tasks like image classification. But training accurate models requires vast amounts of labeled image data, which can be difficult and expensive to collect.

Enter Google's Open Images dataset. With 9 million labeled images, including bounding boxes across 600 object categories on a large subset, it's a gold mine for training custom vision models. The dataset covers an extremely wide range of concepts, from cats and cars to sushi and synthesizers.

Best of all, Open Images is licensed under Creative Commons, making it far more openly accessible than other comparably large datasets like ImageNet. Anyone can download and use it for commercial or research purposes.

In this article, we'll walk through the process of leveraging this incredible resource to build your own state-of-the-art image classifier. While conveniently pre-packaged subsets of Open Images exist, the real power comes from being able to mix and match categories to suit your exact application.

Want to identify different flavors of ramen in food photos? Open Images has you covered with seven ramen-related classes like "ramen", "instant noodle", and "noodle soup." Training a robot to avoid colliding with vehicles? Choose from dozens of relevant classes like "bicycle", "bus", "police car", and "Segway."

Downloading and Processing the Data

The biggest challenge in working with Open Images is simply acquiring the subset of data relevant to your task. The full dataset is a whopping 18 terabytes – not exactly something you can casually download to your laptop!

Instead, we'll need to write a script to fetch just the images and annotations we care about based on category keywords. Here's a streamlined downloader in Python that reads image IDs and original Flickr URLs from the Open Images metadata, fetches each image, and saves it to disk:

import csv
import multiprocessing
import os

import requests

def download_image(info):
    filename, url = info
    if not os.path.exists(filename):
        try:
            # Time out rather than hang on dead Flickr links.
            response = requests.get(url, timeout=30)
        except requests.RequestException:
            return False
        if response.status_code == 200:
            with open(filename, "wb") as f:
                f.write(response.content)
            return True
    return False

if __name__ == "__main__":
    # image_ids.csv: ImageID in column 0, original Flickr URL in column 2.
    with open("./image_ids.csv") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        image_info = [(f"./{row[0]}.jpg", row[2]) for row in reader]

    # Spread the downloads across one worker process per CPU core.
    with multiprocessing.Pool() as pool:
        result = pool.map(download_image, image_info)
    print(f"Downloaded {sum(result)} images.")

This uses the image IDs and original Flickr URLs included in the Open Images metadata. Splitting the downloads across multiple processes gives a huge speedup over fetching images one at a time.
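One detail glossed over above: image_ids.csv is assumed to be pre-filtered to the categories we care about. Open Images ships a class-descriptions CSV that maps machine label IDs to human-readable names, so a keyword search over it is one way to build that filter. A minimal sketch, assuming the standard metadata file names and column layout from the Open Images releases:

import csv

KEYWORDS = {"ramen", "noodle"}  # hypothetical keywords for our subset

# Map human-readable class names to Open Images machine IDs (/m/...).
label_ids = set()
with open("./class-descriptions-boxable.csv") as f:
    for label_id, name in csv.reader(f):
        if any(kw in name.lower() for kw in KEYWORDS):
            label_ids.add(label_id)

# Keep only images whose box annotations carry one of those labels.
matching_images = set()
with open("./annotations-bbox.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        if row[2] in label_ids:  # column 2 is the LabelName machine ID
            matching_images.add(row[0])

Intersecting matching_images with the image metadata CSV produces the filtered image_ids.csv the downloader expects.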

Of course, the raw images themselves are only half the story. To take full advantage of Open Images, we also need the bounding box annotations. These let us precisely localize objects and crop out irrelevant parts of the image.

We can adapt the downloading code to also grab the corresponding bounding box data and match it up with each image:

import csv
import multiprocessing

from PIL import Image

# Open Images boxes are normalized to [0, 1], stored as XMin, XMax, YMin, YMax.
bboxes = {}
with open("./annotations-bbox.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        image_id = row[0]
        x_min, x_max, y_min, y_max = map(float, row[4:8])
        bboxes.setdefault(image_id, []).append((x_min, x_max, y_min, y_max))

def crop_and_save(image_id):
    img = Image.open(f"./{image_id}.jpg")
    width, height = img.size
    for i, (x_min, x_max, y_min, y_max) in enumerate(bboxes[image_id]):
        # Scale to pixels and reorder to PIL's (left, upper, right, lower).
        box = (int(x_min * width), int(y_min * height),
               int(x_max * width), int(y_max * height))
        img.crop(box).save(f"./{image_id}_{i}.jpg")

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        pool.map(crop_and_save, bboxes.keys())

After running this, we'll have a nicely organized collection of object-focused images to train our model with. It's not uncommon to end up with wonky crops, especially if the original annotations were loose, but that's all part of working with real-world data!
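If the wonky crops bother you, one cheap mitigation is to skip boxes below a minimum pixel size before saving them. The threshold here is an arbitrary illustration, not anything from the Open Images tooling:

MIN_SIDE = 32  # hypothetical cutoff, in pixels

def is_usable(box):
    # Reject degenerate or tiny crops before writing them to disk.
    left, upper, right, lower = box
    return (right - left) >= MIN_SIDE and (lower - upper) >= MIN_SIDE

Slotting a check like this into crop_and_save before the save call trades a few tiny crops for a cleaner training set.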

Training an Image Classifier

Now for the fun part: using our Open Images subset to train a cutting-edge neural network. We'll tap into the power of transfer learning by starting with a pre-trained model and fine-tuning it on our specific dataset.

The model architecture we'll use is MobileNetV2, which strikes an excellent balance between accuracy and efficiency. It's lightweight enough to run on mobile devices yet achieves results competitive with much bulkier networks.

We can easily load the pre-trained MobileNetV2 model using the Keras library in Python. We'll chop off the original classification head and replace it with our own layers to adapt the model to our classes:

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

num_classes = ...  # set to the number of categories in your subset

# Load ImageNet-pretrained weights, minus the original classification head.
base_model = MobileNetV2(weights='imagenet',
                         include_top=False,
                         input_shape=(224, 224, 3))

# Attach a fresh head sized for our own classes.
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
output = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=output)

# Freeze the pretrained backbone so only the new head trains at first.
for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

By leveraging the pre-learned low-level features from ImageNet, we can train an accurate classifier even with a relatively small number of examples per class.

To make the most of our data, we'll also apply some aggressive image augmentation, randomly transforming our images with flips, rotations, shifts, and zooms each time they're fed into the model:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1.0/255,
                                   rotation_range=30,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.5,
                                   horizontal_flip=True)

# Validation and test data get the same rescaling but no augmentation.
test_datagen = ImageDataGenerator(rescale=1.0/255)

train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical')

validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical')

# model.fit accepts generators directly; fit_generator is deprecated.
model.fit(train_generator,
          steps_per_epoch=len(train_generator),
          epochs=30,
          validation_data=validation_generator,
          validation_steps=len(validation_generator))
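Thirty epochs of training can take a while. If you're running this unattended, the standard Keras callbacks are worth wiring in; the checkpoint path below is just an example:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Keep only the weights from the epoch with the best validation accuracy.
    ModelCheckpoint("./checkpoint.h5", monitor="val_accuracy",
                    save_best_only=True),
    # Stop early if validation loss plateaus for five straight epochs.
    EarlyStopping(monitor="val_loss", patience=5),
]

Pass callbacks=callbacks to the model.fit call above to enable them.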

After letting this train for a while, we should end up with a highly capable image classifier tailored to our chosen categories, able to recognize those objects in all kinds of real-world environments and lighting conditions.

So how does it actually perform? Let's find out by evaluating on a held-out test set:

test_generator = test_datagen.flow_from_directory(
        'data/test',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical')

# model.evaluate accepts generators directly; evaluate_generator is deprecated.
scores = model.evaluate(test_generator, steps=len(test_generator))
print(f"Test loss: {scores[0]}")
print(f"Test accuracy: {scores[1]}")

In my experiments, this approach reliably achieved 85%+ accuracy on a variety of different Open Images category subsets. Not perfect, but quite impressive for an afternoon's work! The model's mistakes tend to be understandable, like confusing closely related classes.
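To see exactly where those mistakes land, a confusion matrix is the natural tool. A minimal sketch using scikit-learn, rebuilding the test generator with shuffling disabled so predictions stay aligned with the true labels:

import numpy as np
from sklearn.metrics import confusion_matrix

# shuffle=False keeps predictions aligned with test_generator.classes.
test_generator = test_datagen.flow_from_directory(
        'data/test',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical',
        shuffle=False)

predictions = model.predict(test_generator, steps=len(test_generator))
predicted_classes = np.argmax(predictions, axis=1)

# Rows are true classes, columns are predictions; large off-diagonal
# entries reveal which pairs of classes the model conflates.
print(confusion_matrix(test_generator.classes, predicted_classes))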

There are plenty of knobs we could turn to eke out better performance, like more advanced data augmentation, ensembling multiple models, or increasing the model capacity. But this is already a solid baseline for many applications.
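One of those knobs deserves a sketch of its own: once the new head has converged, unfreeze the top of the backbone and keep training at a much lower learning rate. The layer count and learning rate here are illustrative starting points, not tuned values:

from tensorflow.keras.optimizers import Adam

# Unfreeze roughly the top third of MobileNetV2's layers.
for layer in base_model.layers[-50:]:
    layer.trainable = True

# Recompile with a small learning rate so we nudge, rather than
# clobber, the pretrained features, then train a few more epochs.
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_generator,
          steps_per_epoch=len(train_generator),
          epochs=10,
          validation_data=validation_generator,
          validation_steps=len(validation_generator))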

Sharing the Model

Part of the beauty of open data like Open Images is how it enables the machine learning community to build on each other's work. By sharing our trained model, we can spare others the time and effort of reproducing it from scratch.

To make our model maximally reproducible, we'll want to version control and distribute three key components:

  1. The model architecture and trained weights
  2. The code used to train the model
  3. The specific subset of Open Images used for training

For the model itself, we can export it to a standard format like Keras' HDF5 that includes both the architecture and learned parameters:

model.save("./mobilenet_open_images.h5")

This makes it trivial for anyone to load our exact model in one line:

from tensorflow.keras.models import load_model

model = load_model("./mobilenet_open_images.h5")

For the code, it's a matter of putting it in a public git repository, perhaps with thorough documentation and a license granting permissive usage rights. Platforms like GitHub and GitLab make this easy.

Finally, we need a way to share the Open Images subset we used. This is trickier since it's a large binary dataset that isn't well suited for git.

One option is to upload it to a file hosting service like S3 or Google Cloud Storage. But we can do better by using a data versioning tool like DVC or Quilt that‘s designed for machine learning workflows.

With Quilt's quilt3 client, we can create a versioned package containing our images and annotations:

import quilt3

p = quilt3.Package()

# set_dir packages a whole directory; set adds a single file.
p.set_dir("images", "./data/images")
p.set("annotations", "./data/annotations.csv")

p.push("my-user/open-images-subset", registry="s3://my-bucket")

Anyone can then pull down this exact dataset, along with our model and code, to reproduce or build upon our results:

import quilt3
from tensorflow.keras.models import load_model

# Download the packaged images and annotations into ./data.
quilt3.Package.install("my-user/open-images-subset",
                       registry="s3://my-bucket",
                       dest="./data")

model = load_model("./mobilenet_open_images.h5")

Ideas for Your Own Open Images Models

This is really just scratching the surface of what's possible with Open Images. With 600 categories at your disposal, you can cook up endless combinations tailored to your heart's desire:

  • Automatically sort your vacation photos by which landmarks and attractions are pictured
  • Build a visual search engine for furniture to find similar-looking chairs, tables, lamps, etc.
  • Train a produce-recognizing robot to assist with grocery shopping or kitchen prep
  • Create an app that identifies different musical instruments to help beginners learn
  • Automatically detect and blur out inappropriate content in user-generated images

The sky's the limit, so go wild! I encourage you to peruse the full list of Open Images categories and dream up your own applications.

Even if you don't have a specific use case in mind, exploring this gargantuan dataset is a fantastic way to practice your machine learning skills. From data pipelining to model design to infrastructure engineering, Open Images projects will stretch your abilities on every front.

Most importantly, remember to share your creations with the world! Teaching models to see is a challenging frontier, and we can go much farther together than apart.
