Web Scraping in Python – How to Scrape Sci-Fi Movies from IMDB

Are you a science fiction movie buff looking to analyze data on your favorite films? With a bit of Python coding and web scraping, you can extract tons of interesting data on sci-fi flicks from the Internet Movie Database (IMDB). In this tutorial, we'll walk through how to use Python to programmatically scrape data from IMDB and store it in a structured format for further analysis.

What is Web Scraping?

Web scraping is the process of automatically extracting data from websites using code. It allows you to gather information that you would otherwise have to painstakingly copy and paste. With web scraping, data that isn't available in convenient formats like CSV files or through APIs can be programmatically extracted from HTML pages and stored in a structured way.

Some common use cases for web scraping include:

  • Gathering prices from e-commerce sites for competitor analysis
  • Extracting contact info or other details from directories
  • Pulling sports stats or stock market data for analysis
  • Collecting social media posts or reviews to analyze sentiment
  • Aggregating news articles or blog posts on certain topics

For our purposes, we'll be focusing on scraping data about science fiction movies from the popular film site IMDB. IMDB publishes some bulk datasets for non-commercial use, but they cover only part of what appears on a movie page, so in this tutorial we'll gather the data ourselves with web scraping.

Web Scraping Tools in Python

Python is a popular language for web scraping due to its ease of use and the many helpful libraries available for extracting data from HTML. Here are some of the key Python tools we'll be using:

Requests – This library allows you to send HTTP requests in Python. We'll use it to download the HTML content of web pages we want to scrape.

BeautifulSoup – BeautifulSoup is a Python library for parsing HTML and XML documents. It allows you to extract data from HTML tags and attributes.

Pandas – Pandas is a data analysis library that provides useful data structures like DataFrames for storing structured data. We'll use it to store our scraped movie data in a tabular format.

There are other helpful Python web scraping libraries like Scrapy and Selenium, but for our basic needs, Requests and BeautifulSoup will do the trick.

Scraping Sci-Fi Movies from IMDB

Now let's get to the actual scraping! We'll start by focusing on extracting data from a single IMDB movie page, then expand our scraper to handle multiple pages.

Here's the general process:

  1. Send a GET request to the URL of an IMDB movie page
  2. Parse the HTML of the page using BeautifulSoup
  3. Locate and extract the desired data from the parsed HTML
  4. Store the extracted data in a dictionary
  5. Repeat for additional pages and store all dictionaries in a list
  6. Convert the list of dictionaries to a pandas DataFrame

First, make sure you have the required libraries installed:

pip install requests beautifulsoup4 pandas

Now let's send a request to a sample IMDB movie page and inspect the HTML to see where the data we want is located:

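import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt0816692/?ref_=adv_li_tt'  # Interstellar
# IMDB may reject the default python-requests User-Agent, so send one explicitly.
# This header value is only an example.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early on a 4xx/5xx response
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())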

This prints the page HTML, which we can inspect to locate the tags and attributes containing the data we want to extract. After some inspection, here's the info we'll aim to extract:

  • Title: <h1 class="sc-b73cd867-0 eKrKux">...</h1>
  • Year: <a href="/title/tt0816692/releaseinfo?ref_=tt_ov_rdat"> ... </a>
  • Rating (the content rating, e.g. PG-13): <span class="sc-7ab21ed2-2 kYEdvH">...</span>
  • Genre: <a href="/search/title?genres=adventure&explore=title_type,genres">...</a>
  • Runtime: <li class="ipc-inline-list__item">...</li>
  • IMDB Rating: <span class="sc-7ab21ed2-1 jGRxWM">...</span>
  • Metascore: <span class="score-meta">...</span>
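Before pointing these lookups at the live page, it may help to see how find() and find_all() treat attribute filters. The sketch below runs against a small hand-written HTML string, which is a stand-in rather than IMDB's real markup. It shows that a plain string filter must equal the whole attribute value, while a compiled regular expression matches anywhere inside it.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for a movie page; this is not IMDB's real markup.
html = """
<h1>Interstellar</h1>
<a href="/title/tt0816692/releaseinfo?ref_=tt_ov_rdat">2014</a>
<a href="/search/title?genres=adventure">Adventure</a>
<a href="/search/title?genres=drama">Drama</a>
<a href="/search/title?genres=sci-fi">Sci-Fi</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches.
title = soup.find('h1').text

# A plain string is compared with the entire href, so a partial value matches nothing.
exact = soup.find_all('a', href='/search/title?genres')

# A compiled regex is searched within the href, so it matches every genre link.
genre_links = soup.find_all('a', href=re.compile(r'/search/title\?genres='))
genres = [a.text for a in genre_links]

# The same idea finds the release-info link without hard-coding the movie ID.
year_link = soup.find('a', href=re.compile(r'/releaseinfo'))
year = year_link.text if year_link else None

print(title, year, genres, exact)
```

Matching on the stable part of an href is usually less brittle than copying a full URL or a generated class name from one page.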

We'll use BeautifulSoup's methods like find() and find_all() to extract the data based on these tags and attributes:

import re

# The sc-... class names look auto-generated and are likely to change when IMDB
# updates its pages, so re-check them against the current HTML before running this.
title = soup.find('h1').text
year = soup.find('a', attrs={'href': '/title/tt0816692/releaseinfo?ref_=tt_ov_rdat'}).text.strip('()')
rating = soup.find('span', {'class': 'sc-7ab21ed2-2 kYEdvH'}).text
# A plain string must equal the whole href, so match the genre links with a regex
genres = [i.text for i in soup.find_all('a', attrs={'href': re.compile(r'/search/title\?genres=')})]
runtime = soup.find('li', {'class': 'ipc-inline-list__item'}).text
imdb_rating = float(soup.find('span', {'class': 'sc-7ab21ed2-1 jGRxWM'}).text)
metascore = int(soup.find('span', {'class': 'score-meta'}).text)

# Collect the extracted fields in one dictionary per movie
movie_data = {'title': title,
              'year': year,
              'rating': rating,
              'genre': genres,
              'runtime': runtime,
              'imdb_rating': imdb_rating,
              'metascore': metascore}

print(movie_data)

This extracts the data from the page and stores it in a dictionary. The output:

{'title': 'Interstellar', 'year': '2014', 'rating': 'PG-13', 'genre': ['Adventure', 'Drama', 'Sci-Fi'], 'runtime': '2h 49m', 'imdb_rating': 8.6, 'metascore': 74}
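Note that the year and runtime in this dictionary are strings, which makes them awkward to sort or average later. As a minimal sketch, they could be converted to numbers before analysis. The helper name runtime_to_minutes and the runtime_min key are made up for this example, and the helper assumes runtimes always look like '2h 49m', '2h', or '49m'.

```python
import re

def runtime_to_minutes(runtime):
    """Convert '2h 49m', '2h', or '49m' to total minutes; return None if it does not parse."""
    match = re.fullmatch(r'\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*', runtime)
    if not match or not any(match.groups()):
        return None
    # A missing hours or minutes part counts as zero
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

# Subset of the dictionary printed above
movie_data = {'title': 'Interstellar', 'year': '2014', 'runtime': '2h 49m'}

movie_data['year'] = int(movie_data['year'])
movie_data['runtime_min'] = runtime_to_minutes(movie_data['runtime'])  # hypothetical extra key
print(movie_data)
```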

Now let's scale this up to scrape multiple movie pages. We'll create a list of URLs for the movies we want to scrape, then loop through them and apply our scraping logic to each, storing the extracted data in a list of dictionaries.

To generate a list of sci-fi movie URLs, go to this IMDB search page and copy the URLs for the movies you want to scrape: https://www.imdb.com/search/title/?genres=sci-fi&explore=title_type,genres&ref_=adv_prv

url_list = ['https://www.imdb.com/title/tt0816692/?ref_=adv_li_tt',
            'https://www.imdb.com/title/tt0083658/?ref_=adv_li_tt',
            'https://www.imdb.com/title/tt0499549/?ref_=adv_li_tt',
            # ...
            ]

import re

headers = {'User-Agent': 'Mozilla/5.0'}  # example value; IMDB may reject the default one
scraped_data = []

for url in url_list:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.find('h1').text
    # The release-info href contains the movie ID, so match on the part every page shares
    year = soup.find('a', attrs={'href': re.compile(r'/releaseinfo\?ref_=tt_ov_rdat')}).text.strip('()')
    rating = soup.find('span', {'class': 'sc-7ab21ed2-2 kYEdvH'}).text
    # A plain string must equal the whole href, so match the genre links with a regex
    genres = [i.text for i in soup.find_all('a', attrs={'href': re.compile(r'/search/title\?genres=')})]
    runtime = soup.find('li', {'class': 'ipc-inline-list__item'}).text
    imdb_rating = float(soup.find('span', {'class': 'sc-7ab21ed2-1 jGRxWM'}).text)
    metascore = int(soup.find('span', {'class': 'score-meta'}).text)

    movie_data = {'title': title,
                  'year': year,
                  'rating': rating,
                  'genre': genres,
                  'runtime': runtime,
                  'imdb_rating': imdb_rating,
                  'metascore': metascore}

    scraped_data.append(movie_data)

print(scraped_data)

Running this code will populate our scraped_data list with dictionaries containing the extracted attributes for each movie.
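One caveat: find() returns None when an element is missing, for example on a page with no Metascore, and calling .text on None raises an AttributeError that stops the whole loop. A small helper can turn missing or unparseable fields into None instead. This is a sketch only: text_or_none is a hypothetical name, the HTML is hand-written, and the span.rating selector is a placeholder rather than IMDB's real markup.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, selector, cast=str):
    """Return the text of the first element matching a CSS selector, or None."""
    tag = soup.select_one(selector)
    if tag is None:
        return None  # element not present on this page
    try:
        return cast(tag.get_text(strip=True))
    except ValueError:
        return None  # text present but not convertible, e.g. int('N/A')

# Hand-written page with no Metascore element (placeholder markup).
html = '<h1>Blade Runner</h1><span class="rating">8.1</span>'
soup = BeautifulSoup(html, 'html.parser')

movie_data = {
    'title': text_or_none(soup, 'h1'),
    'imdb_rating': text_or_none(soup, 'span.rating', float),
    'metascore': text_or_none(soup, 'span.score-meta', int),
}
print(movie_data)
```

Pandas treats None as a missing value, so rows with gaps can still be loaded into a DataFrame and filtered later.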

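As a courtesy to IMDB, we should add a short delay between requests to avoid overloading their servers. Call time.sleep() as the last statement inside the for loop so the script pauses after each page:

import time
for url in url_list:
    # ... request, parse, and append as before ...
    time.sleep(1)  # pause one second before the next request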
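The same idea can be wrapped in a reusable function. The sketch below is one possible shape, not part of the original scraper: fetch_all is a hypothetical helper that takes the fetching function and the sleep function as parameters, so it can be tried without touching the network.

```python
import time

def fetch_all(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Offline demonstration: str.upper stands in for a real download function,
# and pauses.append records the delays instead of actually sleeping.
pauses = []
pages = fetch_all(['page-1', 'page-2', 'page-3'], fetch=str.upper, delay=1.0, sleep=pauses.append)
print(pages, pauses)
```

In real use, fetch could be a function that calls requests.get() and returns response.text, with the default sleep left in place.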

Finally, let's store the scraped data in a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(scraped_data)
print(df.head())

This converts our list of dictionaries to a nicely formatted DataFrame that we can analyze further. The first few rows:

          title  year  rating                         genre  runtime  imdb_rating  metascore
0  Interstellar  2014   PG-13    [Adventure, Drama, Sci-Fi]   2h 49m          8.6         74
1  Blade Runner  1982       R       [Action, Drama, Sci-Fi]   1h 57m          8.1         84
2        Avatar  2009   PG-13  [Action, Adventure, Fantasy]   2h 42m          7.8         83

With our data in a structured format, the real fun begins! Here are some ways we could explore this sci-fi movie data further:

Ideas for Analyzing the Scraped Data

  • Create visualizations in Python using matplotlib or seaborn to look at the distribution of IMDB ratings, Metascores, runtimes, and other attributes.
  • Examine how the sci-fi genre has changed over time based on movie metadata in various decades.
  • Use sentiment analysis on the reviews or plot summaries to see how various films are perceived.
  • Combine with box office data scraped from another source like Box Office Mojo to see which sci-fi films were the highest grossing.
  • Scrape data for movies in other genres and compare sci-fi to those genres across metrics like ratings, length, etc.
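As a starting point for the decade and genre ideas, here is a rough sketch using the three sample rows shown earlier. The decade column and the variable names are choices made for this example. With only three movies the numbers mean little, but the same calls work on a larger scrape.

```python
import pandas as pd

# The three sample rows from the DataFrame shown earlier
df = pd.DataFrame([
    {'title': 'Interstellar', 'year': '2014', 'genre': ['Adventure', 'Drama', 'Sci-Fi'],
     'imdb_rating': 8.6, 'metascore': 74},
    {'title': 'Blade Runner', 'year': '1982', 'genre': ['Action', 'Drama', 'Sci-Fi'],
     'imdb_rating': 8.1, 'metascore': 84},
    {'title': 'Avatar', 'year': '2009', 'genre': ['Action', 'Adventure', 'Fantasy'],
     'imdb_rating': 7.8, 'metascore': 83},
])

# The scraped year is a string, so convert it before doing arithmetic
df['year'] = df['year'].astype(int)
df['decade'] = (df['year'] // 10) * 10

# Average IMDB rating per decade
rating_by_decade = df.groupby('decade')['imdb_rating'].mean()

# explode() gives each genre in a list its own row, so genres can be counted
genre_counts = df.explode('genre')['genre'].value_counts()

print(rating_by_decade)
print(genre_counts)
```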

Really the analytical possibilities are endless once you have the data! The hardest part is often acquiring the data in the first place, which is why web scraping is such a useful skill to have.

Tips for Responsible Web Scraping

As fun as it is to obtain data by scraping websites, it's important to do so responsibly to avoid negatively impacting the sites or violating their terms of use. Some tips:

  • Check the site's robots.txt file to see if there are any limitations on what can be scraped. You can parse it with the urllib.robotparser module from Python's standard library.
  • Limit the speed of your requests using time.sleep() to avoid overloading the site's servers. A 1-2 second delay between requests is usually sufficient.
  • If possible, cache the pages you've already scraped so you don't have to re-request them each time you run your code.
  • Use the User-Agent header in your requests to provide information about your script. Some sites may block requests with blank or spammy User-Agent strings.
  • Don't scrape any data behind login walls that isn't public or that the site wouldn't want scraped.
  • Consult the site's Terms of Service to ensure your scraping doesn't violate them. Some sites prohibit all automated access.
  • Use your scraped data for personal educational purposes and analysis, not for commercial purposes or republishing without permission.
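To illustrate the robots.txt tip, here is a minimal sketch using urllib.robotparser from the standard library. The rules, the example.com URLs, and the user-agent string are all made up so the example runs offline. For a real site you would call set_url() with the address of its robots.txt and then read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example needs no network access.
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

user_agent = 'my-scifi-scraper'  # hypothetical name for our script

# can_fetch() reports whether the rules allow this user agent to request a URL
allowed = parser.can_fetch(user_agent, 'https://example.com/title/tt0816692/')
blocked = parser.can_fetch(user_agent, 'https://example.com/private/data')

# crawl_delay() returns the requested pause in seconds, or None if none is set
delay = parser.crawl_delay(user_agent)

print(allowed, blocked, delay)
```

A returned delay can be passed straight to time.sleep() between requests.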

If you follow these guidelines, you can scrape data from sites like IMDB without causing issues. Use your newfound web scraping powers wisely!

Conclusion

In this tutorial, we learned how to use Python libraries like Requests and BeautifulSoup to scrape data on science fiction movies from IMDB. We extracted data like titles, years, ratings, genres and more and stored it in a pandas DataFrame for further analysis.

I encourage you to try scraping some data from IMDB or other sites that interest you and see what insights you can uncover! With the tools and techniques covered here, you're well on your way to becoming a master web scraper.

Thanks for reading, and happy scraping!
