Web Scraping with Python – How to Scrape Data from Twitter using Tweepy and Snscrape

Twitter has become a goldmine of data for developers, data scientists, and researchers looking to extract insights from social media. With over 330 million monthly active users generating a constant stream of tweets on every topic imaginable, Twitter offers a wealth of real-time information just waiting to be scraped and analyzed.

In this expert guide, we'll dive into the world of web scraping with Python and explore two powerful libraries, Tweepy and Snscrape, for extracting data from Twitter. Whether you're a data journalist investigating trending topics, a machine learning engineer building models to detect hate speech, or a marketer tracking brand sentiment, knowing how to scrape tweets can be an invaluable skill. Let's get started!

What is Web Scraping?

Web scraping refers to the automated process of extracting data from websites. Essentially, a web scraper is a piece of software (like a Python script) that sends HTTP requests to a web server, retrieves the HTML source code of web pages, and parses that code to extract the desired data, usually in a structured format like JSON, CSV, or a database.
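
To make this concrete, here's a minimal sketch of that request-parse-extract loop using the popular requests and BeautifulSoup libraries (the URL and the <h2> selector are placeholders for illustration):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every <h2> heading on the page
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))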

While web APIs often provide a convenient way to access web data, not all websites offer APIs. And even when APIs are available, they may have limitations in terms of the amount of historical data or the rate at which you can make requests. Web scraping allows you to flexibly extract the precise data you need.

Why Scrape Data from Twitter?

So what kinds of insights can we glean by scraping tweets? Here are just a few examples:

  • Sentiment analysis: Gauge public opinion and reactions to a specific topic, event, or brand by analyzing the sentiment of tweets. Positive or negative sentiment can be detected using machine learning techniques applied to the text of tweets (a small example follows this list).

  • Trend tracking: Identify trending topics, hashtags, and influencers by monitoring the volume and virality of tweets over time. This can be valuable for brand monitoring, journalism, and social science research.

  • Network analysis: Examine the relationships and interactions between Twitter users by constructing retweet networks and mention/reply networks. Network analysis can reveal communities of users and patterns of information dissemination.

  • Content analysis: Analyze the actual content of tweets, including the text, URLs, images, and videos. This can involve applying natural language processing to extract entities, topics, and keywords to characterize the conversation.

  • Geospatial analysis: Many tweets contain geolocation metadata indicating where the user was when the tweet was posted. By collecting and mapping geotagged tweets, we can visualize the geographic distribution of the conversation.
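
As a small taste of the sentiment analysis mentioned above, here's a sketch using NLTK's VADER analyzer (assuming NLTK is installed; the sample tweet texts are made up for illustration):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the sentiment lexicon

analyzer = SentimentIntensityAnalyzer()
sample_tweets = ["I love the new Python release!", "Worst update ever, nothing works."]

for text in sample_tweets:
    scores = analyzer.polarity_scores(text)
    # 'compound' ranges from -1 (most negative) to +1 (most positive)
    print(f"{scores['compound']:+.2f}  {text}")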

The possibilities are endless; your imagination and analytical goals are the limit! Throughout this guide, we'll see how to actually obtain tweet data to enable these types of analyses.

Tweepy: Accessing the Twitter API

Tweepy is an easy-to-use Python library for accessing the official Twitter API. The Twitter API is a web API that allows developers to programmatically interact with Twitter's platform to do things like post tweets, read user profile information, and, of course, search and retrieve tweets. Here's how to get started with Tweepy:

  1. Install Tweepy using pip:

    pip install tweepy

  2. Sign up for a Twitter Developer account at https://developer.twitter.com/en/apply-for-access. You'll need to create a new Twitter app, which will give you the API credentials needed to authenticate with the API.

  3. Once your app is created, obtain your consumer key, consumer secret, access token, and access token secret from the app dashboard. Keep these confidential.

  4. Pass your API credentials to Tweepy to create an API client:

import tweepy

consumer_key = "YOUR_CONSUMER_KEY" consumer_secret = "YOUR_CONSUMER_SECRET" access_token = "YOUR_ACCESS_TOKEN" access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret) auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

  5. Use the various methods provided by the Tweepy API client to search, retrieve, and process tweets. For example, to search for the 100 most recent tweets mentioning "python":
# Search for recent tweets matching a query (method name as of Tweepy v4)
tweets = api.search_tweets(q="python", count=100)
for tweet in tweets:
    print(tweet.text)

Tweepy provides access to almost all of Twitter's API functionality, including retrieving user timelines, posting tweets, following users, and more. Check out the Tweepy documentation for complete details: https://docs.tweepy.org/en/stable/
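
For instance, building on the api client created above, a couple of other commonly used calls look like this (the screen name is a placeholder):

# Fetch a user's 10 most recent tweets
for tweet in api.user_timeline(screen_name="SomeUser", count=10):
    print(tweet.text)

# Paginate beyond a single page of search results with a Cursor
for tweet in tweepy.Cursor(api.search_tweets, q="python").items(200):
    print(tweet.text)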

However, the Twitter API has some limitations to keep in mind:

  • Strict rate limits on how many requests you can make per 15-minute window (see the snippet after this list)
  • Many API calls are restricted to the past 7 days of data only
  • Certain API calls are only available with the paid Premium or Enterprise tiers
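
Tweepy can at least soften the first of these limitations: constructing the client with wait_on_rate_limit=True makes it sleep until the rate-limit window resets instead of raising an error:

# Sleep automatically when a rate limit is hit, rather than failing
api = tweepy.API(auth, wait_on_rate_limit=True)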

Sometimes you need more historical data or higher rate limits than the Twitter API offers. That's where Snscrape comes in.

Snscrape: Scraping Twitter without the API

Snscrape is another Python library for social media scraping, but unlike Tweepy, it doesn't use the official Twitter API. Instead, it scrapes data directly from the twitter.com website. This means that Snscrape isn't subject to the same rate limits and time restrictions as the Twitter API.

  1. Install Snscrape via pip:

    pip install snscrape

  2. To scrape tweets with Snscrape, you don't need to set up a Twitter Developer account or obtain API credentials. You can simply import the library and start scraping right away:

import snscrape.modules.twitter as sntwitter

maxTweets = 1000

# Iterate over matching tweets, stopping once maxTweets have been printed
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('python since:2020-06-01 until:2020-07-31').get_items()):
    if i >= maxTweets:
        break
    print(tweet.content)

This code snippet scrapes up to the 1,000 most recent tweets containing the keyword "python" that were posted between June 1, 2020 and July 31, 2020.

The syntax for the search query in Snscrape is quite flexible and supports many advanced search operators, similar to the search syntax on the twitter.com website. You can filter tweets by keyword, hashtag, username, mention, date range, geographic location, and more.
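
For example, here are a few query strings of the kind you can pass to TwitterSearchScraper (the username and location are illustrative, and exact operator support can vary by snscrape version):

# Tweets from a specific account within a date range
sntwitter.TwitterSearchScraper('from:jack since:2020-01-01 until:2020-06-01')

# Tweets with a hashtag, excluding retweets
sntwitter.TwitterSearchScraper('#python exclude:retweets')

# Tweets posted near a location
sntwitter.TwitterSearchScraper('python near:"London" within:10km')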

Because the tweets scraped by Snscrape aren't coming through the official API, they contain somewhat different fields and metadata than API tweets. But Snscrape still provides most of the essential data, like the tweet text, author, timestamp, number of likes and retweets, etc.
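
For example, continuing with the sntwitter import from above, you can read fields like these on a single scraped tweet (attribute names reflect recent snscrape versions and may differ in yours):

# Inspect the main fields on one scraped tweet
tweet = next(sntwitter.TwitterSearchScraper('python').get_items())
print(tweet.date)           # when the tweet was posted
print(tweet.content)        # the tweet text
print(tweet.user.username)  # the author's handle
print(tweet.likeCount, tweet.retweetCount)  # engagement counts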

  3. Often it's useful to store the scraped tweets in a structured format for further analysis. We can use the Pandas library to store the tweets in a dataframe:
import pandas as pd

tweets_list = []
maxTweets = 1000

# Collect selected fields from each matching tweet into a list of rows
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('python since:2020-06-01 until:2020-07-31').get_items()):
    if i >= maxTweets:
        break
    tweets_list.append([tweet.date, tweet.id, tweet.content, tweet.username])

tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])

Now we have a Pandas dataframe containing up to 1,000 tweets, with columns for the datetime, tweet ID, text, and username. We can save this dataframe to a CSV file or database for future analysis.
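
Saving the dataframe is a one-liner. For example, to write it out as a CSV file:

# Write the scraped tweets to a CSV file, omitting the dataframe index
tweets_df.to_csv('tweets.csv', index=False)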

Tweepy vs Snscrape: Which to Use?

So when should you use Tweepy and when should you use Snscrape? It depends on your specific needs and constraints.

Reasons to use Tweepy:

  • You need access to the full range of Twitter API functionality, like posting tweets, sending direct messages, or managing lists and followers
  • You want the tweet data to be in the standardized format provided by the API, with all the official metadata fields
  • You're building an application that needs to make requests continuously and stay under the rate limits

Reasons to use Snscrape:

  • You need to scrape a large number of historical tweets beyond the 7-day limit of the standard Twitter API
  • You want to avoid the rate limits and access restrictions of the API
  • You only need basic tweet data like the text, author, and datetime; the extra metadata provided by the API isn't necessary for your analysis

Whichever method you choose, there are some important guidelines to follow when scraping Twitter data:

  1. Respect Twitter's terms of service and developer agreement. Don't abuse the API or scrape data in a way that violates Twitter's rules.

  2. Store and handle scraped tweet data securely and protect users' privacy. If you're going to publicly share an analysis based on scraped tweets, it's usually best to aggregate the data rather than expose individual users' tweet histories.

  3. Don't overwhelm Twitter's servers with too many requests too quickly. Add delays or throttling between your requests to avoid getting your IP address banned (a simple sketch follows this list).

  4. Use the data for good! Scraped tweets can fuel powerful insights and research, but can also be misused for spam, harassment, or other malicious purposes. Be ethical and socially responsible in your Twitter scraping projects.
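
As a sketch of the throttling idea in point 3, you can simply pause between batches of requests (search_queries and run_scrape are hypothetical stand-ins for your own query list and scraping routine):

import time

for query in search_queries:
    run_scrape(query)  # hypothetical wrapper around your scraping logic
    time.sleep(5)      # pause a few seconds between requests to be polite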

Conclusion

Web scraping is a valuable technique for data scientists to have in their toolkit, and scraping Twitter is a great way to get real-time data on almost any topic imaginable. With Python libraries like Tweepy and Snscrape, you can flexibly obtain tweets either through the official API or with direct web scraping.

We've only scratched the surface of what's possible with Twitter data in this guide. By cleaning, analyzing, and visualizing tweets, you can derive meaningful insights into public opinion, track the spread of news and rumors, and even build predictive models (e.g. for sentiment or virality). You might analyze tweets to measure the social impact of a major event, to understand regional differences in how a political issue is discussed, or to detect the outbreak of a disease.

Whatever your goals, happy scraping and analyzing! The Twitterverse awaits you.
