Basic Data Analysis on Twitter with Python: A Comprehensive Guide

Social media has become a rich source of public data that can provide valuable insights when analyzed. Twitter, with its real-time feed of tweets from millions of users across the globe, is a particularly attractive platform for data analysis.

By collecting and analyzing tweet data, we can uncover insights into public sentiment about various topics, identify trending hashtags and content, characterize different types of Twitter users, and much more. Coupled with the power and simplicity of the Python programming language and its expansive ecosystem of libraries, performing basic data analysis on Twitter data is within reach for anyone with some coding skills.

In this comprehensive guide, we'll walk through the process of analyzing Twitter data with Python from start to finish. We'll cover setting up a Python environment, authenticating with the Twitter API, collecting and preprocessing tweet data, performing exploratory analysis and visualization, and discussing some limitations and extensions to this basic analysis. Let's get started!

Setting up Your Python Environment

First, you'll need to set up a Python environment with a few key libraries installed:

  • Tweepy: for authenticating with the Twitter API and collecting tweet data
  • TextBlob: for performing sentiment analysis on tweet text
  • Matplotlib: for visualizing Twitter data
  • Pandas: for data manipulation and analysis

You can install all of these libraries using pip:

pip install tweepy textblob matplotlib pandas
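
You can quickly confirm that everything installed correctly by importing each library and printing its version. Note that TextBlob relies on NLTK corpora for tokenization, which you can fetch with python -m textblob.download_corpora. A minimal check:

import tweepy
import textblob
import matplotlib
import pandas

# Print each library's version to confirm the environment is ready
print("tweepy:", tweepy.__version__)
print("textblob:", textblob.__version__)
print("matplotlib:", matplotlib.__version__)
print("pandas:", pandas.__version__)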

Authenticating with the Twitter API

To access Twitter data programmatically, you need to create a Twitter Developer account and obtain authentication keys and access tokens. Follow these steps:

  1. Go to the Twitter Developer Portal (https://developer.twitter.com/en/portal/dashboard) and apply for a Developer account if you don't have one already. You'll need to describe your use case for analyzing Twitter data.

  2. Once approved, create a new Project and App in the developer dashboard. This will generate the authentication keys you need.

  3. Make a note of your consumer key, consumer secret, access token, and access token secret.

  4. In your Python code, use the Tweepy library to create an authenticated API client:

import tweepy

consumer_key = "your-consumer-key"
consumer_secret = "your-consumer-secret" 
access_token = "your-access-token"
access_token_secret = "your-access-token-secret"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Replace the placeholder values with your actual authentication keys. This api object is now ready to make authenticated requests to collect tweet data.
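
Before collecting data, it's worth confirming that the keys actually work. A minimal sanity check using verify_credentials(), which fetches the authenticated user's profile (the exception class shown is tweepy.TweepyException from Tweepy 4.x; older versions use tweepy.TweepError):

# Confirm the credentials are valid before collecting any data
try:
    me = api.verify_credentials()
    print(f"Authenticated as @{me.screen_name}")
except tweepy.TweepyException as e:
    print(f"Authentication failed: {e}")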

Collecting and Preprocessing Tweet Data

To fetch tweets, you can use the tweepy.Cursor object along with the api.search_tweets() method (named api.search() in Tweepy versions before 4.0). This allows you to specify search keywords and other criteria to filter tweets. For example:

search_query = "coronavirus OR covid19"

# Materialize the Cursor into a list, since it is a one-shot iterator
# and we will loop over the tweets several times below.
tweets = list(tweepy.Cursor(api.search_tweets,
                            q=search_query,
                            lang="en").items(1000))

This code collects up to 1,000 recent English-language tweets containing the keyword "coronavirus" or "covid19" (the OR operator in the query matches either term). Note that the standard search API only covers roughly the last seven days of tweets, so you cannot search arbitrarily far back in time. You can modify the search query and number of tweets to fetch as per your analysis needs.
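
Each item returned is a Tweepy Status object with fields such as text, user, and created_at. A quick way to inspect what you collected (assuming the tweets list from above):

# Peek at the first few results to see what a Status object contains
for tweet in tweets[:3]:
    print(tweet.created_at, "@" + tweet.user.screen_name)
    print(tweet.text)
    print("-" * 40)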

Once you have the raw tweets, the next step is to preprocess the text data to clean and standardize it for analysis. Some common preprocessing steps include:

  • Removing URLs, mentions, special characters, and digits
  • Converting text to lowercase
  • Tokenizing text into individual words
  • Removing stopwords (common words like "the", "and", "is")
  • Stemming or lemmatizing words (reducing words to their base forms)

The TextBlob library provides easy methods to perform many of these steps:

from textblob import TextBlob

text = "This is a raw tweet with a URL https://example.com and a mention @johndoe123. Let's clean it up!"

blob = TextBlob(text)
clean_text = ' '.join(w for w in blob.words if w.isalpha())

print(clean_text)

This keeps only purely alphabetic tokens, stripping out the mention, digits, and punctuation. Note that simple word tokenization can leave behind fragments of the URL (such as "https"), so it is often better to remove URLs with a regular expression before tokenizing.
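
The TextBlob snippet only covers tokenization and filtering. Here is a minimal sketch of a fuller cleaning function that also handles URL and mention removal, lowercasing, and stopword removal, using a small illustrative stopword list (in practice you would likely use NLTK's stopwords corpus):

import re

# A tiny illustrative stopword list; NLTK's stopwords corpus is far more complete
STOPWORDS = {"the", "and", "is", "a", "an", "to", "of", "in", "it", "this"}

def clean_tweet(text):
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove mentions
    text = re.sub(r"[^A-Za-z\s]", "", text)   # remove digits and special characters
    words = text.lower().split()              # lowercase and tokenize
    return " ".join(w for w in words if w not in STOPWORDS)

print(clean_tweet("This is a raw tweet with a URL https://example.com and a mention @johndoe123!"))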

Exploratory Data Analysis on Tweets

With cleaned tweet text data in hand, we can begin to explore it quantitatively. Some basic exploratory analysis techniques include:

  • Word frequencies: Count the most common words across all tweets to get a sense of the major talking points. You can visualize the top words in a bar chart.
from collections import Counter
from textblob import TextBlob
import matplotlib.pyplot as plt

# Flatten all tweets into a single list of lowercased words
all_words = [w.lower() for tweet in tweets for w in TextBlob(tweet.text).words]
word_counts = Counter(all_words)

top_words = word_counts.most_common(20)
print(top_words)

words, counts = zip(*top_words)
plt.bar(range(len(counts)), counts)
plt.xticks(range(len(words)), labels=words, rotation='vertical')
plt.xlabel('Word')
plt.ylabel('Count')
plt.show()
  • Top hashtags and mentions: Similar to word frequencies, look at the most frequently used hashtags and mentions to identify trending topics and key influencers in the discussion.
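As a quick illustration, you can pull hashtags and mentions out of the raw text with simple regular expressions (this assumes the tweets list from earlier; Tweepy Status objects also expose a parsed entities field you could use instead):

import re
from collections import Counter

# Extract hashtags and mentions from the raw tweet text
hashtags = [h.lower() for tweet in tweets for h in re.findall(r"#\w+", tweet.text)]
mentions = [m.lower() for tweet in tweets for m in re.findall(r"@\w+", tweet.text)]

print("Top hashtags:", Counter(hashtags).most_common(10))
print("Top mentions:", Counter(mentions).most_common(10))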

  • Tweet lengths: Compute the distribution of tweet lengths to understand how verbose or terse the tweet discourse is. Twitter's character limit is a key constraint.

import matplotlib.pyplot as plt

tweet_lengths = [len(tweet.text) for tweet in tweets]

plt.hist(tweet_lengths, bins=20)
plt.xlabel('Tweet length')
plt.ylabel('Count')
plt.title('Distribution of Tweet Lengths')
plt.show()
  • Time series analysis: Look at tweet volumes over time to identify any spikes or periodicity in Twitter chatter about a topic. Resample data to a suitable time window like hours or days.
import pandas as pd
import matplotlib.pyplot as plt

data = [[tweet.created_at, tweet.text] for tweet in tweets]
df = pd.DataFrame(data, columns=['date', 'text'])

df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Resample to daily counts; change '1D' to '1H' for hourly granularity
plt.figure(figsize=(12, 5))
df.resample('1D').count().text.plot()
plt.ylabel('Number of Tweets')
plt.title('Tweet Volumes over Time')
plt.show()

Sentiment Analysis on Tweet Text

A key use case of analyzing tweet text data is gauging public sentiment – whether tweets about a topic are generally positive, negative or neutral in tone. While sentiment is complex and depends heavily on context, we can use simple techniques to estimate the overall sentiment of tweet text.

The TextBlob library provides a straightforward API for performing sentiment analysis using a pre-trained classifier under the hood:

from textblob import TextBlob

example_tweet = "This new vaccine is amazing and gives me so much hope! I can't wait for this pandemic to be over."

blob = TextBlob(example_tweet)

print(blob.sentiment)
print(blob.sentiment.polarity)

This outputs something like:

Sentiment(polarity=0.8, subjectivity=0.9)
0.8

The sentiment polarity score ranges from -1 (most negative) to +1 (most positive), with 0 being neutral. We can plot the distribution of polarity scores across all tweets and compute the average to gauge overall public sentiment:

import matplotlib.pyplot as plt

sentiments = [TextBlob(tweet.text).sentiment.polarity for tweet in tweets]

plt.hist(sentiments, bins=20)
plt.xlabel('Sentiment Polarity')
plt.ylabel('Count')
plt.title(f'Sentiment Analysis of {len(sentiments)} Tweets')
plt.show()

print(f"Average sentiment: {sum(sentiments) / len(sentiments):.2f}")
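
Beyond the average, it is often useful to bucket tweets into positive, negative, and neutral groups. A minimal sketch using a simple polarity threshold (the dead zone of 0.05 around zero is an arbitrary choice):

def classify(polarity, neutral_band=0.05):
    # Scores close to zero are treated as neutral; the band width is arbitrary
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

labels = [classify(p) for p in sentiments]
for label in ("positive", "negative", "neutral"):
    print(f"{label}: {labels.count(label) / len(labels):.1%}")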

Limitations and Extensions

This basic Twitter data analysis only scratches the surface of insights you can derive from tweets. Some limitations include:

  • Simple sentiment analysis lacks context and can be inaccurate. More sophisticated NLP methods like aspect-based sentiment analysis may be needed.
  • Tweets are not representative of the general population, only of Twitter users who choose to tweet about a topic, so be careful not to overgeneralize results.
  • Basic word frequency and time series analysis cannot capture the full nuance of conversations on Twitter. Advanced techniques like topic modeling, network analysis, etc. are required for deeper insights.

Depending on your analysis goals, you can extend this basic approach in many ways – collecting more data, engineering custom features, building machine learning models, analyzing images/videos in tweets, and much more. Python's vast ecosystem of libraries provides support for nearly any type of data analysis you can think of.

Conclusion

In this guide, we covered the basics of analyzing Twitter data with Python – from setup and authentication to data collection, preprocessing, exploration, visualization, and sentiment analysis. While just a starting point, these techniques can help you go from raw tweets to valuable insights that can inform brand monitoring, trend detection, voice of customer analysis, and more.

To learn more, check out the official Twitter Developer Platform (https://developer.twitter.com/en) and the documentation for the Python libraries used here:

  • Tweepy: https://docs.tweepy.org
  • TextBlob: https://textblob.readthedocs.io
  • Matplotlib: https://matplotlib.org
  • Pandas: https://pandas.pydata.org

The possibilities are endless when it comes to mining insights from the rich firehose of data that is Twitter. This guide gives you a framework to start exploring this data on your own. Happy analyzing!
