Text Classification and Prediction Using the Bag-of-Words Approach

As a full-stack developer working with textual data, one of the most common tasks you'll encounter is text classification – automatically assigning predefined categories to documents or pieces of text. Some example applications include:

  • Sentiment analysis: Classifying reviews, tweets, or comments as positive, negative, or neutral
  • Topic categorization: Organizing news articles or web pages by subject matter
  • Spam filtering: Detecting and filtering out spam emails
  • Author identification: Attributing documents to specific authors based on writing style

While there are many different approaches to text classification, one of the simplest and most widely used is the bag-of-words model. In this article, we'll dive into how the bag-of-words model works, walk through a step-by-step example of using it for text classification in Python, discuss its strengths and weaknesses, and explore some extensions and alternatives. Let's get started!

What is the Bag-of-Words Model?

The bag-of-words (BoW) model is a way of representing text data when modeling text with machine learning algorithms. The basic idea is:

  1. Break text into individual tokens (words, phrases, or symbols)
  2. Build a vocabulary of known words from your training corpus
  3. Model each document by counting the number of times each word appears
  4. Discard word order and grammar – each document becomes a "bag" of independent words

As an example, consider these two simple documents:

  • D1: "John likes to watch movies. Mary likes movies too."
  • D2: "John also likes to watch football games."

Based on these documents, our vocabulary is: [John, likes, to, watch, movies, Mary, too, also, football, games]. We can then represent each document as a vector of word counts:

D1: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
D2: [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

Each position in the vector corresponds to a word in our vocabulary, and any word not present in a given document gets a count of zero. Even though the ordering of words is lost, we can still see that these two documents share some words and differ in others.
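
To make this concrete, here is a minimal sketch of building these count vectors by hand with Python's collections.Counter. The naive tokenizer and the hand-picked vocabulary order are just for illustration; running it reproduces the D1 and D2 vectors shown above:

from collections import Counter

# The two example documents from above
docs = {
    "D1": "John likes to watch movies. Mary likes movies too.",
    "D2": "John also likes to watch football games.",
}

# Vocabulary listed in the same (arbitrary) order used above
vocab = ["John", "likes", "to", "watch", "movies",
         "Mary", "too", "also", "football", "games"]

for name, text in docs.items():
    # Very naive tokenization: strip the period and split on whitespace
    tokens = text.replace(".", "").split()
    counts = Counter(tokens)
    print(name, [counts[word] for word in vocab])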

The beauty of the BoW model is in its simplicity – we can take complex, messy textual data and represent it in a structured way that machine learning models can work with, while preserving enough information to be useful for classification tasks.

Implementing Bag-of-Words in Python

Now that we understand how the bag-of-words model works conceptually, let's see how to implement it in Python and use it for a basic text classification task. We'll use the 20 Newsgroups dataset, a collection of roughly 20,000 newsgroup documents across 20 different topics.

Step 1: Load and Preprocess Data

First we'll load the dataset and do some light preprocessing – removing headers, footers, and quotes so we're left with just the content of each document:

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

print(f"Training set size: {len(newsgroups_train.data)}")
print(f"Test set size: {len(newsgroups_test.data)}")

Output:

Training set size: 2257
Test set size: 1502

Step 2: Tokenize Text and Build Vocabulary

Next, we'll convert each document from raw text into a list of tokens. We'll use scikit-learn's CountVectorizer to tokenize the text and build a vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', min_df=5)
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

Output:

Vocabulary size: 25719
X_train shape: (2257, 25719) 
X_test shape: (1502, 25719)

Here, the CountVectorizer:

  • Tokenizes each document into individual words
  • Builds a vocabulary of all words that appear in at least 5 documents (min_df=5)
  • Filters out common English stop words like "the", "and", "of" (stop_words='english')
  • Learns the vocabulary on the training set (fit_transform) then transforms the test set into vectors using that vocabulary (transform)

The result is that each document has been converted into a vector with 25,719 elements, one for each word in the vocabulary, where the value is the count of how many times that word appears. These vectors are stored as sparse matrices to save space.
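
If you want to peek inside this representation, you can look up the column index assigned to any word in the learned vocabulary and inspect its counts directly. Here is a small sketch continuing from the code above; the word "graphics" is an arbitrary example and will only be present if it survived the min_df filter:

# Find the column assigned to an example word and count its occurrences
word = "graphics"
if word in vectorizer.vocabulary_:
    col = vectorizer.vocabulary_[word]
    print(f"Column index for '{word}': {col}")
    print(f"Occurrences of '{word}' in the training set: {X_train[:, col].sum()}")

# How sparse are the vectors? Fraction of non-zero entries in X_train
print(f"Non-zero entries: {X_train.nnz / (X_train.shape[0] * X_train.shape[1]):.4%}")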

Step 3: Train a Classifier

Now that we have our textual data encoded in a structured format, we can train a classifier. We'll use multinomial Naive Bayes, a simple probabilistic classifier often used for text classification:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

model = MultinomialNB()
model.fit(X_train, newsgroups_train.target)

y_pred = model.predict(X_test)
print(classification_report(newsgroups_test.target, y_pred, 
      target_names=newsgroups_test.target_names))

Output:

                        precision    recall  f1-score   support

           alt.atheism       0.86      0.85      0.86       319
         comp.graphics       0.75      0.91      0.82       389
               sci.med       0.90      0.81      0.85       396
soc.religion.christian       0.88      0.78      0.83       398

              accuracy                           0.84      1502
             macro avg       0.85      0.84      0.84      1502
          weighted avg       0.85      0.84      0.84      1502

With no tuning, our simple bag-of-words model achieves 84% accuracy on the test set! The precision, recall and F1 scores are also quite good for each individual category.

Of course, this is a simplified example on a relatively small dataset – in practice, you'll want to evaluate multiple models, tune hyperparameters, deal with class imbalances, etc. But this demonstrates the power of the bag-of-words representation for text classification.
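
As a rough illustration of that next step, here is one way you might tune the Naive Bayes smoothing parameter with cross-validation using a scikit-learn Pipeline. This continues from the imports above, and the alpha values in the grid are arbitrary choices for illustration, not tuned recommendations:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english", min_df=5)),
    ("clf", MultinomialNB()),
])

# Candidate smoothing values: purely illustrative
param_grid = {"clf__alpha": [0.01, 0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(newsgroups_train.data, newsgroups_train.target)

print(f"Best alpha: {search.best_params_['clf__alpha']}")
print(f"Best cross-validation accuracy: {search.best_score_:.3f}")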

Advantages and Disadvantages of Bag-of-Words

So when is bag-of-words the right choice for modeling text? Let's look at some of its key strengths and weaknesses.

Advantages of bag-of-words:

  • Simplicity: BoW is conceptually simple and easy to implement, making it a good baseline approach
  • Efficiency: Transforming text to BoW vectors and training models like Naive Bayes is computationally efficient
  • Effectiveness: Despite its simplicity, BoW is surprisingly effective for many document-level classification tasks

Disadvantages of bag-of-words:

  • Loss of word order: By treating documents as unordered collections, BoW discards potentially meaningful information about the sequence and relationships between words
  • Loss of semantics: BoW considers words as independent features and can't account for synonyms, polysemy, or other semantic relationships
  • Sparsity and high dimensionality: Depending on your vocabulary size, BoW vectors can become very high dimensional and sparse (lots of zeros), which can be a challenge for some machine learning algorithms

Improving on Bag-of-Words

There are a number of ways we can improve on the basic bag-of-words model while preserving its core simplicity:

  • TF-IDF weighting: Rather than just counting words, weight them by their term frequency-inverse document frequency (TF-IDF). This gives more weight to words that are frequent in a document but rare across the corpus.

  • N-grams: Instead of just single word tokens, use n-grams (e.g. bigrams, trigrams) to capture some local word order and phrases. For example, "not good" and "good" convey very different meanings that unigram BoW would miss. (A short sketch combining TF-IDF and bigrams follows this list.)

  • Feature selection: Reduce dimensionality and sparsity by selecting only the most informative words as features. This could be based on frequency thresholds, mutual information with the target class, or L1 regularization.

  • Combine with other features: BoW can be combined with other feature types like parts-of-speech tags, named entities, or metadata (e.g. document length, source, author) to provide additional signal.
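
As a rough sketch of the first two ideas, you could swap CountVectorizer for scikit-learn's TfidfVectorizer and ask it to include bigrams. This continues from the code above, and the specific settings are illustrative rather than tuned:

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighting plus unigrams and bigrams; min_df keeps the vocabulary manageable
tfidf_vectorizer = TfidfVectorizer(stop_words="english", min_df=5, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(newsgroups_train.data)
X_test_tfidf = tfidf_vectorizer.transform(newsgroups_test.data)

tfidf_model = MultinomialNB()
tfidf_model.fit(X_train_tfidf, newsgroups_train.target)
print(f"Test accuracy with TF-IDF bigrams: {tfidf_model.score(X_test_tfidf, newsgroups_test.target):.3f}")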

Beyond Bag-of-Words

While bag-of-words can be surprisingly effective, some of its fundamental limitations like loss of word order and semantics can hinder performance on more complex NLP tasks. In recent years, neural network-based approaches using dense vector representations like word embeddings (word2vec, GloVe) and pre-trained language models (BERT, GPT) have become the state-of-the-art for many text classification problems.

These learned vector representations can capture rich semantic and syntactic relationships between words and be fine-tuned for specific tasks, often outperforming BoW-based models. However, they also require much more training data, compute power, and model complexity.

So when choosing between BoW, word embeddings, or a pre-trained model like BERT for a given text classification problem, consider the complexity of your task, the size of your dataset, and the computational resources you have available. BoW remains a solid baseline and is often the right choice for simpler tasks.

Conclusion

The bag-of-words model is a powerful, intuitive approach for representing textual data in a structured format suitable for machine learning. While it has limitations, it remains a go-to method for many document classification tasks due to its simplicity, efficiency, and effectiveness. I hope this article has given you a practical understanding of how to implement BoW for text classification in Python and some insights into when and how to use it in your own NLP projects.

To learn more, check out the scikit-learn documentation on text feature extraction and this excellent introduction to the bag-of-words model from the book Feature Engineering for Machine Learning.

Happy coding and may your text classifiers achieve high precision and recall! Let me know in the comments if you have any questions or insights to share.
