An In-Depth Guide to Part-of-Speech Tagging with Hidden Markov Models

As a developer working on any application that processes raw text data, you've likely come across the task of part-of-speech tagging. Determining whether words are nouns, verbs, adjectives, and so on is a fundamental step in making sense of human language. By understanding the grammatical role of each word, we gain rich information about the meaning of a text that acts as a foundation for further analysis.

In this post, we'll dive deep into part-of-speech (POS) tagging techniques, with a focus on a probabilistic approach known as the Hidden Markov Model (HMM). We'll build an intuitive understanding of the core concepts through examples and analogies. By the end, you'll appreciate the elegance of the HMM approach and how it can be applied to tag parts of speech in a text corpus.

What is Part-of-Speech Tagging?

Part-of-speech tagging is the process of labeling each word in a text with its corresponding part of speech (noun, verb, adjective, etc.). The set of tags to be used depends on the tagging scheme. A simplified universal tag set could look like this:

  • NOUN (nouns)
  • VERB (verbs)
  • ADJ (adjectives)
  • ADV (adverbs)
  • PRON (pronouns)
  • PREP (prepositions)
  • CONJ (conjunctions)

A key challenge in POS tagging is that many words can take on different parts of speech depending on the context. For example, consider the following sentences:

  1. I park my car in the garage.
  2. Let's go to the park for a picnic.

The word "park" acts as a verb in the first sentence and as a noun in the second. As humans, we can easily disambiguate such cases based on the surrounding words. A POS tagger aims to do the same automatically.

Applications of POS Tagging

Before diving into the technical approaches, let's highlight some popular applications of POS tagging:

Text-to-Speech

Words that can function as different parts of speech are often pronounced differently. For example:

  • I refuse to go. (re-FYOOZ, verb)
  • Dispose of the refuse properly. (REF-yooss, noun)

To generate the correct pronunciation, a text-to-speech system must first infer the part of speech of the word based on its context.

Word Sense Disambiguation

Many words have multiple meanings. POS tags can help distinguish between different senses. For instance:

  • The river bank is full of fish. (noun)
  • You can bank on him to get the job done. (verb)

Named Entity Recognition

POS tags provide useful information for identifying named entities like person names (usually nouns/proper nouns), organizations, locations, etc.

Sentiment Analysis

Sentiments are often expressed through adjectives (e.g. "good", "bad") and sometimes adverbs (e.g. "horribly"). POS tags help locate opinion words in a text.

Rule-based POS Tagging

Early POS taggers relied on hand-crafted rules to disambiguate words. For example, a rule could encode that if a word follows "the", it's likely to be a noun.

A popular example is the Brill tagger, which starts by assigning each word its most common POS tag and then applies a sequence of transformation rules to correct mistakes. These rules are learned automatically from a pre-tagged corpus, based on templates like the following (a small sketch of this idea appears right after the list):

  • Change tag a to b when the previous word is tagged z.
  • Change tag a to b when the next word is w.
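
To make this concrete, here is a minimal sketch (not the actual Brill tagger) of a most-frequent-tag baseline corrected by a single rule of the first template above; the lexicon, the tags, and the rule are all made up for the example.

```python
# Hypothetical most-frequent-tag lexicon, made up for illustration.
most_frequent_tag = {"I": "PRON", "fish": "NOUN", "in": "PREP", "the": "DET", "lake": "NOUN"}

# Brill-style template: change tag a to b when the previous word is tagged z.
# Here: change NOUN to VERB when the previous tag is PRON.
rule = {"from": "NOUN", "to": "VERB", "prev_tag": "PRON"}

def tag(words):
    # Step 1: baseline tagging with each word's most frequent tag.
    tags = [most_frequent_tag.get(word, "NOUN") for word in words]
    # Step 2: apply the transformation rule to correct baseline mistakes.
    for i in range(1, len(tags)):
        if tags[i] == rule["from"] and tags[i - 1] == rule["prev_tag"]:
            tags[i] = rule["to"]
    return list(zip(words, tags))

print(tag("I fish in the lake".split()))
# [('I', 'PRON'), ('fish', 'VERB'), ('in', 'PREP'), ('the', 'DET'), ('lake', 'NOUN')]
```

A real Brill tagger learns hundreds of such rules from a tagged corpus and applies them in order.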

While rule-based approaches can achieve decent accuracy, they require significant effort to develop and maintain the rules, especially as new words arise over time. This has motivated the development of probabilistic techniques.

Stochastic POS Tagging with Hidden Markov Models

Probabilistic or stochastic taggers aim to find the tag sequence that is most probable given the observed words. Among these, the Hidden Markov Model (HMM) is the most widely used approach for POS tagging.

To understand how HMMs work, let's first revisit the concept of Markov chains.

Markov Chains: A Quick Refresher

A Markov chain models a system that can be in one of a set of discrete states at any given time, and transitions between these states with certain probabilities.

As an example, let's model the weather in a simple universe that only has three possible states: Sunny, Cloudy, and Rainy. We can represent the probabilities of transitioning from one state to another in a transition matrix:
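
One way to write such a matrix down in code is as a nested dictionary in which each row is a probability distribution over tomorrow's weather. The Sunny row below matches the numbers discussed next; the Cloudy and Rainy rows are values I'm assuming purely for the sketch.

```python
# Transition probabilities P(tomorrow's weather | today's weather).
# Sunny row: from the example in the text; Cloudy and Rainy rows: assumed.
weather_transitions = {
    "Sunny":  {"Sunny": 0.7, "Cloudy": 0.2, "Rainy": 0.1},
    "Cloudy": {"Sunny": 0.3, "Cloudy": 0.4, "Rainy": 0.3},
    "Rainy":  {"Sunny": 0.2, "Cloudy": 0.3, "Rainy": 0.5},
}

# Each row should sum to 1, since it is a distribution over the next state.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in weather_transitions.values())
```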

According to this model, if it's sunny today, there's a 0.7 probability that it will be sunny tomorrow, a 0.2 probability of being cloudy, and a 0.1 probability of rain.

A key property of Markov chains is that the probability of transitioning to a state depends only on the current state, and not on the history of previous states. This is known as the Markov property:

P(State_t | State_{t-1}, …, State_1) = P(State_t | State_{t-1})

In our weather example, this means that the probability of rain tomorrow depends only on today's weather, and not on past days. While this is a simplification of reality, it makes the model computationally tractable.
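
To make the Markov property tangible, here is a small simulation sketch that reuses the weather_transitions dictionary from above: each new day is sampled from the row of the current state only, so earlier days never enter the computation.

```python
import random

def simulate_weather(transitions, start="Sunny", days=10, seed=42):
    """Sample a sequence of weather states from the Markov chain."""
    random.seed(seed)
    state = start
    history = [state]
    for _ in range(days - 1):
        row = transitions[state]  # only the current state matters
        state = random.choices(list(row.keys()), weights=list(row.values()))[0]
        history.append(state)
    return history

print(simulate_weather(weather_transitions))
# e.g. ['Sunny', 'Sunny', 'Cloudy', ...] -- the exact output depends on the seed
```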

From Markov Chains to Hidden Markov Models

In a regular Markov model, the states are directly observable. But in a Hidden Markov Model, the states are "hidden" and only emit observations with certain probabilities.

Let's make this concrete with an analogy. Suppose you want to model whether your friend Alice is happy or sad on any given day, but you can only observe her actions, like whether she smiles or frowns. In this case:

  • The hidden states are HAPPY and SAD
  • The observations are SMILE and FROWN

We can represent the emission probabilities (the probability of an observation given a state) in a matrix:
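
As code, the emission matrix is again just a nested dictionary. The HAPPY row below matches the numbers discussed next; the SAD row is an assumption made for the sketch.

```python
# Emission probabilities P(observation | hidden state).
# HAPPY row: from the example in the text; SAD row: assumed for illustration.
emissions = {
    "HAPPY": {"SMILE": 0.8, "FROWN": 0.2},
    "SAD":   {"SMILE": 0.1, "FROWN": 0.9},
}
```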

According to this, when Alice is happy, she smiles with a probability of 0.8 and frowns with a probability of 0.2.

The transition probabilities between states can be represented as before:
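
For instance (both rows are values assumed for the sketch, in the same format as the weather example):

```python
# Transition probabilities P(today's mood | yesterday's mood).
# All values here are assumed purely for illustration.
transitions = {
    "HAPPY": {"HAPPY": 0.7, "SAD": 0.3},
    "SAD":   {"HAPPY": 0.6, "SAD": 0.4},
}
```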

This completes the specification of our toy HMM. Given a sequence of observations (e.g. SMILE, FROWN, SMILE, SMILE), we can use the model to infer the most likely sequence of hidden states (e.g. HAPPY, SAD, HAPPY, HAPPY).
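
Before looking at how to do this efficiently, here is a brute-force sketch for the toy model: score every possible state sequence using the emissions and transitions dictionaries above, plus an assumed uniform starting distribution (the example doesn't specify one), and keep the best.

```python
from itertools import product

observations = ["SMILE", "FROWN", "SMILE", "SMILE"]
states = ["HAPPY", "SAD"]
initial = {"HAPPY": 0.5, "SAD": 0.5}  # assumed uniform start

def sequence_probability(state_seq, obs_seq):
    # P(states, observations) = P(s_1) * P(o_1|s_1) * prod_t P(s_t|s_{t-1}) * P(o_t|s_t)
    prob = initial[state_seq[0]] * emissions[state_seq[0]][obs_seq[0]]
    for t in range(1, len(obs_seq)):
        prob *= transitions[state_seq[t - 1]][state_seq[t]]
        prob *= emissions[state_seq[t]][obs_seq[t]]
    return prob

# Enumerate all 2^4 = 16 possible state sequences and pick the most likely.
best = max(product(states, repeat=len(observations)),
           key=lambda seq: sequence_probability(seq, observations))
print(best)  # with the assumed numbers above: ('HAPPY', 'SAD', 'HAPPY', 'HAPPY')
```

This works for four observations, but the number of candidate sequences doubles with every additional step, which is exactly why we need a smarter decoding algorithm.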

The Three Fundamental Problems in HMMs

There are three key problems that any HMM system must solve:

  1. Given an HMM (states, observations, transition and emission probabilities) and an observation sequence, find the probability of the observation sequence. This is the evaluation problem, typically solved using the forward algorithm.

  2. Given an HMM and an observation sequence, find the most likely state sequence that produced those observations. This is the decoding problem, and is typically solved using the Viterbi algorithm.

  3. Given just an observation sequence, learn the parameters (transition and emission probabilities) of the HMM. This is the learning problem, and is typically solved using the Baum-Welch algorithm.

For POS tagging, we're most interested in the decoding problem – given a sequence of words, we want to find the most likely sequence of POS tags.

Formulating POS Tagging as an HMM

To apply HMMs for POS tagging, we make the following correspondences:

  • Hidden states: The POS tags (e.g. NOUN, VERB, ADJ)
  • Observations: The words in the text
  • Transition probabilities: P(Tag_i | Tag_{i-1})
  • Emission probabilities: P(Word_i | Tag_i)

For example, the transition probability P(NOUN | VERB) represents the probability that a word is a noun given that the previous word is a verb. The emission probability P("dog" | NOUN) represents the probability of seeing the word "dog" given that the tag is a noun.

To actually perform POS tagging with an HMM, we:

  1. Estimate the transition and emission probabilities from a labeled training corpus. This is the learning problem; with a labeled corpus it boils down to counting, as sketched right after this list.

  2. For a new sequence of words, find the tag sequence that maximizes the joint probability of the words and tags, using the estimated probabilities. This is the decoding problem, solved using the Viterbi algorithm.
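
Here is a minimal sketch of step 1: maximum-likelihood estimates obtained by counting tag-to-tag and tag-to-word events in a tiny hand-tagged corpus. The corpus and tags are made up for the example, and a real system would also need smoothing for unseen words and tag pairs.

```python
from collections import defaultdict, Counter

# A tiny hand-tagged corpus, made up for illustration.
tagged_corpus = [
    [("I", "PRON"), ("park", "VERB"), ("my", "PRON"), ("car", "NOUN")],
    [("we", "PRON"), ("love", "VERB"), ("this", "PRON"), ("park", "NOUN")],
]

transition_counts = defaultdict(Counter)  # tag -> counts of the next tag
emission_counts = defaultdict(Counter)    # tag -> counts of emitted words

for sentence in tagged_corpus:
    prev_tag = "<s>"  # sentence-start marker
    for word, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word.lower()] += 1
        prev_tag = tag

def normalize(counter):
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

# Maximum-likelihood estimates of the HMM parameters.
transition_probs = {t: normalize(nxt) for t, nxt in transition_counts.items()}
emission_probs = {t: normalize(words) for t, words in emission_counts.items()}

print(emission_probs["NOUN"])    # {'car': 0.5, 'park': 0.5}
print(transition_probs["PRON"])  # {'VERB': 0.5, 'NOUN': 0.5}
```

Note how "park" is emitted by both VERB and NOUN in this tiny corpus, which is precisely the kind of ambiguity the decoder has to resolve.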

Viterbi Algorithm: Decoding the Most Likely State Sequence

The Viterbi algorithm is a dynamic programming approach to find the most likely state sequence in an HMM, given an observation sequence.

The key idea is to compute, for each state and each time step t, the probability of the most likely state sequence that ends in that state and accounts for the first t observations. These probabilities are computed iteratively, using the transition and emission probabilities.

Instead of considering all possible state sequences (which grows exponentially with the sequence length), the Viterbi algorithm only keeps track of the best sequence ending in each state at each step. This makes it efficient to find the overall most likely state sequence.
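
Here is a compact sketch of the algorithm for the kind of HMM described above. It takes dictionaries of start, transition, and emission probabilities as input, ignores smoothing and unseen words to stay short, and is demonstrated on the toy Alice model (with the same assumed numbers as earlier).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s]: probability of the best path ending in state s after t+1 observations
    # back[t][s]: the predecessor state on that best path (for backtracking)
    best = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the probability of reaching s.
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p].get(s, 0.0))
            best[t][s] = (best[t - 1][prev] * trans_p[prev].get(s, 0.0)
                          * emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy Alice model: the HAPPY emission row comes from the text; everything else
# is assumed for illustration.
states = ["HAPPY", "SAD"]
start = {"HAPPY": 0.5, "SAD": 0.5}
trans = {"HAPPY": {"HAPPY": 0.7, "SAD": 0.3}, "SAD": {"HAPPY": 0.6, "SAD": 0.4}}
emit = {"HAPPY": {"SMILE": 0.8, "FROWN": 0.2}, "SAD": {"SMILE": 0.1, "FROWN": 0.9}}

print(viterbi(["SMILE", "FROWN", "SMILE", "SMILE"], states, start, trans, emit))
# ['HAPPY', 'SAD', 'HAPPY', 'HAPPY']
```

For POS tagging, the states are the tags, the observations are the words, and the start, transition, and emission dictionaries come from the counts estimated on a labeled corpus. In practice the products are computed in log space to avoid numerical underflow on long sentences.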

Conclusion

Part-of-speech tagging is a critical task in natural language processing, with applications ranging from text-to-speech to sentiment analysis. While early approaches relied on hand-crafted rules, probabilistic models like Hidden Markov Models have become the standard, thanks to their ability to handle uncertainty and to learn from labeled data.

HMMs provide an elegant framework to model POS tagging, by treating the tags as hidden states and the words as observations. The Viterbi algorithm allows us to efficiently decode the most likely tag sequence for a given word sequence.

Of course, HMMs are not perfect. They make strong independence assumptions (the Markov property) that don't always hold in real language. More advanced models like Maximum Entropy Markov Models and Conditional Random Fields can relax these assumptions.

Nonetheless, understanding HMMs provides a solid foundation for working with sequential data, not just in NLP but also in fields like bioinformatics and speech recognition. I hope this post has demystified some of the core concepts and piqued your curiosity to dive deeper into the fascinating world of probabilistic models!
