Natural Language Processing (NLP) Tutorial with Python & NLTK

Natural language processing, or NLP for short, is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models to process human language and speech.

NLP has many exciting real-world applications, including:

  • Chatbots and virtual assistants (like Siri or Alexa)
  • Sentiment analysis (determining the emotion of text)
  • Machine translation (automatically translating between languages)
  • Automatic summarization (generating summaries of long articles)
  • Named entity recognition (identifying people, places, and organizations mentioned in text)
  • Question answering
  • Text classification (organizing text into categories)

In this tutorial, we'll walk through the main steps of processing text with NLP, using the Python programming language and the Natural Language Toolkit (NLTK) library. By the end, you'll have a solid understanding of the fundamentals of NLP and be able to apply these techniques to your own projects. Let's dive in!

Main Steps in NLP

There are five core steps in the NLP process:

  1. Tokenization
  2. Stemming
  3. Lemmatization
  4. Part-of-speech tagging
  5. Named entity recognition

Let's explore each of these in more detail.

Tokenization

Tokenization is the process of splitting text into words, phrases, symbols, or other meaningful elements called tokens. The goal of tokenization is to break raw text into the basic units that every later processing step operates on.

For example, consider the sentence: "NLP is an interesting field." Tokenizing this sentence would result in the following list of tokens: ['NLP', 'is', 'an', 'interesting', 'field', '.'].

Stemming

Stemming is the process of reducing words to their word stem, base, or root form. For example, the words "learns", "learning", and "learned" would all be reduced to the stem "learn".

The main goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. This base form doesn't need to be a valid English word. For example, the words "computation" and "compute" reduce to the stem "comput".

Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. A lemma is the canonical form of a set of words.

For example, the words "is", "was", "were", and "am" are all forms of the lemma "be". Similarly, "great" and "greater" both have the lemma "great", while the irregular forms "good" and "better" both have the lemma "good".

The main difference between stemming and lemmatization is that lemmas are actual words that appear in the dictionary, while stems may not be actual words.
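
To see the contrast in code, here is a minimal sketch comparing NLTK's Porter stemmer with its WordNet lemmatizer on the irregular adjective "better" (the lemmatizer needs the WordNet data package, covered in the setup section below, and the pos="a" argument tells it to treat the word as an adjective):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer can only strip suffixes, so "better" passes through unchanged;
# the lemmatizer looks the word up in WordNet and returns its dictionary form.
print(stemmer.stem("better"))                   # better
print(lemmatizer.lemmatize("better", pos="a"))  # good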

Part-of-speech Tagging

Part-of-speech tagging, or POS tagging for short, is the process of assigning a part of speech (noun, verb, adjective, and so on) to each token in the text.

POS tagging can provide additional information about the meaning of words and how they relate to each other. For example, consider the sentence "The dog chased the cat." A POS tagger would assign the following tags:

The (determiner) dog (noun) chased (verb) the (determiner) cat (noun).

Named Entity Recognition

Named entity recognition (NER) is the process of locating and classifying named entities in text into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, percentages, etc.

For example, given the sentence "Microsoft was founded by Bill Gates and Paul Allen in 1975.", an NER system should be able to identify that "Microsoft" is an organization, "Bill Gates" and "Paul Allen" are people, and "1975" is a time.

NER can help extract key information from unstructured text and is used in many applications like information retrieval, question answering, and text summarization.

NLP with Python and NLTK

Now that we've covered the main concepts in NLP, let's see how to implement them using Python and the Natural Language Toolkit (NLTK).

NLTK is an open source Python library for NLP. It provides easy-to-use interfaces and a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more.

Setting up your environment

Before we dive into the code examples, let's make sure you have Python and NLTK set up on your machine. If you haven't installed Python yet, you can download it from the official Python website: https://www.python.org/downloads/

Once you have Python installed, you can install NLTK using pip, the Python package installer. Open a terminal or command prompt and run:

pip install nltk
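
NLTK's tokenizers, taggers, and chunkers rely on data packages (trained models and corpora) that ship separately from the library itself. Here is a minimal sketch of the downloads the examples below depend on; package names can vary slightly between NLTK versions, and running nltk.download() with no arguments opens an interactive downloader instead:

import nltk

nltk.download("punkt")                       # tokenizer models (word_tokenize)
nltk.download("wordnet")                     # WordNet data (WordNetLemmatizer)
nltk.download("omw-1.4")                     # multilingual WordNet data
nltk.download("averaged_perceptron_tagger")  # POS tagger model (pos_tag)
nltk.download("maxent_ne_chunker")           # named entity chunker (ne_chunk)
nltk.download("words")                       # word list used by the NE chunker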

We'll also be using Jupyter Notebook to write and run our Python code. Jupyter Notebook is a web-based interactive development environment for creating notebook documents that contain live code, equations, visualizations, and narrative text.

To install Jupyter Notebook, run:

pip install notebook

To start a new notebook, open a terminal or command prompt, navigate to the directory where you want to store your notebook, and run:

jupyter notebook

This will launch the Jupyter Notebook interface in your default web browser. Click the "New" button and select "Python 3" to create a new Python notebook.

Code Examples

Let's walk through some code examples to see how to perform common NLP tasks using NLTK.

Tokenization with NLTK

NLTK provides a number of tokenizers in the nltk.tokenize module. Here's an example of how to use the word_tokenize function to tokenize a sentence:

from nltk.tokenize import word_tokenize

sentence = "NLP is an interesting field."
tokens = word_tokenize(sentence)
print(tokens)

Output:

['NLP', 'is', 'an', 'interesting', 'field', '.']
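
NLTK can also split text at the sentence level. Here's a minimal sketch using sent_tokenize from the same nltk.tokenize module:

from nltk.tokenize import sent_tokenize

text = "NLP is an interesting field. It has many applications."
print(sent_tokenize(text))
# ['NLP is an interesting field.', 'It has many applications.']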

Stemming with NLTK

NLTK provides several stemmers in the nltk.stem package. Here's an example of how to use the PorterStemmer to stem a list of words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["python", "pythoning", "pythoned", "pythoner", "pythonly"]

for word in words:
    print(stemmer.stem(word))

Output:

python
python
python
python
pythonli
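
Note the last stem, "pythonli", which is not an English word; this is the trade-off discussed earlier. The Porter stemmer is also not the only option: NLTK ships the SnowballStemmer (sometimes called "Porter2"), which supports several languages and handles some suffixes more gracefully. A minimal sketch:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
print(stemmer.stem("running"))  # run
print(stemmer.stem("fairly"))   # fair (the Porter stemmer gives "fairli")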

Lemmatization with NLTK

NLTK provides the WordNetLemmatizer in the nltk.stem package for lemmatization. Here's an example:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["cats", "cacti", "geese", "rocks", "oxen"]

for word in words:
    print(lemmatizer.lemmatize(word))

Output:

cat
cactus
goose  
rock
ox
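
By default, lemmatize treats every word as a noun. For verbs, adjectives, and adverbs, pass the part of speech via the pos argument ("v", "a", and "r" respectively) to get the right lemma:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running"))           # running (treated as a noun)
print(lemmatizer.lemmatize("running", pos="v"))  # run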

POS Tagging with NLTK

NLTK provides the pos_tag function for POS tagging. Here's an example:

import nltk

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))

Output:

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
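
These tags come from the Penn Treebank tagset: DT is a determiner, JJ an adjective, NN a singular noun, VBZ a third-person singular present verb, and IN a preposition. If you download NLTK's "tagsets" data package, you can look up any tag from code:

import nltk

nltk.download("tagsets")      # one-time download of the tag documentation
nltk.help.upenn_tagset("JJ")  # prints the definition of JJ with examples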

Named Entity Recognition with NLTK

NLTK provides the ne_chunk function for named entity recognition. Here's an example:

import nltk

sentence = "Microsoft was founded by Bill Gates and Paul Allen in 1975."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)
print(entities)

Output:

(S
  (ORGANIZATION Microsoft/NNP)
  was/VBD
  founded/VBN
  by/IN
  (PERSON Bill/NNP Gates/NNP)
  and/CC
  (PERSON Paul/NNP Allen/NNP)
  in/IN
  (DATE 1975/CD)
  ./.)
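
Exact labels can vary with the chunker model version (some versions will not label "1975" as a DATE at all). If you only need to know whether a span is a named entity, ne_chunk also accepts binary=True, which collapses every entity type into a single NE label:

import nltk

sentence = "Microsoft was founded by Bill Gates and Paul Allen in 1975."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# binary=True tags entity spans simply as NE instead of PERSON, ORGANIZATION, ...
print(nltk.chunk.ne_chunk(tagged, binary=True))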

Advanced Topics

Once you've mastered the basics of NLP, there are many more advanced topics to explore:

Sentiment Analysis

Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text, whether positive, negative, or neutral. NLTK provides a number of tools for sentiment analysis, including the SentimentIntensityAnalyzer class.
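
As a quick taste, here is a minimal sketch using the VADER-based SentimentIntensityAnalyzer; it needs the "vader_lexicon" data package:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("NLTK makes NLP surprisingly easy and fun!"))
# prints a dict of 'neg', 'neu', 'pos' and 'compound' scores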

Text Classification

Text classification is the task of assigning predefined categories to text documents. This can be useful for applications like spam detection, sentiment analysis, and topic labeling. NLTK provides a number of tools for text classification, including the NaiveBayesClassifier class.
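
Here is a minimal sketch with NaiveBayesClassifier, using a tiny hand-labeled dataset and a trivial bag-of-words feature extractor (both invented here purely for illustration):

from nltk import NaiveBayesClassifier, word_tokenize

def features(text):
    # Trivial bag-of-words features: one boolean per token
    return {word: True for word in word_tokenize(text.lower())}

# Toy training data, invented for illustration
train = [
    (features("I love this movie"), "pos"),
    (features("What a great film"), "pos"),
    (features("I hate this movie"), "neg"),
    (features("What a terrible film"), "neg"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("a great movie")))  # pos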

Topic Modeling

Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. Popular algorithms for topic modeling include Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). The Gensim library provides implementations of these algorithms in Python.
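
As a taste of Gensim's API (installed separately with pip install gensim), here is a minimal LDA sketch on a toy corpus invented for illustration:

from gensim import corpora, models

# Toy corpus: each document is a list of tokens (invented for illustration)
documents = [
    ["cat", "dog", "pet", "animal"],
    ["dog", "puppy", "pet", "animal"],
    ["python", "code", "programming"],
    ["code", "software", "programming"],
]

dictionary = corpora.Dictionary(documents)               # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())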

Additional Resources

If you want to dive deeper into NLP and NLTK, the official NLTK documentation (https://www.nltk.org/) and the free online book "Natural Language Processing with Python" (https://www.nltk.org/book/) are great places to start.

With the skills you've learned in this tutorial and the wealth of resources available, you're well on your way to becoming an NLP expert. Happy learning!
