Natural Language Processing (NLP) Tutorial with Python & NLTK
Natural language processing, or NLP for short, is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models to process human language and speech.
NLP has many exciting real-world applications, including:
- Chatbots and virtual assistants (like Siri or Alexa)
- Sentiment analysis (determining the emotion of text)
- Machine translation (automatically translating between languages)
- Automatic summarization (generating summaries of long articles)
- Named entity recognition (identifying people, places, and organizations mentioned in text)
- Question answering
- Text classification (organizing text into categories)
In this tutorial, we'll walk through the main steps of processing text with NLP, using the Python programming language and the Natural Language Toolkit (NLTK) library. By the end, you'll have a solid understanding of the fundamentals of NLP and be able to apply these techniques to your own projects. Let's dive in!
Main Steps in NLP
There are five core steps in the NLP process:
- Tokenization
- Stemming
- Lemmatization
- Part-of-speech tagging
- Named entity recognition
Let's explore each of these in more detail.
Tokenization
Tokenization is the process of splitting text into words, phrases, symbols, or other meaningful elements called tokens. Tokenization is typically the first step in an NLP pipeline, because most of the later steps operate on tokens rather than on raw text.
For example, consider the sentence: "NLP is an interesting field." Tokenizing this sentence produces the following list of tokens: ['NLP', 'is', 'an', 'interesting', 'field', '.'].
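As a rough illustration of the idea, a naive tokenizer can be sketched with a single regular expression. This is a deliberately simplified sketch (the pattern is an assumption chosen for this example); NLTK's tokenizers, shown later, handle many more edge cases like contractions and abbreviations:

```python
import re

def naive_tokenize(text):
    # Match either a run of word characters or any single
    # non-whitespace punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("NLP is an interesting field."))
# ['NLP', 'is', 'an', 'interesting', 'field', '.']
```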
Stemming
Stemming is the process of reducing words to their word stem, base, or root form. For example, the words "learns", "learning", and "learned" would all be reduced to the stem "learn".
The main goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. This base form doesn't need to be a valid English word. For example, the words "computation" and "compute" reduce to the stem "comput".
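To make the idea concrete, here is a toy suffix-stripping stemmer. It is a simplified sketch, not the Porter algorithm that NLTK implements; the suffix list and length guard are assumptions for illustration:

```python
def toy_stem(word):
    # Strip one common inflectional suffix, longest first.
    # Real stemmers (e.g. Porter) apply ordered phases of rules
    # with conditions on what remains after stripping.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["learns", "learning", "learned"]:
    print(toy_stem(w))
# learn
# learn
# learn
```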
Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. A lemma is the canonical form of a set of words.
For example, the words "is", "was", "were", and "am" are all forms of the lemma "be". Or the words "great" and "greater" both have the lemma "great", whereas "good" and "better" belong to the lemma "good".
The main difference between stemming and lemmatization is that lemmas are actual words that appear in the dictionary, while stems may not be actual words.
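The difference can be sketched with a tiny lookup table for irregular forms. The table below is a hand-picked assumption for this example; NLTK's WordNetLemmatizer consults the full WordNet database instead:

```python
# Irregular forms need a dictionary: no suffix-stripping rule
# could ever map "better" to "good" or "was" to "be".
IRREGULAR_LEMMAS = {
    "is": "be", "was": "be", "were": "be", "am": "be",
    "better": "good", "geese": "goose",
}

def toy_lemmatize(word):
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    # Crude fallback for regular plurals only.
    return word[:-1] if word.endswith("s") else word

print(toy_lemmatize("better"))  # good
print(toy_lemmatize("was"))     # be
print(toy_lemmatize("cats"))    # cat
```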
Part-of-speech Tagging
Part-of-speech tagging, or POS tagging for short, is the process of assigning a part of speech (like noun, verb, adjective, etc.) to each token in the text.
POS tagging can provide additional information about the meaning of words and how they relate to each other. For example, consider the sentence "The dog chased the cat." A POS tagger would assign the following tags:
The (determiner) dog (noun) chased (verb) the (determiner) cat (noun).
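The mechanics can be illustrated with a lookup-based tagger. The word-to-tag table below is an assumption that only covers this one sentence; NLTK's pos_tag, shown later, uses a trained statistical model instead:

```python
TAG_LOOKUP = {"the": "DT", "dog": "NN", "cat": "NN", "chased": "VBD"}

def toy_pos_tag(tokens):
    # Look each token up, defaulting unknown words to noun ("NN"),
    # a common baseline heuristic for simple taggers.
    return [(tok, TAG_LOOKUP.get(tok.lower(), "NN")) for tok in tokens]

print(toy_pos_tag(["The", "dog", "chased", "the", "cat"]))
# [('The', 'DT'), ('dog', 'NN'), ('chased', 'VBD'), ('the', 'DT'), ('cat', 'NN')]
```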
Named Entity Recognition
Named entity recognition (NER) is the process of locating and classifying named entities in text into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, percentages, etc.
For example, given the sentence "Microsoft was founded by Bill Gates and Paul Allen in 1975.", an NER system should be able to identify that "Microsoft" is an organization, "Bill Gates" and "Paul Allen" are people, and "1975" is a date.
NER can help extract key information from unstructured text and is used in many applications like information retrieval, question answering, and text summarization.
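The simplest possible NER strategy, a gazetteer (dictionary) lookup, shows the shape of the task. The entity lists below are assumptions for this example; real systems, like NLTK's ne_chunk shown later, use trained models to recognize entities they have never seen:

```python
GAZETTEER = {
    "Microsoft": "ORGANIZATION",
    "Bill Gates": "PERSON",
    "Paul Allen": "PERSON",
}

def toy_ner(text):
    # Scan for known entity strings, longest names first so that a
    # multi-word name wins over any shorter entry it contains.
    found = []
    for name in sorted(GAZETTEER, key=len, reverse=True):
        if name in text:
            found.append((name, GAZETTEER[name]))
    return found

print(toy_ner("Microsoft was founded by Bill Gates and Paul Allen in 1975."))
```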
NLP with Python and NLTK
Now that we've covered the main concepts in NLP, let's see how to implement them using Python and the Natural Language Toolkit (NLTK).
NLTK is an open source Python library for NLP. It provides easy-to-use interfaces and a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more.
Setting up your environment
Before we dive into the code examples, let's make sure you have Python and NLTK set up on your machine. If you haven't installed Python yet, you can download it from the official Python website: https://www.python.org/downloads/
Once you have Python installed, you can install NLTK using pip, the Python package installer. Open a terminal or command prompt and run:
pip install nltk
We'll also be using Jupyter Notebook to write and run our Python code. Jupyter Notebook is a web-based interactive development environment for creating notebook documents that contain live code, equations, visualizations, and narrative text.
To install Jupyter Notebook, run:
pip install notebook
To start a new notebook, open a terminal or command prompt, navigate to the directory where you want to store your notebook, and run:
jupyter notebook
This will launch the Jupyter Notebook interface in your default web browser. Click the "New" button and select "Python 3" to create a new Python notebook.
Code Examples
Let's walk through some code examples to see how to perform common NLP tasks using NLTK.
Tokenization with NLTK
NLTK provides a number of tokenizers in the nltk.tokenize module. Here's an example of how to use the word_tokenize function to tokenize a sentence. Note that the first time you use it, you need to download the Punkt tokenizer models with nltk.download:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

sentence = "NLP is an interesting field."
tokens = word_tokenize(sentence)
print(tokens)
Output:
['NLP', 'is', 'an', 'interesting', 'field', '.']
Stemming with NLTK
NLTK provides several stemmers in the nltk.stem package. Here's an example of how to use the PorterStemmer to stem a list of words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["python", "pythoning", "pythoned", "pythoner", "pythonly"]
for word in words:
    print(stemmer.stem(word))
Output:
python
python
python
python
pythonli
Lemmatization with NLTK
NLTK provides the WordNetLemmatizer in the nltk.stem package for lemmatization. Here's an example. The lemmatizer needs the WordNet data, so we download it first:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
words = ["cats", "cacti", "geese", "rocks", "oxen"]
for word in words:
    print(lemmatizer.lemmatize(word))
Output:
cat
cactus
goose
rock
ox
POS Tagging with NLTK
NLTK provides the pos_tag function for POS tagging. Here's an example. The tagger model needs a one-time download:

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
Output:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Named Entity Recognition with NLTK
NLTK provides the ne_chunk function for named entity recognition. Here's an example. The chunker model and its supporting word list need a one-time download, and ne_chunk works on POS-tagged tokens, so we tokenize and tag first:

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Microsoft was founded by Bill Gates and Paul Allen in 1975."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)
print(entities)
print(entities)
Output (the exact labels can vary between NLTK versions and models):
(S
(ORGANIZATION Microsoft/NNP)
was/VBD
founded/VBN
by/IN
(PERSON Bill/NNP Gates/NNP)
and/CC
(PERSON Paul/NNP Allen/NNP)
in/IN
(DATE 1975/CD)
./.)
Advanced Topics
Once you've mastered the basics of NLP, there are many more advanced topics to explore:
Sentiment Analysis
Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text, whether positive, negative, or neutral. NLTK provides a number of tools for sentiment analysis, including the SentimentIntensityAnalyzer class.
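The lexicon-based approach that tools like the SentimentIntensityAnalyzer build on can be sketched in a few lines. The word scores below are made-up assumptions for illustration; real analyzers use large curated lexicons and also handle negation, intensifiers, and punctuation:

```python
SENTIMENT_LEXICON = {"great": 1.0, "love": 1.0, "interesting": 0.5,
                     "terrible": -1.0, "hate": -1.0, "boring": -0.5}

def toy_sentiment(text):
    # Sum per-word scores from the lexicon; the sign of the total
    # decides the overall label.
    score = sum(SENTIMENT_LEXICON.get(w.strip(".,!?").lower(), 0.0)
                for w in text.split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(toy_sentiment("I love this great tutorial!"))     # positive
print(toy_sentiment("What a boring, terrible movie."))  # negative
```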
Text Classification
Text classification is the task of assigning predefined categories to text documents. This can be useful for applications like spam detection, sentiment analysis, and topic labeling. NLTK provides a number of tools for text classification, including the NaiveBayesClassifier class.
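The core idea behind a Naive Bayes text classifier can be sketched as follows. The training data is a made-up assumption for illustration; NLTK's NaiveBayesClassifier wraps the same computation behind a convenient API:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns (label_counts, word_counts, vocab)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify_nb(model, words):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one (Laplace) smoothing
        logp = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

training = [(["free", "money", "now"], "spam"),
            (["win", "money", "prize"], "spam"),
            (["meeting", "tomorrow", "agenda"], "ham"),
            (["lunch", "tomorrow"], "ham")]
model = train_nb(training)
print(classify_nb(model, ["free", "prize"]))    # spam
print(classify_nb(model, ["agenda", "lunch"]))  # ham
```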
Topic Modeling
Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. Popular algorithms for topic modeling include Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). The Gensim library provides implementations of these algorithms in Python.
Additional Resources
If you want to dive deeper into NLP and NLTK, here are some great resources to check out:
- The NLTK book: https://www.nltk.org/book/
- Stanford NLP course: https://www.youtube.com/playlist?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
- Coursera NLP specialization: https://www.coursera.org/specializations/natural-language-processing
- spaCy Python library: https://spacy.io/
- Gensim Python library: https://radimrehurek.com/gensim/
With the skills you've learned in this tutorial and the wealth of resources available, you're well on your way to becoming an NLP expert. Happy learning!