How I Scraped 7000 Articles From a Newspaper Website Using Node

As developers, we're always looking for interesting datasets to work with. One treasure trove of data is news articles. They contain a wealth of information on important people, organizations, events and trends over time.

So when a family member mentioned they needed to analyze thousands of news articles for a research project, I was intrigued. The goal was to scrape over 7,000 articles from a major newspaper website, going back several decades.

While 7,000 articles is a lot for a human, it's not so bad for a computer! I realized this would be a great opportunity to flex my web scraping skills using Node.js.

In this post, I'll walk through how I built a Node script to automatically scrape the articles. I'll cover:

  • Compiling the list of article URLs to scrape
  • Logging in to the newspaper website to access full article content
  • Fetching and parsing the individual article pages
  • Saving the scraped article data in a structured format

Let's dive in! But first, a quick overview of the tools I used:

  • Node.js – Server-side JavaScript runtime
  • request – Library for making HTTP requests
  • Cheerio – Library for parsing and querying HTML using jQuery-like syntax

Gathering Article URLs

The first step was to compile a list of the article URLs to scrape. Luckily, the newspaper website had a search feature that would show a list of matching article headlines, URLs, and metadata.

By searching for relevant keywords, I was able to get the URLs for all 7,000+ articles. I captured the search results pages using the following code:


const request = require('request');
const cheerio = require('cheerio');

// Collects the URL and headline of every article found in the search results
const articleUrls = [];

function getArticleUrls(searchUrl, page) {

    const pageUrl = searchUrl + '&page=' + page;

    request(pageUrl, (error, response, html) => {

        if (!error) {

            const $ = cheerio.load(html);

            const articleLinks = $('.article-link');

            articleLinks.each((i, link) => {

                const articleUrl = $(link).attr('href');
                const headline = $(link).text();

                articleUrls.push({
                    url: articleUrl,
                    headline: headline
                });
            });

            // Check if there are more search result pages
            if ($('.next-page').length > 0) {
                getArticleUrls(searchUrl, page + 1);
            }
            else {
                console.log('Found ' + articleUrls.length + ' article URLs');
                // All result pages processed - time to fetch the articles themselves
            }
        }
        else {
            console.log('Error fetching search results: ' + error);
        }
    });
}

This code recursively fetches each search results page, parses it using Cheerio, and extracts the article URLs and headlines. It keeps fetching pages until there are no more results.

The article URLs and headlines are stored in an articleUrls array for the next step.
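
To kick the crawl off, the function just needs the site's search URL and a starting page number. The URL below is only a placeholder, since the newspaper's real query format isn't shown here:


// Placeholder search URL - substitute the site's actual search query URL
const searchUrl = 'https://www.newspaper.com/search?q=keyword';

// Start on page 1; the function recurses through the remaining result pages
getArticleUrls(searchUrl, 1);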

Logging In and Accessing Full Articles

I soon hit a roadblock. The newspaper's search results only showed article summaries. When I tried to access the full articles, I ran into a paywall blocking unsubscribed users.

To get the full article content, I would need to log into the website first. This meant my scraper needed to submit login credentials and store cookies just like a browser.

By using Chrome Dev Tools to inspect the network traffic when logging in manually, I was able to see the POST request that submits the login credentials. It looked something like this:


POST /login HTTP/1.1
Host: www.newspaper.com
Content-Type: application/x-www-form-urlencoded

username=john123&password=abc123&token=c3neb13mvq1jh7ravrlpqb0ki4

In addition to the username and password, the login requires a CSRF token. This token is randomly generated and included in the login page's HTML.

To log in automatically, I needed my script to first load the login page, extract the CSRF token, and then submit the login POST request with the credentials and token.

Here's what that code looks like:


// Shared cookie jar so the login session is reused by every later request
const requestJar = request.jar();

function loginToNewspaper(loginUrl, callback) {

    // First load the login page
    request({ url: loginUrl, jar: requestJar }, (error, response, html) => {

        if (!error) {

            // Extract CSRF token from the login form's HTML
            const $ = cheerio.load(html);
            const csrfToken = $('input[name="token"]').val();

            const loginCredentials = {
                username: process.env.NEWSPAPER_USER,
                password: process.env.NEWSPAPER_PWD,
                token: csrfToken
            };

            // Submit login credentials; the session cookies end up in the jar
            request.post({
                url: loginUrl,
                form: loginCredentials,
                jar: requestJar
            }, (error, response, body) => {

                if (!error) {
                    console.log('Logged in successfully!');
                    callback();
                }
                else {
                    console.log('Login failed: ' + error);
                }
            });
        }
        else {
            console.log('Failed to load login page: ' + error);
        }
    });
}

I stored the actual login credentials in environment variables to avoid hardcoding them.
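
As a small extra safeguard (my own addition, not part of the original flow), the script can fail fast if those variables are missing:


// Bail out early if the credentials weren't provided via the environment
if (!process.env.NEWSPAPER_USER || !process.env.NEWSPAPER_PWD) {
    console.log('Please set NEWSPAPER_USER and NEWSPAPER_PWD before running the scraper');
    process.exit(1);
}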

The real magic here is the requestJar cookie jar created above. By default, the request library does not store cookies between requests, so the session established by the login would be lost immediately. Passing the same jar with every request keeps the login cookies and sends them on all subsequent requests:


// Reusing the same jar sends the stored login cookies with this request
request({ url: articleUrl, jar: requestJar }, (error, response, html) => {

    // html now contains the full article page, not the paywalled summary

});

With this in place, my script was able to log in and access full article pages.
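
Putting the pieces together, the login call simply wraps the scraping step. Here is a rough usage sketch with a placeholder login URL:


// Placeholder login URL - use whatever URL the site's login form posts to
const loginUrl = 'https://www.newspaper.com/login';

loginToNewspaper(loginUrl, () => {
    // At this point the jar holds the session cookies,
    // so the article requests below will receive full content
    console.log('Ready to start scraping articles');
});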

Scraping Article Content

With the list of article URLs and the ability to log in, I was ready to start scraping actual article content.

For each article URL:

  1. Fetched the HTML of the full article page
  2. Used Cheerio to locate and extract the key article elements like headline, author, date, and article text
  3. Stored the extracted data in an article object
  4. Saved the article object to an array

Here's a simplified version of the extraction code:


// Scraped articles accumulate here and are written to disk at the end
const articles = [];

function getArticleContent(articleUrl, callback) {

    request({ url: articleUrl, jar: requestJar }, (error, response, html) => {

        if (!error) {

            const $ = cheerio.load(html);

            const article = {
                url: articleUrl,
                headline: $('.article-headline').text(),
                author: $('.article-author').text(),
                date: $('.article-date').text(),
                text: $('.article-text').text()
            };

            // Push article to array
            articles.push(article);
            callback();
        }
        else {
            // Log the failure but keep going so one bad URL doesn't stall the run
            console.log('Error fetching article: ' + error);
            callback();
        }
    });
}

The actual extraction was a bit more complex, as I had to clean up the extracted text and handle missing elements. But the core concept is the same – locate the desired elements using Cheerio selectors and extract their text.
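
To give a flavour of that cleanup, here is a small sketch (not the exact code from the project) that trims whitespace and falls back to a default when an element is missing:


// Illustrative helper: extract and normalize text for a selector,
// returning a fallback value when the element isn't on the page
function extractText($, selector, fallback) {
    const element = $(selector);
    if (element.length === 0) {
        return fallback;
    }
    return element.text().replace(/\s+/g, ' ').trim();
}

// Example: the author byline was sometimes missing on older articles
const author = extractText($, '.article-author', 'Unknown');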

To be respectful of the newspaper's servers, I added a delay between each article request using setTimeout().

Once all articles were scraped, I saved the articles array to a JSON file.
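
The driver loop that ties this together isn't shown above, but it is essentially a sequential walk through articleUrls with a pause between requests, followed by a single write to disk. Here is a rough sketch of that final piece (the function name, two-second delay, and output filename are my own choices):


const fs = require('fs');

function scrapeAllArticles(index) {
    if (index >= articleUrls.length) {
        // All done - write the results out as a JSON file
        fs.writeFileSync('articles.json', JSON.stringify(articles, null, 2));
        console.log('Saved ' + articles.length + ' articles to articles.json');
        return;
    }

    getArticleContent(articleUrls[index].url, () => {
        // Pause before the next request to go easy on the newspaper's servers
        setTimeout(() => scrapeAllArticles(index + 1), 2000);
    });
}

scrapeAllArticles(0);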

Final Thoughts

In the end, I was able to successfully scrape over 7,000 news articles from the past 30 years. It took some trial and error to get the login flow and article extraction working smoothly, but the final script ran without a hitch.

There are a few key takeaways from this project:

  • Web scraping is a powerful way to gather large datasets, but it comes with responsibility. It's important to respect website terms of service and copyright. Some best practices are limiting request rate, only extracting information you have a right to, and using scraped data responsibly.

  • When scraping websites that require login, you'll need to simulate the login process by inspecting the login form submission and handling cookies. Dev tools and the request library make this doable.

  • Expect to hit roadblocks and plan extra time for troubleshooting. It took a while for me to figure out the login flow and make the scraper resilient to network failures.

  • Look for ways to make your scraper smarter and more efficient. For example, I used caching to avoid re-fetching URLs I had already scraped (a sketch of the idea follows below).
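
The caching in my case was nothing elaborate. The idea is just to record which URLs have already been scraped and skip them on later runs. A minimal sketch of that idea, assuming the results were written to articles.json as in the sketch above:


const fs = require('fs');

// Load any previously scraped articles and index them by URL
const cached = fs.existsSync('articles.json')
    ? JSON.parse(fs.readFileSync('articles.json', 'utf8'))
    : [];
const scrapedUrls = new Set(cached.map(article => article.url));

// Only fetch the URLs we haven't scraped yet
const remainingUrls = articleUrls.filter(item => !scrapedUrls.has(item.url));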

The possibilities for extending this kind of scraper are endless. You could integrate natural language processing to analyze the extracted article text. Or hook up a database to store the scraped data for further querying and analysis.

I hope this post gave you a taste of the power of web scraping with Node.js. With a few core tools and techniques, you can gather comprehensive datasets to power all sorts of analysis and applications.

The complete source code for this project is available on my GitHub. Feel free to adapt it for your own web scraping needs.

Happy scraping!
