Friday, August 18, 2023

Using Python and BeautifulSoup to web scrape Times of India news headlines

According to Wikipedia, "web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites." In this post, we will create a small program in Python to scrape top headlines from Times of India's news headlines page using the BeautifulSoup library.

Sample webpage showing the top news headlines

Specifically, our program will fetch the Times of India Headlines page and extract the prime news headlines at the top of the page. As of this writing, the page displays 6 headlines in that section, which is what we want to scrape. In the screenshot of the webpage above, our point of interest is the highlighted section containing the top 6 headlines.

The programming language we will use is Python 3. Along with it, we will use the requests package to fetch the web page and the BeautifulSoup 4 package to parse the HTML. I will assume that you already have a system with these prerequisites installed and ready to run, and that you have a Python editor and interpreter to run the program. For the purpose of this illustration, I will use Google Colab to write and execute the Python code.
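If requests or BeautifulSoup 4 is not already available in your environment, running the following in a Colab cell (or in a terminal, without the leading exclamation mark) should install both packages:

!pip install requests beautifulsoup4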

Part 1: Scrape the website

To start with, we will write a simple script that fetches the data and outputs the scraped text in the editor's output window. Type the following code in your Python editor; I will explain it later in this post. A copy of this code is available in my TOITopHeadlines v1.0 repository on GitHub.


# This program scrapes a web page from Times of India to extract
# top headlines and prints them in the output window.

import requests
from bs4 import BeautifulSoup

def toi_topheadlines():
  url = "https://timesofindia.indiatimes.com/home/headlines"
  page_request = requests.get(url)
  page_content = page_request.content
  soup = BeautifulSoup(page_content,"html.parser")

  count = 1

  # Drill down through the nested tags that contain the top headlines:
  # div#c_02 > div#c_0201 > div#c_headlines_wdt_1 > div.top-newslist > ul.clearfix > li > span.w_tle > a
  for divtag_c02 in soup.find_all('div', {'id': 'c_02'}):
    for divtag_0201 in divtag_c02.find_all('div', {'id': 'c_0201'}):
      divtag_hwdt1 = divtag_0201.find('div', {'id': 'c_headlines_wdt_1'})
      for divtag_topnl in divtag_hwdt1.find_all('div',
       {'class': 'top-newslist'}):
        for ultag in divtag_topnl.find_all('ul',{'class': 'clearfix'}):
          for litag in ultag.find_all('li'):
            for spantitle in litag.find_all('span', {'class': 'w_tle'}):
              href = spantitle.find('a')['href']
              if href.startswith("/"):  # relative link: prefix the site domain
                href = "https://timesofindia.indiatimes.com" + href
                print(str(count) + ". " + spantitle.find('a')['title'] +
                      " - " + href)
                count = count + 1

if __name__ == "__main__":
  toi_topheadlines()

print("\n" + "end")

Executing the code fetches the HTML from the URL, parses out the required data, and outputs the list of news headline titles and their respective URLs, as highlighted in the screenshot below:

News headlines scraped using Python

If you have managed to get that working, congratulations. You have scraped the top headlines and can now use them in your own creative ways. Next, we will delve into what we did and how it works.

Now, take a look at the portion of the source code that goes through a chain of for loops to descend into the HTML tags. This corresponds exactly to the way the markup is structured on the web page. You can examine the HTML markup by opening the browser's Developer Tools and inspecting the code behind the UI elements.

Inspecting HTML tag structure
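Based on the ids and classes that our code looks for, the relevant part of the markup is assumed to be nested roughly like this (heavily simplified, and the href and title values below are just placeholders):

<div id="c_02">
  <div id="c_0201">
    <div id="c_headlines_wdt_1">
      <div class="top-newslist">
        <ul class="clearfix">
          <li>
            <span class="w_tle">
              <a href="/some-headline.cms" title="Some headline">...</a>
            </span>
          </li>
          <!-- more li items, one per headline -->
        </ul>
      </div>
    </div>
  </div>
</div>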

Your program has to be tuned according to the HTML markup structure of the page that you are trying to scrape.
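As an aside, the same traversal can be expressed more compactly with BeautifulSoup's select() method and a CSS selector. This is only a sketch that assumes the same ids and classes as the code above and mirrors its logic of keeping only relative links:

count = 1
# One CSS selector instead of nested for loops; the ids/classes are the
# same assumptions as above and will need updating if the page changes.
for anchor in soup.select("div#c_02 div#c_0201 div#c_headlines_wdt_1 "
                          "div.top-newslist ul.clearfix li span.w_tle a"):
  href = anchor.get("href", "")
  if href.startswith("/"):  # relative link: prefix the site domain
    href = "https://timesofindia.indiatimes.com" + href
    print(str(count) + ". " + anchor.get("title", "") + " - " + href)
    count = count + 1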

Part 2: Write the scraped data to a file in Google Drive

Now that our program can successfully scrape the data, in this section we will take it a step further and write the scraped data to a JSON file in Google Drive. We will continue to use Google Colab to run the program.

For this, we will mount Google Drive in the Colab runtime and create a folder to store our files. We collect the headlines in a list of dictionaries and then use Python's json library to write the list to a JSON file. A copy of this code is available in my TOITopHeadlines v2.0 repository on GitHub.


# This program scrapes a web page from Times of India to extract
# top headlines and writes them to a JSON file in Google Drive.

import requests
import datetime
import json
from bs4 import BeautifulSoup

# Prepare file location
import os
from google.colab import drive
strDriveMountLoc = '/content/drive'
strDriveTargetLoc = "/content/drive/My Drive/WebScrape/DataNewsScrapeTOI"
# Mount Google Drive
drive.mount(strDriveMountLoc)
# Create the target folder in Drive if it does not already exist
os.makedirs(strDriveTargetLoc, exist_ok=True)

def toi_topheadlines():
  # Generate output filename based on the date and time
  dt = datetime.datetime.now()
  filename = "toi_topheadlines" + dt.strftime("%Y%m%d%H%M%S") + ".json"

  url = "https://timesofindia.indiatimes.com/home/headlines"
  page_request = requests.get(url)
  page_content = page_request.content
  soup = BeautifulSoup(page_content,"html.parser")

  count = 1
  txtscraped = ""
  headlines = []

  for divtag_c02 in soup.find_all('div', {'id': 'c_02'}):
    for divtag_0201 in divtag_c02.find_all('div', {'id': 'c_0201'}):
      divtag_hwdt1 = divtag_0201.find('div', {'id': 'c_headlines_wdt_1'})
      for divtag_topnl in divtag_hwdt1.find_all('div',
       {'class': 'top-newslist'}):
        for ultag in divtag_topnl.find_all('ul',{'class': 'clearfix'}):
          for litag in ultag.find_all('li'):
            for spantitle in litag.find_all('span', {'class': 'w_tle'}):
              href = spantitle.find('a')['href']
              if href.startswith("/"):  # relative link: prefix the site domain
                href = "https://timesofindia.indiatimes.com" + href
                print(str(count) + ". " + spantitle.find('a')['title'] +
                      " - " + href)
                thisheadline = {
                    "sn": count,
                    "title": spantitle.find('a')['title'],
                    "href": href
                }
                headlines.append(thisheadline)

                count = count + 1

  # Write the collected headlines to a JSON file in the Drive folder
  with open(strDriveTargetLoc + '/' + filename, "w") as f:
    f.write(json.dumps(headlines, indent=2))

if __name__ == "__main__":
  toi_topheadlines()

print("\n" + "end")

Executing the code in Google Colab will display a prompt to connect to Google Drive and then take you through a series of pages to authenticate with your Google ID. Once you are past the authentication, the code should execute and create a JSON file in the folder path that you chose in the program. Below you can see what the list of files looks like.

JSON files in Google Drive

The content of the JSON file would look similar to what you see below.

The JSON output
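Since each entry in the headlines list is a dictionary with the keys sn, title and href, the file follows this general shape (the titles and links shown here are placeholders, not real headlines):

[
  {
    "sn": 1,
    "title": "Example headline title",
    "href": "https://timesofindia.indiatimes.com/example-article.cms"
  },
  {
    "sn": 2,
    "title": "Another example headline",
    "href": "https://timesofindia.indiatimes.com/another-article.cms"
  }
]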

In Conclusion

A word of caution on web scraping: while many websites don't mind it, many others do not like it. It is best to go through their terms of service to understand the limitations they place on what you can do with the data and to ensure that you are not in violation. Another important point to remember is that many websites periodically change their look and feel, thereby modifying the structure of the HTML. In the face of such changes, your web scraping logic may fall flat, so web scrapers need continuous maintenance. A better way to capture and harvest such data is to use the APIs published by the websites, where available. This demonstration is for academic purposes only.
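On the maintenance point above, here is a small, illustrative sketch of how the Part 1 traversal could be made to fail gracefully rather than crash when the markup changes, for example when find() returns None because an id has disappeared:

# Sketch: guard against structural changes instead of assuming the markup
# is unchanged. find() returns None when the tag is missing.
widget = soup.find('div', {'id': 'c_headlines_wdt_1'})
if widget is None:
  print("Could not find the headlines widget; the page layout may have changed.")
else:
  for spantitle in widget.find_all('span', {'class': 'w_tle'}):
    anchor = spantitle.find('a')
    if anchor and anchor.get('href') and anchor.get('title'):
      print(anchor['title'] + " - " + anchor['href'])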

Happy scraping!
