Fathooo

News extraction process through scraping at Diario La Discusión

Scraping system that extracts news, links, and content from Diario La Discusión.

BeautifulSoup Python Requests SQLite Scrapy

About the project


Last Test of the Script - 03 / 03 / 2022

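This script is a data extractor (scraper). In this case, I extracted the content of https://www.ladiscusion.cl, a newspaper from my city. The scraping is divided into two parts: the first collects the news links for each category, and the second extracts the title, time, content, and subtitles of each article.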

Before starting, we check robots.txt and verify the permissions granted by the newspaper.
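The robots.txt check can also be done in code. The following is a minimal sketch using Python's standard urllib.robotparser; the function name is_allowed and the sample rules are hypothetical illustrations, not the newspaper's real robots.txt and not necessarily how the original script performs the check.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() works on the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (assumption, not the real file).
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_RULES, "https://www.ladiscusion.cl/news/"))
    print(is_allowed(SAMPLE_RULES, "https://www.ladiscusion.cl/private/page"))
```

To check the live file instead, RobotFileParser also offers set_url() followed by read(), which downloads robots.txt before can_fetch() is called.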

2020220303132944.webp

Once ready, we continue.

  • The script creates two folders, data and data_content. The data folder stores the DataFrames of categories and their links, and data_content stores the scraped content.

  • The first step to use the script is to select the first option:

2020220303133359.webp

  • Once option [1] is entered, a second menu appears:

2020220303133939.webp

  • In this second menu, option 1 scrapes every link in the navigation bar.

2020220303132553.webp

  • Option 2, by contrast, scrapes only the categories we choose. The images that follow demonstrate option 2.

2020220303134235.webp

  • In the example above, I selected only categories 0, 3, and 4. I then ran the script, and it started generating the tables.

2020220303134554.webp
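Selecting categories by number, as in the 0, 3, and 4 example, comes down to parsing the typed text into valid indices. Below is a small sketch of that idea; parse_selection and the category names are hypothetical, and the original script may read its input differently.

```python
def parse_selection(raw: str, total: int) -> list:
    """Turn input such as "0, 3, 4" into sorted, unique, in-range indices."""
    chosen = set()
    # Accept commas and/or spaces as separators.
    for token in raw.replace(",", " ").split():
        try:
            index = int(token)
        except ValueError:
            raise ValueError(f"not a category number: {token!r}") from None
        if not 0 <= index < total:
            raise ValueError(f"category {index} is out of range (0-{total - 1})")
        chosen.add(index)
    return sorted(chosen)


# Hypothetical names; the real ones come from the site's navigation bar.
categories = ["Category A", "Category B", "Category C", "Category D", "Category E"]
selected = [categories[i] for i in parse_selection("0, 3, 4", len(categories))]

if __name__ == "__main__":
    print(selected)
```

Rejecting invalid numbers before any request is sent avoids starting a long scraping run with a mistyped selection.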

Once the first process is completed:

  • The data folder contains the DataFrames.

2020220303152156.webp

  • Here is a glimpse of the first DataFrame.

2020220303140442.webp
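The link collection performed in this first step can be illustrated with a short sketch. To keep it dependency-free, it uses the standard library's html.parser rather than BeautifulSoup, which is what the project lists; the class name LinkCollector, the sample HTML, and its paths are assumptions made for illustration and do not reflect the site's real markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, absolute URL) pairs for every <a href="..."> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs
        self._href = None    # URL of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the site's base URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical navigation markup (assumption, for illustration only).
SAMPLE_HTML = """
<nav>
  <a href="/category/a/">Category A</a>
  <a href="https://www.ladiscusion.cl/category/b/">Category B</a>
</nav>
"""

collector = LinkCollector("https://www.ladiscusion.cl")
collector.feed(SAMPLE_HTML)

if __name__ == "__main__":
    for title, url in collector.links:
        print(title, url)
```

The resulting list of pairs is the kind of two-column data (name and link) that can then be loaded into a DataFrame and saved in the data folder.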

Second Step - Creation of DataFrames with Title - Time - Content - Subtitles

  • With the first step, we will have all the news links from the page that were available at the time of executing the script.
  • The next step is simple.
  • In the script menu, choose option [2].

2020220303153014.webp

  • This option displays another menu, which works on the links found in the files in the data folder:
  • Option [1] processes all of the files.
  • Option [2] scrapes only specific files.
  • I will use option [2] as a test.

2020220303153451.webp

  • We can observe that some links cannot be opened; however, the script continues.

2020220303153805.webp

  • Once finished, in our data_content folder, we will have the DataFrames and content ready for manipulation.

2020220303154624.webp

  • Here is the first DataFrame as an example.

2020220303160128.webp

  • And here is its df.info() output.

2020220303160158.webp
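The behavior seen in the second step, where some links cannot be opened but the script keeps going, can be sketched as a loop that catches the error for one link and moves on to the next. This sketch uses the standard library's urllib instead of Requests so that it runs without extra packages; fetch_html and scrape_all are hypothetical names, not the original script's functions.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch=fetch_html):
    """Fetch each URL in turn; return the pages and the links that failed."""
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError) as error:
            # URLError, HTTPError and timeouts are all OSError subclasses;
            # ValueError covers malformed URLs. Report and keep going.
            print(f"Could not open {url}: {error}")
            failed.append(url)
    return pages, failed
```

Passing the fetch function as a parameter makes the loop easy to try without network access, for example with a stand-in function that raises OSError for chosen links.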

Duration of the Scraping

  • First part: 40 minutes.
  • Second part: 10 hours.

Aspects to Optimize

  • Threads can be used to scrape in parallel and reduce the total scraping time.
  • Links whose GET request failed can be recorded and scraped again in a separate pass.
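Both optimizations can be combined in one sketch: a thread pool downloads several links at once, and any link that raises an error is kept in a list for a later pass. This is an illustration under stated assumptions, not the project's implementation; scrape_parallel, its fetch parameter, and the default of 8 workers are hypothetical choices.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_parallel(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL in a thread pool.

    Returns (results, failed): a dict of url -> fetched value, and the
    list of URLs whose fetch raised an exception.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be identified.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # Keep the link so it can be scraped again on its own.
                failed.append(url)
    return results, failed


if __name__ == "__main__":
    def demo_fetch(url):
        # Stand-in for a real download; fails for one hypothetical link.
        if url.endswith("/bad"):
            raise OSError("unreachable")
        return f"content of {url}"

    pages, failed = scrape_parallel(
        ["https://example.com/a", "https://example.com/bad"], demo_fetch
    )
    # Second pass over only the links that failed the first time.
    retried, still_failed = scrape_parallel(failed, demo_fetch)
    print(len(pages), failed, still_failed)
```

Threads suit this job because most of the time is spent waiting on the network. Keeping max_workers modest limits the load placed on the newspaper's server.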