Corpus linguistics: Data collection (I)

Creating a corpus of news headlines with Python

Posted by Mario Casado on 2020-10-30
Corpus linguistics

Introduction

Creating a corpus can be a tedious task: collecting every single piece of data, fitting it together, making sure everything is in the right format with the right structure… Python is a great ally for this purpose. It makes it possible to collect thousands of data points in seconds and, at the same time, shape them the way that best fits our needs.

Prerequisites

  • Basic knowledge of the HTTP protocol
  • Basic knowledge of XML structure
  • Basic knowledge of Python

1 Data profile

First of all, it is fundamental that we settle some questions from the beginning:

  • What kind of data do I want? Is it spoken or written? Either category can in turn be formal, informal, etc.
  • Where can I easily get those data? There are many ways to obtain data from the internet; however, going for data that is already (or nearly) processed will often be our best bet.
  • How much data? Don’t forget to define the right amount of data to collect. It is always better to start with a smaller amount and come back for more later than to end up with an unmanageable corpus.

On this occasion, we will be collecting formal written data in the form of news headlines. We will make use of online newspapers’ RSS feeds, where the daily news can be accessed already processed as XML. Since we will be querying several newspapers, our first task is to create a little database with the newspapers’ data. The pieces of data we are interested in are each newspaper’s name and the link to the RSS feed where we will get the news. Let’s create a JSON file called np_db.json:

[
  {
    "id": 1,
    "name": "ElDiario.es",
    "url": "https://www.eldiario.es/rss/"
  },
  {
    "id": 2,
    "name": "El País",
    "url": "https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada"
  },
  {
    "id": 3,
    "name": "La Vanguardia",
    "url": "https://www.lavanguardia.com/mvc/feed/rss/home"
  }
]

Observe how we have added an ID to each newspaper so we can easily refer to it in our corpus.
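
Before moving on, it can be worth sanity-checking that every entry in the database has the fields we expect. Here is a minimal sketch (this check is my own addition, not part of the tutorial’s script):

from json import load

REQUIRED_KEYS = {'id', 'name', 'url'}

with open('np_db.json', 'r') as j:
    newspapers = load(j)

# Fail early if any entry is missing one of the expected fields
for entry in newspapers:
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"Newspaper entry {entry} is missing: {missing}")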

2 Querying the newspapers

To fetch the data from the RSS sources, we will use the get() function from Python’s requests library, which lets us send an HTTP GET request and receive the response. We will iterate over our list of sources and query the corresponding URL. To load a JSON file in Python, we will use the load() function from the json library.

from json import load
from requests import get

with open('np_db.json', 'r') as j:
    newspapers = load(j)

for newspaper in newspapers:
    rss = get(newspaper['url']).content

First, we load the whole JSON file. The load() function makes its content available as a native Python list of dictionaries. Then, we loop through the newspapers list and send a GET request to the URL of the current newspaper. We save the body of the response (the .content attribute) in the rss variable.
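
In practice, a network request can fail or hang. The loop above assumes every feed responds correctly; a slightly more defensive variant (my own sketch, not part of the original script) adds a timeout and checks the status code:

from json import load
from requests import get, RequestException

with open('np_db.json', 'r') as j:
    newspapers = load(j)

for newspaper in newspapers:
    try:
        # Give up after 10 seconds; raise_for_status() raises on HTTP error codes
        response = get(newspaper['url'], timeout=10)
        response.raise_for_status()
    except RequestException as error:
        print(f"Could not fetch {newspaper['name']}: {error}")
        continue
    rss = response.content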

3 Processing XML

In order to access the XML structure easily, we will use Python’s xml.etree.ElementTree module. This will allow us to walk through the XML code as a tree. We will create a function that processes each article’s data and returns the headlines as a list of Python dictionaries.

import xml.etree.ElementTree as ET

def process_rss(rss):
    headlines = list()
    tree = ET.fromstring(rss)
    for item in tree.iter("item"):
        headline = {
            'date': item.find("pubDate").text,
            'title': item.find("title").text
        }
        headlines.append(headline)
    return headlines

We first parse the XML with the fromstring() function, which takes the XML code as plain content and turns it into a structured XML object. Then we create an iterator over all the pieces of news (the item tags in the RSS). Iterating through each of them, we gather the date and the title. Finally, we append the headline dictionary to the list of headlines and, when the for loop is done, we return that list.
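
One caveat: find() returns None when a tag is missing, so the .text access would raise an AttributeError on a malformed item. If a feed turns out to be irregular, a defensive variant along these lines (my own sketch) simply skips such items:

import xml.etree.ElementTree as ET

def process_rss_safe(rss):
    # Like process_rss(), but skips items that lack a date or a title
    headlines = list()
    tree = ET.fromstring(rss)
    for item in tree.iter("item"):
        date = item.find("pubDate")
        title = item.find("title")
        if date is None or title is None:
            continue
        headlines.append({'date': date.text, 'title': title.text})
    return headlines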

Now, we just have to add the function to the beginning of our script and call it below:

from json import load
from requests import get
import xml.etree.ElementTree as ET

def process_rss(rss):
    headlines = list()
    tree = ET.fromstring(rss)
    for item in tree.iter("item"):
        headline = {
            'date': item.find("pubDate").text,
            'title': item.find("title").text
        }
        headlines.append(headline)
    return headlines

with open('np_db.json', 'r') as j:
    newspapers = load(j)

for newspaper in newspapers:
    rss = get(newspaper['url']).content
    headlines = process_rss(rss)

Finally, we just have to write all the headlines to a file. We will create a TSV file, since it can be opened with Excel. Using the datetime library, we will timestamp the output file’s name. Take a look at the final result:

from datetime import datetime
from json import load
from requests import get
import xml.etree.ElementTree as ET

def process_rss(rss):
    # Parse the RSS feed and return a list of {'date', 'title'} dictionaries
    headlines = list()
    tree = ET.fromstring(rss)
    for item in tree.iter("item"):
        headline = {
            'date': item.find("pubDate").text,
            'title': item.find("title").text
        }
        headlines.append(headline)
    return headlines

# Load the newspaper database
with open('np_db.json', 'r') as j:
    newspapers = load(j)

# Write the corpus to a timestamped TSV file
with open(f'headlines_{datetime.timestamp(datetime.now())}.tsv', 'w') as w:
    w.write('source\tdate\ttext\n')
    for newspaper in newspapers:
        rss = get(newspaper['url']).content
        headlines = process_rss(rss)
        for headline in headlines:
            w.write('{source}\t{date}\t{text}\n'.format(source=newspaper['id'], date=headline['date'], text=headline['title']))

Note how we first write the TSV header row and only then query the RSS data. After getting a newspaper’s headlines, we write them down in the output file, which will end up looking similar to this:

source date text
1 Thu, 29 Oct 20 22:04:40 +0000 Toda España salvo cuatro comunidades se acoge al estado de alarma para encerrarse antes del puente
1 Fri, 30 Oct 20 08:07:41 +0000 El PIB registra un crecimiento histórico del 16,7% en el tercer trimestre empujado por el consumo
1 Fri, 30 Oct 20 10:45:28 +0000 La Xunta cierra las siete principales ciudades de Galicia y prohíbe las reuniones de no convivientes hasta el martes
2 Fri, 30 Oct 2020 15:47:19 GMT El cierre por días de Madrid ahonda la brecha entre PP y Ciudadanos
2 Fri, 30 Oct 2020 16:20:28 GMT El estado de alarma por coronavirus, en directo | Illa confía en que no va a ser necesario llegar a confinamientos domiciliarios
2 Fri, 30 Oct 2020 16:31:47 GMT Un fuerte seísmo sacude la costa turca del Egeo y causa varios muertos y decenas de heridos
3 Fri, 30 Oct 2020 15:03:43 +0100 Illa descarta confinamientos domiciliarios: “No será necesario llegar ahí”
3 Fri, 30 Oct 2020 12:01:42 +0100 La Xunta ordena el cierre perimetral de todas las ciudades para el puente
3 Fri, 30 Oct 2020 07:33:28 +0100 Tarjetas de crédito para pagar sin intereses en el Black Friday y Navidad

For reasons of space, I have shortened the file to just nine lines.
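
Once the corpus is on disk, we can read it back into Python for analysis with the standard csv module. A quick sketch (the file name is just a placeholder, since the real one carries a timestamp):

import csv

# 'headlines_example.tsv' stands in for the timestamped file the script produces
with open('headlines_example.tsv', 'r') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        print(row['source'], row['text'])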

Conclusion

In this tutorial we have learnt how to create a large corpus by fetching data from online newspapers. These media usually have an RSS feed where we can access the daily news processed as XML. Collecting thousands of news headlines is as easy as coming up with the list of media we want to query, making some GET requests and processing the XML response into readable data.