Creating a corpus can be a tedious task: collecting every single piece of data, fitting it together, making sure everything is in the right format with the right structure… Python is a great ally for this purpose. It makes it possible to collect thousands of data points in seconds and, at the same time, shape them the way that best fits our needs.
- Basic knowledge of the HTTP protocol
- Basic knowledge of XML structure
- Basic knowledge of Python
First of all, it is essential to clarify a few aspects from the beginning:
- What kind of data do I want? Is it oral? Is it written? Both categories can also be formal, informal, etc.
- Where can I easily get that data? There are many ways to get data from the internet. However, going for data that is nearly or already processed will often be our best bet.
- How much data? Don’t forget to define the right amount of data to collect. It is always better to start with a smaller amount and come back for more later than to end up with an unmanageable corpus.
On this occasion, we will collect formal written data in the form of news headlines. We will make use of online newspapers’ RSS feeds, where the daily news can be accessed already processed as XML. Since we will be querying several newspapers, our first task is to create a little database with the newspapers’ data. The pieces of data we are interested in are each newspaper’s name and the link to the RSS feed from which we will get the news. Let’s create a JSON file called
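The exact contents of that file are not shown here; a minimal sketch of such a database, with illustrative newspaper names and feed URLs (not the real ones), might look like this:

```json
[
    {"id": 1, "name": "Newspaper One", "rss": "https://example.com/one/rss"},
    {"id": 2, "name": "Newspaper Two", "rss": "https://example.com/two/rss"},
    {"id": 3, "name": "Newspaper Three", "rss": "https://example.com/three/rss"}
]
```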
Observe how we have added an ID to each newspaper in order to easily refer to them in our corpus.
To fetch the data from the RSS sources we will use the get() function from Python’s requests library, which allows us to send HTTP GET requests and receive the responses. We will iterate over our list of sources and query each corresponding URL. To load the JSON file in Python, we will use the load() function from the json library.
from json import load
from requests import get
First, we load the entire content of the JSON file; the load() function makes it available as a native Python list of dictionaries. Then, we loop through the newspapers list and send a GET request to the URL of the current newspaper, saving the body of the response (the .content attribute).
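The fetching step described above can be sketched as follows. The file name sources.json and the helper name fetch_feeds are illustrative assumptions, not taken from the original tutorial:

```python
from json import load
from requests import get

def fetch_feeds(sources_path):
    """Load the newspaper database and fetch each RSS feed.

    Returns a dict mapping newspaper id -> raw XML body (bytes).
    """
    with open(sources_path) as f:
        newspapers = load(f)  # a native Python list of dictionaries

    feeds = {}
    for paper in newspapers:
        # Send an HTTP GET request to the current newspaper's RSS URL
        # and keep the body of the response (.content is raw bytes).
        response = get(paper["rss"])
        feeds[paper["id"]] = response.content
    return feeds
```

Keeping the fetch logic in a function makes it easy to re-run the collection later when we come back for more data.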
In order to access the XML structure easily, we will use Python’s xml.etree.ElementTree library. This will allow us to traverse the XML code as a tree structure. We will create a function that processes the article data and returns it as a Python dictionary.
import xml.etree.ElementTree as ET
We first parse the XML with the fromstring() function, which takes the XML code as plain text and turns it into a structured XML object. Then we create an iterator over all the pieces of news (the item tags in the RSS feed). Iterating through each of them, we gather the date and the title. Finally, we append the headline dictionary to the list of headlines and, when the for loop is done, we return that list.
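A sketch of such a processing function, assuming the standard RSS 2.0 item layout (pubDate and title child tags); the function name process_rss is an illustrative choice:

```python
import xml.etree.ElementTree as ET

def process_rss(xml_content, newspaper_id):
    """Parse raw RSS XML and return a list of headline dictionaries."""
    root = ET.fromstring(xml_content)  # plain XML -> Element tree
    headlines = []
    for item in root.iter("item"):     # every piece of news in the feed
        headlines.append({
            "id": newspaper_id,
            "date": item.findtext("pubDate", default=""),
            "title": item.findtext("title", default=""),
        })
    return headlines
```

If a feed deviates from RSS 2.0 (for instance, an Atom feed using entry and updated tags), the tag names in this sketch would need to be adjusted.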
Now, we just have to add the function to the beginning of our script and call it below:
from json import load
Finally, we just have to write all the headlines to a file. We will create a TSV file, since it can be opened with Excel. Using the datetime library, we will timestamp the output file. Take a look at the final result:
from datetime import datetime
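The writing step might be sketched like this, using the csv module with a tab delimiter; the function name and the file-name pattern are assumptions for illustration:

```python
import csv
from datetime import datetime

def write_headlines(headlines):
    """Write headline dictionaries to a timestamped TSV file."""
    # Timestamp the output file, e.g. headlines_20201030_164500.tsv
    filename = datetime.now().strftime("headlines_%Y%m%d_%H%M%S.tsv")
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["id", "date", "title"])  # TSV headings first
        for h in headlines:
            writer.writerow([h["id"], h["date"], h["title"]])
    return filename
```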
Notice how we write the TSV headings first and only then query the RSS data. After getting a newspaper’s headlines, we write them to the output file, which will end up looking similar to this:
| ID | Date | Headline |
| --- | --- | --- |
| 1 | Thu, 29 Oct 20 22:04:40 +0000 | Toda España salvo cuatro comunidades se acoge al estado de alarma para encerrarse antes del puente |
| 1 | Fri, 30 Oct 20 08:07:41 +0000 | El PIB registra un crecimiento histórico del 16,7% en el tercer trimestre empujado por el consumo |
| 1 | Fri, 30 Oct 20 10:45:28 +0000 | La Xunta cierra las siete principales ciudades de Galicia y prohíbe las reuniones de no convivientes hasta el martes |
| 2 | Fri, 30 Oct 2020 15:47:19 GMT | El cierre por días de Madrid ahonda la brecha entre PP y Ciudadanos |
| 2 | Fri, 30 Oct 2020 16:20:28 GMT | El estado de alarma por coronavirus, en directo \| Illa confía en que no va a ser necesario llegar a confinamientos domiciliarios |
| 2 | Fri, 30 Oct 2020 16:31:47 GMT | Un fuerte seísmo sacude la costa turca del Egeo y causa varios muertos y decenas de heridos |
| 3 | Fri, 30 Oct 2020 15:03:43 +0100 | Illa descarta confinamientos domiciliarios: “No será necesario llegar ahí” |
| 3 | Fri, 30 Oct 2020 12:01:42 +0100 | La Xunta ordena el cierre perimetral de todas las ciudades para el puente |
| 3 | Fri, 30 Oct 2020 07:33:28 +0100 | Tarjetas de crédito para pagar sin intereses en el Black Friday y Navidad |
For reasons of space, I have shortened the output to just nine lines.
In this tutorial we have learnt how to create a large corpus by fetching data from online newspapers. These media usually offer an RSS feed where the daily news can be accessed already processed as XML. Collecting thousands of news headlines is as easy as compiling the list of media we want to query, making some GET requests, and processing the XML responses into readable data.