Introduction
So far we have learned how to collect different types of data from different sources and for different purposes. In this tutorial, we are going to put that into practice: we will create a specific corpus of tweets containing film and TV show titles, which will later be used to build a language model for Named Entity Recognition.
Prerequisites
- Basic knowledge of Python
- Read previous wikis:
Putting things together
As we already know, the Twitter API only responds to literal string queries; it supports neither regular expressions nor other complex forms of querying. We also have to bear in mind that Twitter offers a limited quota of tweets per request. Consequently, we will have to get a sample of titles before running any query and then send a single request per title, so we can make the most of each call.
Getting the titles
In order to get relevant titles, we will retrieve the currently most popular ones via the TMDB (The Movie Database) API. To use that API, we just have to sign up and apply for an API key. There is a good number of endpoints returning different data. We will be using `/discover/tv` and `/discover/movie`, which allow us to query for TV shows and movies by popularity.
As we already know, HTTP requests in Python are carried out with the `requests` library. We will be sending some filters to the API:
- To retrieve results in Spanish, we will use the `language` filter with the value `es-ES`.
- The TMDB API returns paginated results. This means we don't get the whole list in one go but just a page with some of the items; if we want more, we need to request the next page. We will use the `page` filter to get each page's results.
- Twitter will stop returning tweets once we exceed the free quota, and that might happen anywhere in the middle of the list of titles, so we will use the `sort_by` filter with the value `popularity.desc` to receive results from the most popular to the least.
- Our API key has to be sent as a filter as well.
The TMDB API endpoints that we are querying receive data in the URL query string (check out URL syntax on Wikipedia), which requires some formatting and string encoding. We will use the `urlencode()` function from `urllib.parse` to render our query.
```python
from urllib.parse import urlencode
```
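A minimal sketch of how the query might be rendered with the filters listed above; `api_key` is the parameter name TMDB expects for the key, and `<API_KEY>` is a placeholder for your own key:

```python
# Filters for the /discover endpoints, encoded into a URL query string.
page = 1
query = urlencode({
    'api_key': '<API_KEY>',          # your TMDB API key
    'language': 'es-ES',             # results in Spanish
    'sort_by': 'popularity.desc',    # most popular first
    'page': page,                    # page number, incremented on each round
})
# query == 'api_key=%3CAPI_KEY%3E&language=es-ES&sort_by=popularity.desc&page=1'
```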
Observe how the `page` value is not passed as a literal string but as a variable. This will allow us to increment it to retrieve the results of every page. Don't forget to substitute `<API_KEY>` with your API key.
We will append the query to the API URL, send a request to both the movie and the TV show endpoints, and keep the body of each response through the `.text` attribute. As the response is JSON-structured, we will also use the `loads()` function from the `json` library to turn that JSON into a Python dictionary, so that we can access the content easily:
```python
from json import loads
```
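A sketch of that request loop, reusing the `query` string built above (the variable names are illustrative):

```python
import requests

for endpoint in ('https://api.themoviedb.org/3/discover/movie',
                 'https://api.themoviedb.org/3/discover/tv'):
    response = requests.get(f'{endpoint}?{query}')  # query appended to the API URL
    data = loads(response.text)                     # JSON body -> Python dictionary
    results = data['results']                       # keep only the 'results' key (processed below)
```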
That loop will run twice: once sending the GET request to https://api.themoviedb.org/3/discover/movie and once to https://api.themoviedb.org/3/discover/tv. TMDB responses don't contain the results directly; they look as follows.
```json
{
    "page": 1,
    "results": [ … ],
    "total_pages": 500,
    "total_results": …
}
```
That's why, after loading the JSON (`loads()`), we keep only the `results` key.
In order to save each result's title, we will append it to a list. There is one issue to take care of: TMDB stores titles differently depending on the kind of content. TV show titles are stored in a key named `name` and film titles under `title`, and we will have to handle this in our code. As there are only two options, we will use `try` and `except`: if the `title` key is found, it's a film; otherwise, it's a TV show. We save all the results in the `titles` list.
```python
titles = list()
```
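Continuing from the `titles` list just created, and assuming the parsed response of one page is stored in the `data` dictionary as in the sketch above, the extraction could look like this:

```python
for result in data['results']:
    try:
        titles.append(result['title'])   # films carry a 'title' key
    except KeyError:
        titles.append(result['name'])    # TV shows use 'name' instead
```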
We will have to do this for every page we get. As we saw in the API response above, there are 500 pages, so we just have to wrap all the code in a loop that iterates from 1 to 500. We will create a function, `getTitles()`, that returns the list with all the titles in it:
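A possible implementation, combining the pieces above (pages 1 to 500, both endpoints, and the `try`/`except` for the two title keys); `<API_KEY>` again stands in for your own key:

```python
import requests
from json import loads
from urllib.parse import urlencode


def getTitles():
    titles = list()
    for page in range(1, 500):
        query = urlencode({
            'api_key': '<API_KEY>',
            'language': 'es-ES',
            'sort_by': 'popularity.desc',
            'page': page,
        })
        for endpoint in ('https://api.themoviedb.org/3/discover/movie',
                         'https://api.themoviedb.org/3/discover/tv'):
            data = loads(requests.get(f'{endpoint}?{query}').text)
            for result in data['results']:
                try:
                    titles.append(result['title'])   # film
                except KeyError:
                    titles.append(result['name'])    # TV show
    return titles
```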
This function will return around 20 thousand titles.
Getting tweets
To query Twitter for data containing our titles, we will use Tweepy. Remember that there are some prior requirements to be met; check out the Creating a corpus of tweets with Python post for further information. We will use the same code to authenticate ourselves:
```python
import tweepy
```
Also, we will be sending each query with `Cursor`, keeping our credentials in a `secrets` dictionary:
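A minimal sketch of the authentication and of a single `Cursor` query; the key names inside `secrets` are illustrative placeholders for the credentials created in the previous post, and `title` stands for one title from our list:

```python
secrets = {
    'consumer_key': '<CONSUMER_KEY>',
    'consumer_secret': '<CONSUMER_SECRET>',
    'access_token': '<ACCESS_TOKEN>',
    'access_token_secret': '<ACCESS_TOKEN_SECRET>',
}

auth = tweepy.OAuthHandler(secrets['consumer_key'], secrets['consumer_secret'])
auth.set_access_token(secrets['access_token'], secrets['access_token_secret'])
api = tweepy.API(auth)

# For a given title, search for tweets quoting it literally, written in Spanish,
# and ask for the full (untruncated) text of each tweet.
cursor = tweepy.Cursor(api.search, q=f'"{title}"', lang='es', tweet_mode="extended")
```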
We need to run a `Cursor` query per title, so we will insert the request into a loop that iterates over the list returned by our `getTitles()` function:
```python
titles = getTitles()
```
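The basic shape of that loop, issuing the `Cursor` query from above once per title:

```python
for title in titles:
    # One request per title; items(100) caps each query at its first 100 results.
    for tweet in tweepy.Cursor(api.search, q=f'"{title}"', lang='es',
                               tweet_mode="extended").items(100):
        ...  # deduplication and writing, shown in the full loop below
```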
Twitter queries may return thousands of tweets. As we are planning to execute almost 20 thousand queries (one per title), and we will then have to take care of the results, we pass `100` to the `items()` method. That limits each query to its first 100 results.
When retrieving big amounts of tweets, we have to bear in mind that we might end up with repeated tweets because of retweeting. To avoid overloading our corpus with the same tweets, we will create a list (`tweet_cache`) with the IDs that are already in our corpus; that way, we will skip duplicated items. Finally, we just have to write the results of each query. This time we will store a JSONL file. This format is common among tagging tools, so we will be able to import the corpus easily. We only have to write one JSON object per line and separate them with a linebreak (`\n`). To serialize each JSON object, we will use the `dump()` function from the `json` library, which writes it straight to the file.
```python
from json import dump
```
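Putting everything together, the collection loop could look roughly like this. The file name `tweets.jsonl` is an arbitrary choice and the retweet handling through `retweeted_status` is just one way to skip duplicates; each written line mirrors the output shown below:

```python
tweet_cache = list()   # IDs of tweets already written to the corpus

with open('tweets.jsonl', 'w', encoding='utf-8') as corpus:
    for title in titles:
        for tweet in tweepy.Cursor(api.search, q=f'"{title}"', lang='es',
                                   tweet_mode="extended").items(100):
            # Retweets wrap the original status; deduplicate on the original's ID.
            original = getattr(tweet, 'retweeted_status', tweet)
            if original.id in tweet_cache:
                continue
            tweet_cache.append(original.id)
            # One JSON object per line, lines separated by a linebreak.
            dump({'text': original.full_text,
                  'meta': {'title': title, 'id': original.id}}, corpus)
            corpus.write('\n')
```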
The execution might take several hours. If your computer doesn't have enough power to handle such an amount of work, remember that you can change how many pages of titles to go through. To make the process lighter, just change `for page in range(1, 500)` to `for page in range(1, 100)` or `for page in range(1, 50)`; that will considerably lessen the number of titles. You can also retrieve fewer tweets per title: just change `tweepy.Cursor(api.search, q=f'"{title}"', lang='es', tweet_mode="extended").items(100)` to `.items(50)` or even `.items(10)`.
The output file will look like this:
1 | {"text": "'TOP Pel\u00edculas 13/11/2020\\n1 Bob Esponja: Un h\u00e9roe al rescate =\\n2 La navidad m\u00e1gica de los Jangle \u2b05\ufe0f\\n3 Operaci\u00f3n feliz navidad \u2b07\ufe0f1\\n4 El parque m\u00e1gico \u2b06\ufe0f2\\n5 Tammy \u2b07\ufe0f1\\n6 M\u00e1s all\u00e1 de la luna \u2b06\ufe0f3\\n7 Amor de calendario \u2b06\ufe0f1\\n8 La vida por delante \u2b05\ufe0f\\n9 La vida que quer\u00edamos \u2b07\ufe0f6\\n10 Bronx \u2b07\ufe0f5 https://t.co/BU6yz8PAVF'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327604891041337349}} |
Conclusion
In this wiki we have put into practice some concepts about data collection for corpus linguistics learned in previous posts. We have queried an API for popular films and TV shows and stored their titles. Then we have searched for tweets containing those titles and, with the results, built a JSONL corpus that will serve to create a language model for NER.
Check out the next wiki (Tagging a linguistic corpus) to learn how to easily curate and tag the corpus.
Check out the full script on Github.