Creating a corpus for Named Entity Recognition

Named Entity Recognition (I)

Posted by Mario Casado on 2020-11-15
Corpus linguistics, Computational linguistics

Introduction

So far we have learned how to collect different types of data from different sources and for different purposes. In this tutorial we are going to put that into practice: we will create a corpus of tweets containing film and TV show titles, which will later serve to build a language model for Named Entity Recognition.

Prerequisites

Putting things together

As we already know, the Twitter API only responds to literal string queries: it supports neither regular expressions nor other complex forms of querying. We also have to bear in mind that Twitter offers a limited quota of tweets per request. Consequently, we will have to get a sample of titles before running any query and then send a single request per title, so we can make the most of each call.

Getting the titles

In order to get relevant titles, we will retrieve the currently most popular ones via the TMDB (The Movie Database) API. To use that API, we just have to sign up and apply for an API key. There is a good number of endpoints for retrieving different data. We will be using /discover/tv and /discover/movie, which allow querying for TV shows and movies by popularity.

As we already know, HTTP requests in Python are carried out with the requests library. We will be sending some filters to the API:

  • To retrieve results in Spanish, we will use the language filter with the value es-ES.
  • The TMDB API returns paginated results. This means we don’t get the whole list in one go but just a page with some of them. If we want more, we need to request the next page. We will use the page filter to get each page’s results.
  • Twitter will stop returning tweets once we exceed the free quota, and that might happen anywhere in the middle of the list of titles, so we will use the sort_by filter with the value popularity.desc to receive results from the most popular to the least.
  • Our API key has to be sent as a filter as well.

The TMDB API endpoints that we are querying receive data in the URL query string (check out URL syntax on Wikipedia), which requires some formatting and string encoding. We will use urllib.parse's urlencode() function to build our query.

from urllib.parse import urlencode

query = urlencode({
    'api_key': <API_KEY>,
    'language': 'es-ES',
    'page': page,
    'sort_by': 'popularity.desc'
})

Observe how the page value is not passed as a string but as a variable. This will allow us to increment it and fetch the results of every page. Don’t forget to substitute <API_KEY> with your API key.
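If you prefer not to hard-code the key in the script, one option is to read it from an environment variable. This is just a sketch, not part of the original script, and the variable name TMDB_API_KEY is an arbitrary example:

from os import environ
from urllib.parse import urlencode

# Assumes you have exported the key beforehand, e.g.
#   export TMDB_API_KEY="your_key_here"
# (the variable name is a made-up example)
query = urlencode({
    'api_key': environ['TMDB_API_KEY'],
    'language': 'es-ES',
    'page': 1,
    'sort_by': 'popularity.desc'
})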

We will append the query to the API URL, send a request to both the movies and the TV shows endpoints and keep the body of the response using the .text attribute. As the response is JSON structured, we will also use the loads() function from the json library to turn that JSON into a Python dictionary. That way we will be able to access the content easily:

from json import loads
from requests import get

for endpoint in ['movie', 'tv']:
    api_response = get(f'https://api.themoviedb.org/3/discover/{endpoint}?{query}').text
    results = loads(api_response)['results']

That loop will run twice: once sending the GET request to https://api.themoviedb.org/3/discover/movie and once to https://api.themoviedb.org/3/discover/tv. TMDB responses don’t contain the results directly; they look as follows.

{
  "page": 1,
  "total_results": 10000,
  "total_pages": 500,
  "results": [
    ...
  ]
}

That’s why, after loading the JSON (loads()), we keep only the results key.

In order to save each result’s title, we can use Python 3’s list comprehensions. We have to take care of one issue: TMDB stores titles differently depending on the endpoint. TV show titles are stored under a key named name and film titles under title. We will have to handle this in our code. As there are only two options, we will use try and except: if the title key is found, it’s a film; otherwise, it’s a TV show. We save all the results in the titles list.

titles = list()
try:
    # film results store the title under 'title'
    film_titles = [result['title'] for result in results]
    titles.extend(film_titles)
except KeyError:
    # TV results store it under 'name'
    tv_titles = [result['name'] for result in results]
    titles.extend(tv_titles)

We will have to do this as many times as there are pages. As we saw in the API response above, there are 500 pages, so we just have to wrap all the code in a loop that iterates from 1 to 500. We will create a function that returns the list with all the titles in it.

def getTitles():
    titles = list()
    for endpoint in ['movie', 'tv']:
        for page in range(1, 501):
            query = urlencode({
                'api_key': <API_KEY>,
                'language': 'es-ES',
                'page': page,
                'sort_by': 'popularity.desc'
            })
            api_response = get(f'https://api.themoviedb.org/3/discover/{endpoint}?{query}').text
            results = loads(api_response)['results']
            try:
                film_titles = [result['title'] for result in results]
                titles.extend(film_titles)
            except KeyError:
                tv_titles = [result['name'] for result in results]
                titles.extend(tv_titles)
    return titles

This function will return around 20 thousand titles.
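If you want a quick sanity check after defining the function, something like the following should do (the exact count depends on TMDB’s catalogue at the time you run it):

titles = getTitles()
print(len(titles))   # roughly 20,000 titles
print(titles[:5])    # a few of the currently most popular film titles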

Getting tweets

To query Twitter for tweets containing our titles, we will use Tweepy. Remember there are some prerequisites to be met; check out the Creating a corpus of tweets with Python post for further information. We will use the same code to authenticate:

import tweepy

def oauth(secrets):
    auth = tweepy.OAuthHandler(secrets["api_key"], secrets["api_key_secret"])
    auth.set_access_token(secrets["access_token"], secrets["access_token_secret"])
    api = tweepy.API(auth, wait_on_rate_limit=False, wait_on_rate_limit_notify=True)
    return api

Also, we will be sending each query with Cursor:

secrets = {
    "api_key": "<API_KEY>",
    "api_key_secret": "<API_KEY_SECRET>",
    "access_token": "<ACCESS_TOKEN>",
    "access_token_secret": "<ACCESS_TOKEN_SECRET>"
}
api = oauth(secrets)
tweets = tweepy.Cursor(api.search, q='"<TITLE>"', lang='es', tweet_mode="extended").items()

We need to run one Cursor query per title, so we will insert the request into a loop that iterates over the list that our getTitles() function returns:

titles = getTitles()
for title in titles:
    tweets = tweepy.Cursor(api.search, q=f'"{title}"', lang='es', tweet_mode="extended").items(100)

Twitter queries may return thousands of tweets. As we are planning to execute almost 20 thousand queries (one per title) and we will then have to process all of the results, we pass 100 to the items() method. That limits each query to its first 100 results.

When retrieving large amounts of tweets, we have to bear in mind that we might end up with repeated tweets because of retweeting. To avoid overloading our corpus with the same tweets, we will keep a list (tweet_cache) with the IDs that are already in our corpus. That way, we will skip duplicated items.

Finally, we just have to write the results of each query. This time we will store a JSONL file. This format is common among annotation tools, so we will be able to import the corpus easily. We only have to write each JSON object on its own line, separating them with a line break (\n). In order to convert the JSON objects to writable strings, we will use the json library’s dump() function.

from json import dump

titles = getTitles()
tweet_cache = list()
with open('tweets.jsonl', 'w') as w:
    for title in titles:
        tweets = tweepy.Cursor(api.search, q=f'"{title}"', lang='es', tweet_mode="extended").items(100)
        for tweet in tweets:
            status = tweet._json
            # If the tweet is a retweet, work with the original status instead
            if "retweeted_status" in status:
                status = status["retweeted_status"]
            # Skip tweets whose ID is already in the corpus
            if status["id"] not in tweet_cache:
                data = {
                    'text': repr(status["full_text"]),
                    'meta': {
                        'id': status["id"],
                        'title': title
                    }
                }
                dump(data, w, ensure_ascii=False)
                w.write('\n')
                tweet_cache.append(status["id"])

The execution might take several hours. If your computer doesn’t have enough power to handle such an amount of work, remember that you can reduce how many pages of titles you go through. To make the process lighter, just change for page in range(1, 501) to for page in range(1, 101) or for page in range(1, 51). That will considerably reduce the number of titles. You can also retrieve fewer tweets per title: just change .items(100) in the Cursor call to .items(50) or even .items(10).
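If you would rather not edit the loop by hand every time, one option (not part of the original script) is to expose the page limit as a parameter. A minimal sketch, keeping the same logic as the getTitles() function above:

from json import loads
from urllib.parse import urlencode
from requests import get

API_KEY = '<API_KEY>'  # substitute with your TMDB API key

def getTitles(max_pages=500):
    # Same logic as before, but the number of pages per endpoint is configurable
    titles = list()
    for endpoint in ['movie', 'tv']:
        for page in range(1, max_pages + 1):
            query = urlencode({
                'api_key': API_KEY,
                'language': 'es-ES',
                'page': page,
                'sort_by': 'popularity.desc'
            })
            api_response = get(f'https://api.themoviedb.org/3/discover/{endpoint}?{query}').text
            results = loads(api_response)['results']
            try:
                titles.extend([result['title'] for result in results])
            except KeyError:
                titles.extend([result['name'] for result in results])
    return titles

# A lighter run: only the 50 most popular pages per endpoint
titles = getTitles(max_pages=50)

The same idea applies to the tweet limit: the value passed to items() could be exposed as a parameter of the collection loop as well.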

The output file will look like this:

{"text": "'TOP Pel\u00edculas 13/11/2020\\n1 Bob Esponja: Un h\u00e9roe al rescate =\\n2 La navidad m\u00e1gica de los Jangle \u2b05\ufe0f\\n3 Operaci\u00f3n feliz navidad \u2b07\ufe0f1\\n4 El parque m\u00e1gico \u2b06\ufe0f2\\n5 Tammy \u2b07\ufe0f1\\n6 M\u00e1s all\u00e1 de la luna \u2b06\ufe0f3\\n7 Amor de calendario \u2b06\ufe0f1\\n8 La vida por delante \u2b05\ufe0f\\n9 La vida que quer\u00edamos \u2b07\ufe0f6\\n10 Bronx \u2b07\ufe0f5 https://t.co/BU6yz8PAVF'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327604891041337349}}
{"text": "'@llusitrbl bob esponja: un h\u00e9roe al rescate'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327378253850570752}}
{"text": "'@Irenemate Bob Esponja un h\u00e9roe al rescate... \ud83d\ude02\ud83d\ude02\ud83d\ude02\ud83d\ude02\ud83d\ude02\ud83d\ude02'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327342726220804096}}
{"text": "'ENTRADA: Bob Esponja: Un h\u00e9roe al rescate (2020). \\nAudio en espa\u00f1ol de Espa\u00f1a.\\n https://t.co/GxmD79cjuF\\n#NetflixEspa\u00f1a'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327310255315881986}}
{"text": "\"75. Bob Esponja: Un h\u00e9roe al rescate\\nDirector:\\xa0Tim Hill\\nA\u00f1o:\\xa02020\\n\\nMe ha gustado much\u00edsimo (ignorando esa parte del campamento de verano) Son tan idiotas y geniales como los recordaba, aunque un poco m\u00e1s drogados. Ha tenido varias sorpresas que me han alegrado la tarde :') https://t.co/NcNLmbSFQb\"", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327304966688616448}}
{"text": "'\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\\n\\nBob Esponja: Un h\u00e9roe al rescate (2020) https://t.co/nwSaNTyp2d'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327297963996160002}}
{"text": "'Es \"Bob Esponja: un h\u00e9roe al rescate\" la mejor pel\u00edcula que he visto en mucho tiempo? S\u00ed.\\nTengo pruebas y cer\u00edsimo dudas.'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327294582485291009}}
{"text": "'TOP Pel\u00edculas 12/11/2020\\n1 Bob Esponja: Un h\u00e9roe al rescate =\\n2 Operaci\u00f3n feliz navidad =\\n3 La vida que quer\u00edamos \u2b06\ufe0f6\\n4 Tammy \u2b07\ufe0f1\\n5 Bronx \u2b07\ufe0f1\\n6 El parque m\u00e1gico \u2b06\ufe0f1\\n7 Cementerio de animales \u2b07\ufe0f2\\n8Amor de calendario \u2b07\ufe0f2\\n9 M\u00e1s all\u00e1 de la luna \u2b07\ufe0f1\\n10 Centuri\u00f3n = https://t.co/X8qwbNwcDN'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1327218978935148546}}
{"text": "'43- Bob esponja: un H\u00e9roe al rescate (2020): Aparece Keano Reeves. Fin. [7] https://t.co/U4rwk9ehrF'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1326912862841921539}}
{"text": "'La ausencia de grandes estrenos en cines o en plataformas nos ha llevado a tener que escribir sobre #bobesponjaalrescate. Lo siento, pero las cosas son as\u00ed. Aqu\u00ed ten\u00e9is la opini\u00f3n de nuestros redactores. \\n\\n\ud83d\udcfd\ufe0f\ud83c\udfacLINK> https://t.co/dwYqd954JQ https://t.co/LBwigAymGn'", "meta": {"title": "Bob Esponja: Un h\u00e9roe al rescate", "id": 1326887361989533697}}

Conclusion

In this wiki we have put into practice some concepts about data collection for corpus linguistics that we had learned in previous posts. We have queried an API for popular films and TV shows and stored their titles. Then we have searched for tweets containing those titles and, with the results, we have built a JSONL corpus that will serve to create a language model for NER.

Check out the next wiki (Tagging a linguistic corpus) to learn how to easily curate and tag the corpus.

Check out the full script on GitHub.