Data collection (II)

Creating a corpus of tweets with Python

Posted by Mario Casado on 2020-11-08
Corpus linguistics

Introduction

Creating a corpus can be a tedious task: collecting every single piece of data, fitting it together, making sure everything is in the right format with the right structure… Python is a great ally for this purpose: it makes it possible to collect thousands of texts in seconds and, at the same time, shape them into whatever structure best fits our needs.

Prerequisites

  • Basic knowledge of Python

1 Twitter

1.1 Apply for a developer account

Follow this tutorial to apply for a developer account: How to apply for a Twitter Developer account.

1.2 Create an app and generate keys

Once your developer account has been confirmed, you will be asked for a unique name for your first app. Complete it and click Get keys to generate the batch of access keys. You will be shown your API key, API secret key and Bearer token. Click Skip to dashboard to access your developer dashboard.

Click on the gear next to your app’s name, look for the App permissions section and click the Edit button.

Make sure Read and Write permissions are checked and click Save.

Now we need to regenerate our keys to make sure the new permissions are granted. Look for the Keys and tokens tab at the top of the page, under the app name. Regenerate the API Key & Secret. You will be shown a new pair of keys. Write them down and keep them. Also generate and keep an Access Token & Secret pair.

Don’t forget to write down and keep all the keys.

With the four keys, we will be able to connect to Twitter from our Python script.

2 Python

To easily access Twitter’s different options, we will be using Python’s Tweepy library. First of all, we need to install the library on our system. Working inside a virtual environment is highly recommended. To install Tweepy, run $ pip install tweepy in a terminal.

2.1 Authentication

In order to connect to our app and make full use of Twitter, we have to prove that we are allowed to use the app; the four keys serve that purpose. Tweepy provides an authentication handler to which we hand the keys and, in return, gives us an object through which we can access Twitter. We will create a function that does all of that:

import tweepy

def oauth(secrets):
    # Authenticate with the API key and secret, then attach the access token
    auth = tweepy.OAuthHandler(secrets["api_key"], secrets["api_key_secret"])
    auth.set_access_token(secrets["access_token"], secrets["access_token_secret"])
    # Do not wait when the rate limit is reached, but print a notification
    api = tweepy.API(auth, wait_on_rate_limit=False, wait_on_rate_limit_notify=True)
    return api

We define a function called oauth to which we pass a dictionary with our keys. We use Tweepy’s OAuthHandler object: we initialize it with our API key and secret and then load the access token and its secret into the returned object. At this point, the auth variable is a Tweepy authentication handler that holds the four keys.

To access Twitter, we need to create a Tweepy API instance. Observe how we initialize it by passing the auth variable. This way, when we query Twitter, the auth object passes our keys to the API and we are granted access. We also specify that we don’t want the script to wait whenever we reach the rate limit of tweets to retrieve (wait_on_rate_limit=False), but we do want to be notified when this happens (wait_on_rate_limit_notify=True). This means that, whenever we hit the limit, the script will stop execution and show a message informing us that Twitter stopped returning tweets.
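If you would rather have the script pause and resume automatically once Twitter’s rate limit window resets, instead of stopping, Tweepy also accepts the opposite setting; a minimal variant of the same call:

# Alternative: sleep until the rate limit resets instead of stopping
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

With this configuration the script sleeps until new requests are allowed and prints a notification while it waits, so long collections can run unattended.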

2.2 Testing the connection

It is highly recommended to test that the keys and the connection are working properly. The fastest way to do so is to send a test tweet. To do that, we just have to create the dictionary with the keys and use the update_status() function to post a text as a tweet:

secrets = {
    "api_key": "<API_KEY>",
    "api_key_secret": "<API_KEY_SECRET>",
    "access_token": "<ACCESS_TOKEN>",
    "access_token_secret": "<ACCESS_TOKEN_SECRET>"
}
api = oauth(secrets)
api.update_status('Hello, world!')

Don’t forget to replace the placeholders with your keys. Observe that we use the oauth() function we defined before, which creates the API object that grants us access to the Twitter app. To send the tweet, we just pass the text we want to post to the update_status() function. Go to your developer account’s timeline: the tweet should have been posted there.
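Hardcoding the keys in the script is risky if you ever share the code. As an alternative (a minimal sketch, assuming the keys are saved in a file called secrets.json that is kept out of version control, with the same four field names as the dictionary above), they can be loaded from a separate file:

import json

# Load the four keys from secrets.json instead of hardcoding them
with open("secrets.json") as f:
    secrets = json.load(f)

api = oauth(secrets)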

2.3 Querying tweets

The basic API search function is very limited. As we want to retrieve a considerable number of tweets for our corpus, we will be using Tweepy’s Cursor, which allows us to make iterative searches and retrieve the results in a single ordered list. We will use the following line to search for tweets containing the exact string desde que:

tweets = tweepy.Cursor(api.search, q='"desde que"', lang='es', tweet_mode="extended").items()

As you can see, we just have to indicate that we want to use Cursor with the API’s search() function, the query string (q), the filter for Spanish tweets (lang) and the extended mode, which returns the full 280-character text of each tweet. As we want to match the exact keyword sequence and not tweets containing any of the words, it is very important to include the double quotation marks. Finally, we call the items() function to retrieve the Cursor results as an iterator.
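If you want to cap the size of the collection instead of downloading everything up to the rate limit, items() also accepts a maximum number of tweets; a minimal variant of the same line (500 is just an example figure):

# Stop after at most 500 tweets
tweets = tweepy.Cursor(api.search, q='"desde que"', lang='es', tweet_mode="extended").items(500)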

2.4 Writing the tweets

Finally, we just have to iterate over the tweets, pick the data we are interested in and write it to a file:

with open("tweets.tsv", "w") as w:
    # Write the column headers first
    w.write("tweet_id\tdate\ttext\tuser_id\n")
    for tweet in tweets:
        status = tweet._json
        # If the tweet is a retweet, keep the original tweet instead
        if "retweeted_status" in status:
            status = status["retweeted_status"]
        w.write(
            "{tweet_id}\t{date}\t{text}\t{user_id}\n".format(
                tweet_id=status["id"],
                date=status["created_at"],
                text=repr(status["full_text"]),
                user_id=status["user"]["id_str"]
            )
        )

We write a TSV file since it can be handled with Excel. Before starting the loop over the iterator of tweets, we write the headers to the file. For each tweet, we keep the _json attribute, which contains all the tweet information structured as a dictionary. Then we check whether it contains a retweeted_status. If so, we keep only the original tweet, as retweets rarely add any interesting information. At that point we have the proper tweet in the status variable, so we just write its data to the TSV file. We write the tweet ID to uniquely identify each element, the date when the tweet was posted, the text of the tweet and the user ID (not the name or the nickname) to preserve the user’s anonymity.
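Since we set wait_on_rate_limit=False, the iteration stops with an error as soon as Twitter cuts us off. If you prefer to exit cleanly and keep whatever was collected, one option (a minimal sketch, not part of the original script) is to catch Tweepy’s rate-limit exception around the loop:

try:
    with open("tweets.tsv", "w") as w:
        w.write("tweet_id\tdate\ttext\tuser_id\n")
        for tweet in tweets:
            status = tweet._json
            if "retweeted_status" in status:
                status = status["retweeted_status"]
            w.write("{}\t{}\t{}\t{}\n".format(
                status["id"], status["created_at"],
                repr(status["full_text"]), status["user"]["id_str"]
            ))
except tweepy.RateLimitError:
    # The with block has already closed the file, so the partial corpus is kept
    print("Rate limit reached: the tweets collected so far are in tweets.tsv")

Because the file is opened inside the try block, everything written before the limit was hit remains in tweets.tsv.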

The result is a tab-separated file with one row per tweet, containing the tweet ID, the date, the text and the user ID.
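As a quick sanity check (a minimal sketch, not part of the original script), the file can be read back with Python’s built-in csv module:

import csv

# Read the corpus back, one dictionary per tweet
with open("tweets.tsv", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        print(row["tweet_id"], row["text"])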

Conclusion

In this tutorial we have learned how to search for tweets containing specific strings of text and how to create a corpus with them. For that, we have activated a Twitter developer account, created an app to access the Twitter API and generated a batch of keys to connect to the app from our computer. We have also worked with Python’s Tweepy library to develop a script that searches for all the tweets matching our criteria until it reaches Twitter’s API rate limit. The result is a TSV file with the structured data.

You can check the final script on GitHub.