Introduction
When carrying out experimental research in linguistics, we often handle audio corpora, which involves managing hundreds of audio files whose names are, luckily, a timestamp and a code. Under other circumstances, we would keep those filenames (after all, they are unique) and build a database that links each piece of information to the unique name of its file. However, linguistic research on audio files is usually done with Praat, whose scripting system is not designed to cope with data structures. It has no types such as lists and objects, which makes it impossible to count on any external database support. The only information we will have is what the files themselves provide.
The issue that arises from this situation is clear: are we meant to rename, one by one, every file that we will use in Praat so that its name carries that information? The answer is also clear: no! We have Python to automate the process.
Prerequisites
- Basic knowledge of Python
- Basic knowledge of structured data files
Starting point
Suppose we have been carrying out experimental research in which we recorded many speakers several times in a row. This usually means that we have several recordings for each participant. We will examine two possible situations. In the ideal starting point, you have kept some kind of record of the filenames and the information that goes with them, for example a TSV file with two columns, one containing the filename and the other a participant ID:
file | participant |
---|---|
171211_001.WAV | 1 |
171211_002.WAV | 1 |
171211_003.WAV | 1 |
171212_002.WAV | 2 |
171212_003.WAV | 2 |
171212_004.WAV | 2 |
171212_008.WAV | 3 |
171212_009.WAV | 3 |
171212_010.WAV | 3 |
171213_008.WAV | 4 |
171213_009.WAV | 4 |
171213_010.WAV | 4 |
171214_002.WAV | 5 |
171214_003.WAV | 5 |
171214_004.WAV | 5 |
This tracking isn’t always easy to keep, though. Sometimes we can only take some quick handwritten notes. For that reason, we will also consider a scenario where our starting data is not on the computer.
1 Getting it into format
1.1 Scenario 1: I have the files tracked
In this case, we just need to structure each participant’s data. Let’s create a dictionary of dictionaries, keyed by participant ID, with some data that might represent variables in our experiment.
participants = {
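Only the first line of that snippet is shown here, so the following is a minimal sketch of what such a structure might look like. The field names (`language`, `level`, `group`) and the values for participant 2 are hypothetical, chosen only so that the result is consistent with the formatted names shown later in this article.

```python
# Hypothetical participant metadata, keyed by participant ID.
# The IDs are strings because they will be matched against values read from a TSV file.
# Field names and values are illustrative assumptions, not the original data.
participants = {
    "1": {"language": "ES", "level": 2, "group": 1},
    "2": {"language": "EN", "level": 1, "group": 2},
}

# Access one participant's data through their ID.
print(participants["1"]["language"])  # ES
```

Keying the outer dictionary by ID makes each lookup a single step, with no need to search through a list.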
Now we have to load the TSV file in our script. For that purpose, we will use the csv library, which will create a dictionary for each audio file in the list. Each of those dictionaries will contain one key for the file name and another for the linked participant. We will be able to easily iterate through the list of file names, matching each one against its participant’s information thanks to the participant ID that appears in both data structures.
from csv import DictReader
With the DictReader class we get a list of dictionaries, one for each row in the TSV, with each column as a key-value pair:
[{
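As a rough illustration of the loading step, the sketch below reads two rows of the table above with `DictReader`. To stay self-contained it reads from an in-memory string (`tsv_text`, a stand-in invented for this example) rather than from a file on disk; with a real file you would pass the object returned by `open(..., newline="")` instead.

```python
from csv import DictReader
from io import StringIO

# Stand-in for the TSV file: a header row plus two data rows, tab-separated.
tsv_text = "file\tparticipant\n171211_001.WAV\t1\n171211_002.WAV\t1\n"

# delimiter="\t" tells DictReader that the columns are separated by tabs.
reader = DictReader(StringIO(tsv_text), delimiter="\t")

# DictReader is an iterator, so we turn it into a list to reuse it later.
audio_files = list(reader)

print(audio_files[0]["file"], audio_files[0]["participant"])  # 171211_001.WAV 1
```

Note that every value comes back as a string, including the participant ID.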
After that, we gather all the info that we want to include in the formatted name in a new dictionary called audio_info. As you can see, we strip the extension from the file name and access the participant’s data through the ID included in the TSV. Once everything is collected, we concatenate all the values of audio_info, separated by an underscore (_), and print them to check that everything went well:
171211_001_1_ES_2_1.WAV
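The steps just described could be sketched as follows. This is not necessarily the original script: the keys of `audio_info` and the participant fields are assumptions made for illustration, and the input data is reduced to a single row.

```python
from os.path import splitext

# Hypothetical data, shaped like the structures described above.
participants = {"1": {"language": "ES", "level": 2, "group": 1}}
audio_files = [{"file": "171211_001.WAV", "participant": "1"}]

for audio in audio_files:
    # Separate the file name from its extension: ("171211_001", ".WAV").
    stem, extension = splitext(audio["file"])
    participant = participants[audio["participant"]]

    # Collect, in order, every piece of information that goes into the new name.
    audio_info = {
        "name": stem,
        "participant": audio["participant"],
        "language": participant["language"],
        "level": participant["level"],
        "group": participant["group"],
    }

    # Join the values with underscores; str() is needed for the integer fields.
    new_name = "_".join(str(value) for value in audio_info.values()) + extension
    print(new_name)  # 171211_001_1_ES_2_1.WAV
```

Dictionaries keep their insertion order in Python 3.7 and later, which is what keeps the fields of the new name in a predictable sequence.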
1.2 Scenario 2: I didn’t do any tracking
That is not a big problem. First of all, we need to get the list of files that we will be handling. We will use Python’s os library, whose name stands for operating system (OS). This library provides an interface to the operating system we are working on.
from os import listdir
Using a list comprehension and the listdir() function, we iterate over all the files in the corpus/ directory, keep only those whose names end in .WAV, and store them in a list named filelist. We also use the sorted() function, because listdir() returns entries in arbitrary order and we need the list to be in the right order.
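A minimal sketch of that listing step is shown below. So that it can run anywhere, it builds a temporary directory with a few empty files in place of the real corpus/ folder; the file names are taken from the table above, plus a hypothetical notes.txt that the filter should ignore.

```python
from os import listdir
from pathlib import Path
from tempfile import TemporaryDirectory

# A temporary directory stands in for the real corpus/ folder.
with TemporaryDirectory() as corpus:
    for name in ["171211_002.WAV", "171211_001.WAV", "notes.txt"]:
        Path(corpus, name).touch()

    # Keep only the .WAV files and sort them by name.
    filelist = sorted(f for f in listdir(corpus) if f.endswith(".WAV"))

print(filelist)  # ['171211_001.WAV', '171211_002.WAV']
```

Keep in mind that `endswith(".WAV")` is case-sensitive, so files ending in .wav would be skipped.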
The approach may vary depending on how the list of files relates to the participants. In our case, as can be seen in the table above, each participant has three recordings, so it’s as simple as looping through the list and assigning a new participant ID every three iterations. We will use the participants variable from above. Take a look.
count = 1
We loop through the list of files and, on every fourth iteration, increment participant_id and reset count so that it tracks another three iterations. It is also important to remember to increment count at the end of the loop body; otherwise this variable will never change from its starting value. Note that this time participant_id is an integer, as we need to increment its value every three files. That’s why we need to turn it into a string when we use it to access a participant’s info. Remember that the ID is a string in the participants variable.
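One way the counting loop could look is sketched below, under the assumption that every participant has exactly three recordings and that the sorted file list follows the participant order. The names `RECORDINGS_PER_PARTICIPANT`, `assignments` and `by_index` are introduced here for illustration only.

```python
# Six files from the table above: three for participant 1, three for participant 2.
filelist = [
    "171211_001.WAV", "171211_002.WAV", "171211_003.WAV",
    "171212_002.WAV", "171212_003.WAV", "171212_004.WAV",
]
RECORDINGS_PER_PARTICIPANT = 3  # assumption: the same number for everyone

assignments = {}
count = 1
participant_id = 1
for filename in filelist:
    # On the fourth file of a group, move on to the next participant.
    if count > RECORDINGS_PER_PARTICIPANT:
        participant_id += 1
        count = 1
    # Store the ID as a string, since that is how participants is keyed.
    assignments[filename] = str(participant_id)
    count += 1

# The same mapping, computed from each file's position with integer division.
by_index = {
    filename: str(index // RECORDINGS_PER_PARTICIPANT + 1)
    for index, filename in enumerate(filelist)
}

print(assignments["171212_002.WAV"])  # 2
```

The second version avoids the manual counter, but both break down as soon as a participant has a different number of recordings, in which case a tracking file like the one in Scenario 1 is the safer option.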
This is the output that we get, the same result as before.
171211_001_1_ES_2_1.WAV
2 Changing file names
Now, for the final part, let’s make a copy of each file under its new name. We will use the copy2() function from the shutil library, but first, open your file browser and create a new folder in the project called format_files/. Then we just have to replace the print() call with the following one.
from shutil import copy2
This function takes the source path as its first argument and the destination path as its second. To change the name, we just have to put the new name in the destination path.
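The copy step might be sketched like this. To keep the example runnable without a real corpus, it creates both folders inside a temporary directory and uses a small placeholder file in place of an actual recording; in a real project the paths would point at your own corpus/ and format_files/ folders.

```python
from pathlib import Path
from shutil import copy2
from tempfile import TemporaryDirectory

with TemporaryDirectory() as project:
    # Stand-ins for the project's corpus/ and format_files/ folders.
    corpus = Path(project, "corpus")
    formatted = Path(project, "format_files")
    corpus.mkdir()
    formatted.mkdir()

    # A placeholder file in place of a real recording.
    source = corpus / "171211_001.WAV"
    source.write_bytes(b"placeholder audio data")

    # First argument: source path. Second argument: destination path with the new name.
    copy2(source, formatted / "171211_001_1_ES_2_1.WAV")

    copied = sorted(path.name for path in formatted.iterdir())
    original_kept = source.exists()

print(copied)  # ['171211_001_1_ES_2_1.WAV']
```

Because the files are copied rather than renamed in place, the original corpus stays untouched if something goes wrong. `copy2()` also tries to preserve file metadata such as modification times.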
You can check the final script on GitHub.
Conclusion
In this wiki we have learnt how to format file names in bulk so that their data can be processed by applications that don’t allow the use of regular databases. One such case is Praat, which we use very often when carrying out acoustic research. With the short script that we have developed, we will be able to format a large corpus with very little effort.