Experimental linguistics with Python (II)

Handling corpus files massively

Posted by Mario Casado on 2020-10-21
Experimental linguistics

Introduction

When carrying out experimental research in linguistics, we often work with audio corpora, which means managing hundreds of audio files whose names are, if we are lucky, a timestamp and a code. Under other circumstances, we would keep those filenames (after all, they are unique) and build a database linking each unique name to its pieces of information. However, linguistic research on audio files is usually done with Praat, whose scripting system is not intended to cope with data structures. There aren’t such types as lists and objects, which makes it impossible to rely on any external database support. We will only have the information provided by the files themselves.

The question that arises from this situation is clear: are we meant to rename, one by one, every file that we will use in Praat so that its name carries that information? The answer is also clear: no! We have Python to automate the process.

Prerequisites

  • Basic knowledge of Python
  • Basic knowledge of structured data files

Starting point

The idea is that we have been running an experiment in which we recorded many speakers several times in a row, so we have several recordings for each participant. We will examine two possible situations. In the ideal starting point, you have kept some kind of record linking filenames to their information: for example, a TSV file with two columns, one containing the filename and the other a participant ID.

file participant
171211_001.WAV 1
171211_002.WAV 1
171211_003.WAV 1
171212_002.WAV 2
171212_003.WAV 2
171212_004.WAV 2
171212_008.WAV 3
171212_009.WAV 3
171212_010.WAV 3
171213_008.WAV 4
171213_009.WAV 4
171213_009.WAV 4
171214_002.WAV 5
171214_003.WAV 5
171214_004.WAV 5

This tracking isn’t always easy to keep, though. Sometimes we can only take some quick handwritten notes. For that reason, we will also consider a scenario where our starting data is not on the computer.

1 Getting it into format

1.1 Scenario 1: I have the files tracked

In this case, we just need to structure each participant’s data. Let’s create a dictionary of dictionaries, keyed by participant ID, with some data that might represent variables in our experiment.

participants = {
    '1': {
        'origin': 'ES',
        'group': '2',
        'condition': '1'
    },
    '2': {
        'origin': 'ES',
        'group': '2',
        'condition': '2'
    },
    '3': {
        'origin': 'ES',
        'group': '1',
        'condition': '1'
    },
    '4': {
        'origin': 'ES',
        'group': '2',
        'condition': '1'
    },
    '5': {
        'origin': 'ES',
        'group': '1',
        'condition': '2'
    }
}
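As a small illustration of how this structure is read, the sketch below looks up one participant's variables by ID. The helper name participant_fields is hypothetical (it is not part of the original script), and only two of the five participants are repeated here to keep the example short.

```python
# Two entries copied from the participants dictionary above.
participants = {
    '1': {'origin': 'ES', 'group': '2', 'condition': '1'},
    '2': {'origin': 'ES', 'group': '2', 'condition': '2'},
}


def participant_fields(participant_id):
    """Return origin, group and condition for one participant (hypothetical helper)."""
    # The keys are strings, so str() lets both 1 and '1' find the same entry.
    info = participants[str(participant_id)]
    return [info['origin'], info['group'], info['condition']]


print(participant_fields('2'))  # ['ES', '2', '2']
print(participant_fields(1))    # ['ES', '2', '1']
```

Keeping the IDs as strings means they can be joined into a file name later without any conversion.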

Now we have to load the TSV file in our script. For that purpose, we will use the csv library, which will create a dictionary for each audio file in the list. Each of those dictionaries will contain one key for the file name and another for the linked participant. We can then easily iterate through the file names and match each one against its participant’s information, thanks to the participant ID that is present in both data structures.

from csv import DictReader
from os.path import splitext

# newline='' is what the csv documentation recommends when opening files
with open('recordings.tsv', newline='') as t:
    recordings_tsv = DictReader(t, delimiter='\t')
    for i in recordings_tsv:
        audio_info = {
            # splitext() drops only the extension; rstrip('.WAV') would strip
            # any trailing '.', 'W', 'A' or 'V' characters, not just the suffix
            'filename': splitext(i['file'])[0],
            'participant_id': i['participant'],
            'participant_origin': participants[i['participant']]['origin'],
            'participant_group': participants[i['participant']]['group'],
            'participant_cond': participants[i['participant']]['condition']
        }
        print(f'{"_".join(audio_info.values())}.WAV')


With the DictReader class we get an iterator that yields one dictionary for each row in the TSV, with each column as a key-value pair. Collected in a list, the rows look like this:

[{
    'file': '171211_001.WAV',
    'participant': '1'
},
...
]
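To see this behaviour without touching a real file, here is a minimal sketch that feeds DictReader an in-memory copy of the first two rows of the table. The variable sample_tsv is an illustrative stand-in for recordings.tsv, not part of the original script.

```python
from csv import DictReader
from io import StringIO

# In-memory stand-in for recordings.tsv (header plus the first two rows).
sample_tsv = StringIO('file\tparticipant\n171211_001.WAV\t1\n171211_002.WAV\t1\n')

reader = DictReader(sample_tsv, delimiter='\t')
rows = list(reader)  # DictReader is an iterator; list() collects every row

print(reader.fieldnames)   # column names taken from the header line
print(rows[0]['file'])     # 171211_001.WAV
print(rows[0]['participant'])  # 1
```

Note that every value comes back as a string, including the participant ID, which is why it can be used directly as a key in the participants dictionary.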

After that, we gather all the information that we want to include in the formatted name in a new dictionary called audio_info. As you can see, we strip the extension from the file name and access the participant’s data through the ID included in the TSV. Once everything is collected, we join all the values of audio_info with a _ and print the result to check that everything went well:

171211_001_1_ES_2_1.WAV
171211_002_1_ES_2_1.WAV
171211_003_1_ES_2_1.WAV
171212_002_2_ES_2_2.WAV
171212_003_2_ES_2_2.WAV
171212_004_2_ES_2_2.WAV
171212_008_3_ES_1_1.WAV
171212_009_3_ES_1_1.WAV
171212_010_3_ES_1_1.WAV
171213_008_4_ES_2_1.WAV
171213_009_4_ES_2_1.WAV
171213_009_4_ES_2_1.WAV
171214_002_5_ES_1_2.WAV
171214_003_5_ES_1_2.WAV
171214_004_5_ES_1_2.WAV
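The name-building step can also be isolated in a small function, which makes it easy to check on a single file. This is only a sketch: formatted_name is a hypothetical helper, and it uses os.path.splitext to separate the extension because str.rstrip() treats its argument as a set of characters rather than as a suffix.

```python
from os.path import splitext


def formatted_name(filename, participant_id, info):
    """Build '<stem>_<id>_<origin>_<group>_<condition><ext>' (hypothetical helper)."""
    stem, ext = splitext(filename)  # '171211_001.WAV' -> ('171211_001', '.WAV')
    parts = [stem, participant_id, info['origin'], info['group'], info['condition']]
    return '_'.join(parts) + ext


print(formatted_name('171211_001.WAV', '1',
                     {'origin': 'ES', 'group': '2', 'condition': '1'}))
# 171211_001_1_ES_2_1.WAV

# rstrip() keeps removing trailing characters found in '.WAV',
# so a stem that ends in one of those letters gets damaged:
print('VIVA.WAV'.rstrip('.WAV'))  # VI
print(splitext('VIVA.WAV')[0])    # VIVA
```

With purely numeric stems like the ones in this corpus both approaches give the same result; the difference only shows up when a stem ends in ., W, A or V.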

1.2 Scenario 2: I didn’t do any tracking

That is not a big problem. First of all, we need to get the list of files that we will be handling. We will use Python’s os library, whose name stands for Operating System (OS). This library serves as an interface to everything related to the OS we are working on.

from os import listdir

filelist = sorted([x for x in listdir('corpus/') if x.endswith('.WAV')])

Using a Python list comprehension and the listdir() function, we have iterated through all the files in the path corpus/, kept only those whose names end in the string .WAV and stored them in a list named filelist. We have also used the sorted() function, because listdir() returns names in arbitrary order and we need the list to be in the right order.
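The filtering and sorting can be tried safely on a throwaway folder. The sketch below creates a temporary directory with a few empty files (the file names are made up for the example) and also shows a case-insensitive variant, in case some recorders write a lowercase .wav extension.

```python
from os import listdir
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as corpus:
    # Empty files standing in for a real corpus folder.
    for name in ['171211_002.WAV', 'notes.txt', '171211_001.WAV', '171211_003.wav']:
        Path(corpus, name).touch()

    # Same filter as above: only names ending exactly in '.WAV'.
    filelist = sorted(x for x in listdir(corpus) if x.endswith('.WAV'))
    # Variant that ignores the case of the extension.
    any_case = sorted(x for x in listdir(corpus) if x.lower().endswith('.wav'))

print(filelist)  # ['171211_001.WAV', '171211_002.WAV']
print(any_case)  # ['171211_001.WAV', '171211_002.WAV', '171211_003.wav']
```

Because the names start with a timestamp with a fixed number of digits, alphabetical sorting matches recording order.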

There are different approaches depending on how the list of files relates to the participants. In our case, as can be seen in the table above, each participant has three recordings, so it’s as simple as looping through the list and assigning a new participant ID every three rounds. We will be using the participants variable from above. Take a look.

from os.path import splitext

count = 1
participant_id = 1
for wav in filelist:
    # After three files, move on to the next participant
    if count == 4:
        participant_id += 1
        count = 1
    audio_info = {
        # splitext() removes only the extension, unlike rstrip('.WAV')
        'filename': splitext(wav)[0],
        'participant_id': str(participant_id),
        'participant_origin': participants[str(participant_id)]['origin'],
        'participant_group': participants[str(participant_id)]['group'],
        'participant_cond': participants[str(participant_id)]['condition']
    }
    print(f'{"_".join(audio_info.values())}.WAV')
    count += 1

We loop through the list of files and, on every fourth iteration, increment participant_id and reset the count to track another three rounds. It is also important to remember the increment of count at the end of the loop; otherwise this variable would never change from its starting value. Note that this time participant_id is an integer, as we need to increment its value for each new participant. That’s why we turn it into a string when we use it to access the participants info: remember that the IDs are strings in the participants variable.

This is the output that we get. The same result as before.

171211_001_1_ES_2_1.WAV
171211_002_1_ES_2_1.WAV
171211_003_1_ES_2_1.WAV
171212_002_2_ES_2_2.WAV
171212_003_2_ES_2_2.WAV
171212_004_2_ES_2_2.WAV
171212_008_3_ES_1_1.WAV
171212_009_3_ES_1_1.WAV
171212_010_3_ES_1_1.WAV
171213_008_4_ES_2_1.WAV
171213_009_4_ES_2_1.WAV
171213_009_4_ES_2_1.WAV
171214_002_5_ES_1_2.WAV
171214_003_5_ES_1_2.WAV
171214_004_5_ES_1_2.WAV
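The same assignment can be computed without a manual counter. This is an alternative sketch, not the approach used in the script above: it derives the participant ID from each file's position with integer division, assuming (as in the example corpus) exactly three recordings per participant. The constant name RECORDINGS_PER_PARTICIPANT is introduced here for illustration.

```python
RECORDINGS_PER_PARTICIPANT = 3  # assumption taken from the example corpus

# First six file names from the table above.
filelist = ['171211_001.WAV', '171211_002.WAV', '171211_003.WAV',
            '171212_002.WAV', '171212_003.WAV', '171212_004.WAV']

assignments = {}
for index, wav in enumerate(filelist):
    # Positions 0-2 map to participant 1, positions 3-5 to participant 2, and so on.
    assignments[wav] = str(index // RECORDINGS_PER_PARTICIPANT + 1)

for wav, participant_id in assignments.items():
    print(wav, participant_id)
```

Either version breaks if a participant has more or fewer recordings than expected, so it is worth checking that the number of files is a multiple of three before renaming anything.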

2 Changing file names

Now, for the final part, let’s make a copy of each file under its new name. We will use the copy2() function from the shutil library, but first, open your file browser and create a new folder in the project called format_files/. Then we just have to replace the print() call with the following one.

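from shutil import copy2

# Both paths are relative to the working directory; adjust the source path
# if the recordings live in a subfolder such as corpus/
copy2(f'{audio_info["filename"]}.WAV', f'format_files/{"_".join(audio_info.values())}.WAV')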

This function takes the source path as its first argument and the destination path as its second. To change the name, we just have to put the new name in the destination path.
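As a slightly more defensive variant, the copy step could create the destination folder itself and refuse to overwrite an existing file. The sketch below is one possible way to do that, not the script's actual implementation: copy_with_new_name is a hypothetical helper, and the example runs inside a temporary directory with a fake audio file so that it can be tried without touching real data.

```python
from pathlib import Path
from shutil import copy2
from tempfile import TemporaryDirectory


def copy_with_new_name(source, new_name, target_dir):
    """Copy source into target_dir under new_name (hypothetical helper)."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    destination = target_dir / new_name
    if destination.exists():
        # Two source files mapping to the same new name would otherwise
        # silently overwrite each other.
        raise FileExistsError(f'{destination} already exists')
    copy2(source, destination)  # copies the content and the file metadata
    return destination


with TemporaryDirectory() as tmp:
    original = Path(tmp, 'corpus', '171211_001.WAV')
    original.parent.mkdir()
    original.write_bytes(b'fake audio')  # placeholder content, not real audio

    copied = copy_with_new_name(original, '171211_001_1_ES_2_1.WAV',
                                Path(tmp, 'format_files'))
    same_content = copied.read_bytes() == original.read_bytes()

print(copied.name, same_content)  # 171211_001_1_ES_2_1.WAV True
```

Copying rather than renaming in place keeps the original recordings untouched, so a mistake in the naming logic can be fixed by deleting format_files/ and running the script again.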

You can check the final script on GitHub.

Conclusion

In this wiki we have learnt how to rename files in bulk so that their data can be processed with applications that don’t allow the use of regular databases. One such case is Praat, which we use very often when carrying out acoustic research. With the short script that we have developed, we will be able to format a big corpus with very little effort.