Computational auditory scene analysis (CASA)

CASA is an umbrella term for approaches that analyze an acoustic scene computationally; nowadays, this domain is also referred to as machine listening. CASA deals with tasks that the human auditory system performs effortlessly, and its main goal is to teach machines (or build algorithms) to carry out the same tasks.

As a more concrete example, imagine that you are walking in a park using just your hearing (without vision). I am sure you, like most humans, can easily identify different kinds of sound events in the park, such as children playing, birds chirping, or the wind blowing. This basic task, which our auditory system performs effortlessly, is referred to as sound event classification or sound event tagging in the CASA literature. More specifically, sound event classification is the task of identifying the single most dominant sound event class in an audio recording, while identifying all the active sound event classes (possibly more than one) is referred to as sound event tagging. Humans can not only identify these sound event classes but also detect each of their instances in time, i.e., they can say at what time each sound event class was active. This task of jointly detecting the sound event classes and their corresponding start and end times is referred to as sound event detection or acoustic event detection in the CASA literature.
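To make the distinction between the three tasks concrete, here is a toy sketch (my own illustration, not a standard API; the class names, hop size, and thresholds are arbitrary) showing how each task's output can be derived from frame-wise class probabilities:

```python
import numpy as np

# Hypothetical class list for the park example.
CLASSES = ["children_playing", "birds_chirping", "wind_blowing"]

def classification(frame_probs):
    """Sound event classification: the single most dominant class."""
    clip_probs = frame_probs.mean(axis=0)      # average over time
    return CLASSES[int(np.argmax(clip_probs))]

def tagging(frame_probs, threshold=0.5):
    """Sound event tagging: all classes active anywhere in the clip."""
    clip_probs = frame_probs.max(axis=0)       # max over time
    return [c for c, p in zip(CLASSES, clip_probs) if p >= threshold]

def detection(frame_probs, hop=0.1, threshold=0.5):
    """Sound event detection: (class, onset, offset) triples in seconds."""
    events = []
    for k, name in enumerate(CLASSES):
        active = frame_probs[:, k] >= threshold
        start = None
        for t, is_active in enumerate(active):
            if is_active and start is None:
                start = t                       # event onset frame
            elif not is_active and start is not None:
                events.append((name, start * hop, t * hop))
                start = None
        if start is not None:                   # event runs to clip end
            events.append((name, start * hop, len(active) * hop))
    return events
```

Given the same clip, classification returns one label, tagging returns a set of labels, and detection additionally attaches start and end times to each label instance.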

Before presenting more CASA tasks, I want to clarify some definitions. The naming convention used for labeling generic sound events (most often) follows the <source><action> format. For example, for the sound event children playing, the source is children and the action is playing. Similarly, for the birds chirping sound event, birds is the source and chirping is the action. Further, each source can potentially generate a variety of sounds corresponding to its actions, for example (car horn, car brake, car idling) or (human speech, human cough, human clap).
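As a trivial sketch of this convention (my own helper, assuming the source and action in a label are separated by a single space):

```python
def split_label(label: str) -> tuple[str, str]:
    """Split a "<source> <action>" label into its two parts."""
    source, _, action = label.partition(" ")
    return source, action
```

For example, `split_label("children playing")` yields the pair `("children", "playing")`.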

Getting back to CASA tasks: according to the CASA literature, the human auditory system is a multichannel system, meaning it has more than one auditory input (the two ears). This enables humans to extract spatial information from the acoustic scene, i.e., how far away a particular sound event is occurring, or in which direction a sound source is moving. In the CASA literature, the task of locating the direction of a sound event and tracking its location over time is referred to as sound event localization and tracking. In fact, this spatial information is what enables the human auditory system to selectively focus on individual sound events, allowing us to converse with each other even in the presence of other speakers and noises around us. Finally, the overall behaviour of the human auditory system, which can not only detect the activity of sound events (sound event detection) but also identify their respective locations over time, is referred to as sound event localization, detection, and tracking.
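One classical way a machine can exploit two inputs, much like the two ears, is to estimate the time delay between the channels, which relates to the direction of arrival. The following is a minimal sketch (my own toy example, with an impulse test signal of my choosing, not a method described above) using cross-correlation:

```python
import numpy as np

def estimate_delay(left: np.ndarray, right: np.ndarray) -> int:
    """Samples by which `right` lags `left`.

    A positive value means the sound reached the left channel first,
    hinting that the source is towards the left.
    """
    # Cross-correlate and find the lag with the strongest alignment.
    corr = np.correlate(right, left, mode="full")
    return int(np.argmax(corr)) - (len(left) - 1)
```

A sound source off to one side reaches one microphone slightly earlier than the other; the sign and magnitude of this delay are the basic spatial cue that localization systems build on.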

The human auditory system can not only break down the acoustic scene to identify the different sound events, their activity, and their spatial locations, but can also summarize the overall acoustic content and deduce generic concepts. For example, given that a sound scene contains the sound events bird calls, children playing, and wind blowing, a human can infer that the scene is potentially a park. Similarly, if the scene contains dish sounds, human chatter, and some music, we might infer that the sound scene is a restaurant. As you might have guessed already, this task is referred to as acoustic scene classification in the CASA literature. Finally, humans can also summarize the same acoustic scenes using free-flowing grammatical text, such as "a group of children playing with occasional bird calls and a wind breeze" or "people talking and having food with some background music". This task of summarizing the contents of an audio recording with grammatical text is referred to as audio captioning.
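The park and restaurant examples above can be caricatured as a hand-written lookup (my own toy mapping, nothing like a real acoustic scene classifier, which would learn these associations from data):

```python
# Hypothetical scene-to-typical-events mapping, mirroring the examples above.
SCENE_HINTS = {
    "park": {"bird_calls", "children_playing", "wind_blowing"},
    "restaurant": {"dish_sounds", "human_chatter", "music"},
}

def guess_scene(tags: set[str]) -> str:
    """Pick the scene whose typical events overlap the detected tags most."""
    return max(SCENE_HINTS, key=lambda scene: len(SCENE_HINTS[scene] & tags))
```

The point is only the direction of the inference: from a set of recognised sound events to a single higher-level scene label.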

The tasks mentioned above are a small subset of CASA tasks; you can check out the webpage of DCASE to get an idea of other tasks (still not an exhaustive list!). My personal research work on some of the CASA tasks introduced above is summarized in the following pages.