Sound Event Localization, Detection, and Tracking

Demo video of sound event localization, detection, and tracking

The main goal of the sound event localization, detection, and tracking (SELDT) task is to recognize the sound event classes that are of interest to us, detect the respective start and end times for each instance of a sound event, and further track their spatial locations over the time they are active. The demo video above visualizes the expected output of a SELDT method. The two key observations in this video are that 1) the sound event labels appear only when the sound event is active, and 2) these labels occur around the region where the sound source is located. These two observations correspond to the two sub-tasks in the CASA literature: 1) sound event detection, and 2) sound event localization and tracking, which together fulfill all the SELDT task requirements.

Applications: The SELDT output can automatically describe social and human activities, and machines can use this description to become context-aware. For instance, robots and humanoids can use SELDT for navigation and for natural interaction with their surroundings. Smart meeting rooms can recognize the active speaker among other sound events and track their motion over time. This tracked speaker location can be used to enhance speech via beamforming for teleconferencing or automatic speech recognition applications. According to the World Health Organization, 5% of the world’s population suffers from hearing disability. With the help of SELDT, we can build assistants that help these hearing-impaired people visualize sounds and enable them to interact with the world naturally.

Challenges: In this section, we will only discuss the challenges related to acoustics and leave the challenges related to recording equipment and hardware for a later day. One of the biggest challenges for SELDT is that sound events can be ambiguous without context or visuals of the scene. This ambiguity in recognizing sound events is also exploited to create sound effects for films, an art referred to as Foley. For example, the sound of thunder can be perceptually reproduced by shaking a thin metal sheet. A catchy saying about this ambiguity: “A picture speaks a thousand words, while audio speaks a thousand pictures” - Kostas.

The detection of onset and offset times of some sound events can be ambiguous. For example, the sound of a Vehicle passing by has relatively long rise and fall times, and hence marking the onset and offset times for such events is highly subjective to the annotator. Another challenge is the intra-class variability of sound events, i.e., not all Car horns or Bird songs sound alike. Given that most sound events in real-life scenarios overlap temporally or spatially with each other, performing SELDT in such scenarios can be challenging. And finally, the usual culprits of all acoustic tasks - noise, reverberation, and low-SNR scenarios - can also pose challenges.

Data association problem: The data association problem arises especially when the two sub-tasks of sound event detection and sound event localization and tracking are performed separately on a real-life sound scene with overlapping sound events. Let us take the best-case scenario, where the sound event detection method detects two active classes - A and B - and the localization method detects two locations - M and N. Now, how do you associate each sound event with its location, or in simple words, did class A occur at location M or N? There is no straightforward way to perform this association. The problem becomes even more serious if the number of detected sound events and the number of estimated spatial locations differ!

One solution to this problem is to jointly perform the two sub-tasks of sound event detection and sound event localization and tracking. We will discuss this further using deep learning, i.e., deep neural network (DNN) based approaches, in the rest of this article. If you are interested in a more detailed and technical read, you can check out this journal paper or my Ph.D. thesis.

Deep learning (DNN) based sound event localization, detection, and tracking

The SELDT task can be approached in two ways - as a classification-only task, or as a joint classification and regression task.

Sound event localization, detection, and tracking as a classification-only task: SELDT can be treated as a multiclass, multilabel classification task as shown in the illustration above, with a binary (multi-hot) label output of C×D dimensions, where C is the number of classes in the dataset and D is the number of quantized spatial locations. For instance, in the illustration above we have C=2 classes and D=8 azimuth directions, where class A is active at 180º azimuth, and class B is active at two locations, 45º and 180º azimuth. The illustration only shows azimuth angles; however, you can imagine a similar output format for estimating both azimuth and elevation angles. Such a classification-only network can be trained by employing a sigmoid activation for each of the output classes along with a binary cross-entropy loss, as sketched below. A similar classification-only SELDT method was proposed here; however, that method only studied the localization of stationary sources and did not perform any tracking studies.
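To make the output format concrete, here is a minimal sketch of such a classification-only output head in PyTorch. The front end producing the frame features is omitted, and FEATURE_DIM, the batch size, and the dummy tensors are illustrative assumptions rather than the setup of the cited method:

```python
import torch
import torch.nn as nn

C, D = 2, 8          # sound classes and quantized azimuth directions, as in the illustration
FEATURE_DIM = 128    # hypothetical size of the per-frame feature from some front end (not shown)

# A minimal classifier head: one sigmoid output unit per class-location pair.
head = nn.Linear(FEATURE_DIM, C * D)
criterion = nn.BCEWithLogitsLoss()   # sigmoid + binary cross-entropy in one numerically stable op

features = torch.randn(16, FEATURE_DIM)   # a batch of 16 frame features (dummy values)
targets = torch.zeros(16, C * D)          # multi-hot class-location reference labels
targets[0, 0 * D + 4] = 1.0               # e.g., class 0 active at the fifth azimuth bin

logits = head(features)                   # shape: (16, C*D)
loss = criterion(logits, targets)
loss.backward()
```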

The drawbacks of this approach are identical to those of the classification-based localization methods discussed here. The resolution of spatial locations is limited to the trained directions, and the performance on unseen spatial locations is unknown. For a larger number of sound classes (C) and a higher spatial resolution (D), the output dimension explodes, resulting in skewed/imbalanced-dataset problems. This also means that to train a method with such a large output dimension, we will require large datasets with sufficient examples for each class-location pair.
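To get a feel for the scale: with a modest C = 10 classes and a 10º grid over the full sphere (36 azimuth × 18 elevation steps, i.e., D = 648), the output layer already needs C × D = 6,480 units, of which only a handful are active in any given frame - a heavily imbalanced target.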

Sound event localization, detection, and tracking as a joint classification and regression task: In this setup, sound event detection (SED) is treated as a multiclass, multilabel classification task, and the localization, or direction of arrival (DOA) estimation, is treated as a regression task, as shown in the illustration above. The output format of SED is a C-dimensional binary (multi-hot) label vector, where C is the number of classes in the dataset (C=2 in the above illustration). The regression-based DOA output is a real-valued matrix of dimension G×C×2, where 2 represents the 2-dimensional spatial vector of azimuth and elevation angles, and G is the number of instances of each of the C sound classes that you want to localize at a given time; in the illustration above, we use G=3. The output in the illustration represents two instances of class A active at locations (132º, -45º) and (90º, 20º), and one instance of class B active at location (210º, 5º). Based on this approach, we proposed a SELDT method - SELDnet - which uses G=1 in the DOA output; its implementation is publicly available here.

The outputs of SELDnet for stationary and dynamic scenes are visualized in the illustrations below. All the subplots in each figure share the same time axis. The top-left subplot of each figure shows the input feature of SELDnet - the spectrogram. The bottom two subplots of the left column show the predictions and the reference for the SED classification task, where the y-axis represents the different sound event classes and each sound class is drawn in a single, unique color across the subplots. Similarly, the corresponding reference and predicted spatial locations are visualized in the center and right-most columns of each illustration. The localization is performed in 3-dimensional space using Cartesian coordinates and is represented as the distance along the x, y, and z axes. We observe that SELDnet does a good job at sound event localization, detection, and tracking.
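Stepping back to the output format itself, here is a rough PyTorch sketch of the joint setup (with G=1, as in SELDnet): two output branches on a shared feature, binary cross-entropy for SED, and a mean-squared error on the DOAs masked to the active classes. The encoder, FEATURE_DIM, and the masked-MSE choice are assumptions for illustration, not the exact SELDnet recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 2                # number of sound classes, as in the illustration
FEATURE_DIM = 128    # hypothetical per-frame feature size; the shared encoder is omitted

class JointSELDHead(nn.Module):
    """Two branches on a shared feature: SED classification and DOA regression (G=1)."""
    def __init__(self, feature_dim, num_classes):
        super().__init__()
        self.num_classes = num_classes
        self.sed = nn.Linear(feature_dim, num_classes)        # sigmoid -> class activity
        self.doa = nn.Linear(feature_dim, num_classes * 2)    # (azimuth, elevation) per class

    def forward(self, x):
        return self.sed(x), self.doa(x).view(-1, self.num_classes, 2)

head = JointSELDHead(FEATURE_DIM, C)
x = torch.randn(16, FEATURE_DIM)                 # dummy batch of frame features
sed_ref = torch.zeros(16, C)                     # multi-hot activity reference
doa_ref = torch.zeros(16, C, 2)                  # reference (azimuth, elevation) per class

sed_logits, doa_pred = head(x)
sed_loss = F.binary_cross_entropy_with_logits(sed_logits, sed_ref)
# Regress the DOA only where the class is actually active (masked MSE).
mask = sed_ref.unsqueeze(-1)                     # shape: (16, C, 1)
doa_loss = ((doa_pred - doa_ref) ** 2 * mask).sum() / mask.sum().clamp(min=1)
loss = sed_loss + doa_loss
```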

The advantage of this approach is that we get high-resolution, continuous, real-valued DOA estimates. This regression-based DOA estimation also enables us to estimate DOA locations unseen in the training data.

Dataset collection methods for sound event localization, detection, and tracking

Learning SELDT methods in a supervised manner requires a sufficiently large training dataset. However, collecting and annotating such real-life datasets for the SELDT task is expensive and time-consuming. Dataset collection and annotation become even more tedious when it comes to supporting localization in 3-dimensional space, for which you will need to record using microphones that support 360º capture and annotation tools that let you efficiently mark location trajectories. Hence, we resort to employing realistic simulated datasets and leave the collection and annotation of real-life SELDT datasets for future studies. Additionally, simulated datasets provide accurate time boundaries and spatial location trajectories, enabling a fair evaluation of SELDT methods.

The general idea of simulated datasets is to first capture impulse responses of different spatial trajectories for a fixed microphone position. These are then convolved with isolated sound events, sampled from a large dataset with enough inter- and intra-class variability, to obtain a spatialized sound scene. Finally, we add natural ambiance noise to these recordings at different signal-to-noise ratios (a minimal sketch of this pipeline follows the list below). Multiple such SELDT datasets that were collected during my Ph.D. have been made publicly available here. These datasets cover the following variations

  • Microphone arrays

      • Circular array - eight channels,

      • Tetrahedral array - four channels, and

      • First-order Ambisonic format - four channels.

  • Impulse responses

      • Synthetic impulse responses, and

      • Real-life measured impulse responses

  • Sound scene

      • Stationary sound events

      • Dynamic sound events, with varying speeds

  • Acoustics

      • Anechoic

      • Reverberant
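As referenced above, here is a minimal single-channel sketch of the simulation pipeline: convolve a dry isolated event with a measured impulse response, then mix in ambiance at a chosen SNR. Real SELDT dataset synthesis does this per microphone channel and per moving-source trajectory, and sums many events into one scene; the dummy signals and toy decaying impulse response below are placeholders for real recordings:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(event, impulse_response, ambiance, snr_db):
    """Spatialize one dry, isolated sound event and mix in ambiance at a given SNR."""
    spatial = fftconvolve(event, impulse_response)   # place the event in the room
    noise = ambiance[: len(spatial)]                 # trim ambiance to match
    sig_pow = np.mean(spatial ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    # Scale the ambiance so that 10*log10(sig_pow / noise_pow) equals snr_db.
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return spatial + gain * noise

# Dummy usage with random signals standing in for real recordings.
fs = 24000
event = np.random.randn(fs)                                           # 1 s isolated event
ir = np.random.randn(fs // 2) * np.exp(-np.linspace(0, 8, fs // 2))   # toy decaying IR
ambiance = np.random.randn(2 * fs)                                    # ambiance recording
scene = spatialize(event, ir, ambiance, snr_db=10)
```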

The two video recordings below show the capture of a measured impulse response using the maximum length sequence (MLS) technique, and the recording of natural ambiance. You can read more details about these dataset collection strategies here.

Full azimuth impulse response recording

Natural ambiance recording

Evaluation metrics for sound event localization, detection, and tracking methods

Initially, when we started developing methods for SELDT, due to the lack of any SELDT metrics we resorted to using two stand-alone evaluation metrics - one for sound event detection and another for localization and tracking - which are briefly discussed below.


Evaluation metrics for sound event detection (SED)

As the stand-alone SED metric, we employed the established polyphonic SED metrics of F-score and error rate, calculated in segments of one second. For this, we first calculate the following intermediate statistics:

  • True positives - a sound event is active in both the reference and the prediction anywhere within the one-second segment.

  • False positives - a sound event is active in the prediction but absent in the reference.

  • False negatives - a sound event is active in the reference but absent in the prediction.

Using these intermediate statistics, we estimate the overall F-score and error rate for the entire dataset, as sketched below. More details about these SED metrics can be read here.
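For illustration, here is a small sketch of how the F-score and error rate can be computed from per-segment intermediate statistics. It follows the usual polyphonic SED formulation, but is a simplification rather than a replacement for the established evaluation toolbox:

```python
def sed_scores(segment_stats):
    """Micro-averaged F-score and error rate from per-segment (TP, FP, FN) counts.

    segment_stats: a list of (tp, fp, fn) tuples, one per one-second segment.
    """
    TP = sum(tp for tp, _, _ in segment_stats)
    FP = sum(fp for _, fp, _ in segment_stats)
    FN = sum(fn for _, _, fn in segment_stats)
    f_score = 2 * TP / max(2 * TP + FP + FN, 1)

    # The error rate decomposes each segment's errors into substitutions
    # (an insertion paired with a deletion), plus leftover deletions and
    # insertions, normalized by the number of reference events N.
    N = S = D = I = 0
    for tp, fp, fn in segment_stats:
        N += tp + fn
        S += min(fp, fn)
        D += max(0, fn - fp)
        I += max(0, fp - fn)
    error_rate = (S + D + I) / max(N, 1)
    return f_score, error_rate

# e.g., three one-second segments with (tp, fp, fn) counts:
print(sed_scores([(2, 0, 1), (1, 1, 0), (0, 1, 2)]))
```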


Evaluation metrics for sound event localization and tracking

As the stand-alone sound event localization and tracking metric, we could not identify a single established metric similar to that of the sound event detection task. In general, however, most localization papers report the following two scores:

  • the recall rate of the DOA estimates, and

  • an angular or absolute distance metric between the predicted and reference DOAs.

Since our SELDT studies tackled the localization of overlapping sound events, we made some changes to the above metrics to support our scenario. We proposed two scores, DOA error and frame recall, calculated at the frame level, to evaluate the localization and tracking performance.

  • DOA error: In the overlapping-sound-events scenario, the number of DOA estimates produced by our SELDT method can differ from the number of DOAs in the reference. Hence, as the DOA error, we compute the least sum of angular distances between the estimated and reference sets of DOAs, using the Hungarian algorithm to find the optimal assignment.

  • Frame recall: We compute the fraction of frames in the dataset where the number of DOA predictions is equal to that of the reference.

More details about these DOA metrics can be read here.
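For illustration, here is a minimal per-frame sketch of the DOA error using the Hungarian algorithm (via scipy's linear_sum_assignment), with DOAs given as (azimuth, elevation) pairs in radians; the example values are made up:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def angular_distance(doa1, doa2):
    """Great-circle angle (radians) between two (azimuth, elevation) DOAs in radians."""
    az1, el1 = doa1
    az2, el2 = doa2
    cos_sigma = (np.sin(el1) * np.sin(el2)
                 + np.cos(el1) * np.cos(el2) * np.cos(az1 - az2))
    return np.arccos(np.clip(cos_sigma, -1.0, 1.0))

def frame_doa_error(pred_doas, ref_doas):
    """Least total angular distance between predicted and reference DOA sets.

    The Hungarian algorithm picks the one-to-one assignment that minimizes
    the summed angular distance; when the set sizes differ, only
    min(len(pred), len(ref)) pairs are matched. Assumes non-empty sets.
    """
    cost = np.array([[angular_distance(p, r) for r in ref_doas]
                     for p in pred_doas])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

# Two predictions vs. three references, angles in radians.
preds = [(0.0, 0.1), (1.5, -0.2)]
refs = [(0.1, 0.1), (1.4, -0.1), (3.0, 0.5)]
print(frame_doa_error(preds, refs))
# Frame recall over a dataset is then the fraction of frames where
# len(pred_doas) == len(ref_doas).
```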


Evaluation metrics for sound event localization, detection, and tracking

Apart from the fact that the above two stand-alone metrics operate at different temporal resolutions (SED is evaluated in one-second segments, while DOA estimation is evaluated at the frame level), they are actually not suited for evaluating the performance of joint detection and localization. We can illustrate this mismatch using the figure above, which shows the predictions and references for a single time frame. Using the stand-alone SED metrics, we have one true positive (dog-dog), one false positive (cat), and two false negatives (car horn, child). On the other hand, for the stand-alone localization metrics, since we do not consider the sound labels when calculating them, we have two DOAs in the prediction and three DOAs in the reference. Hence, we first employ the Hungarian algorithm to identify the pairs of DOAs that result in the least DOA error; these two pairs would be dog-dog and child-cat. Finally, the frame recall for this individual frame is zero, because the number of DOA estimates is not equal to the number of reference DOAs.

We see that neither the SED nor the localization metrics evaluate the joint performance of localization and detection. What we really want our SELDT metric to measure is the DOA error between the reference and prediction of the same class. For example, in the above figure, the dog-dog pair is the only true positive from a joint detection and localization point of view, and hence we should compute the DOA error between them, while the remaining predictions are either false positives or false negatives. Based on this logic, we proposed two new SELDT metrics:

    • Location-aware detection

      • Mainly for those who care only about the detection, as long as the localization error is under a certain threshold.

      • A detection is considered a true positive if the detected and reference classes are the same and the spatial distance between them is less than a threshold.

      • We obtain similar intermediate statistics for the false positives and false negatives.

      • We finally compute the error rate and F-score using the above intermediate statistics.

      • Drawbacks

        1. Ignores the actual spatial error information.

        2. Not suitable for tracking-based research.


    • Detection-aware localization

      • Mostly for those from localization and tracking research who care about the actual DOA error values.

      • We use intermediate statistics identical to those of SED, but at the frame level; a true positive is when a sound class is active in both the prediction and the reference.

      • We report the distance between the predicted and reference DOAs for the above true-positive cases as the DOA error.

      • We obtain similar intermediate statistics for the false positives and false negatives.

      • As the final metrics, we use the average DOA error over the true positives, and an F-score computed from the intermediate statistics, which represents the DOA recall/precision performance.

More about these metrics can be read in our paper here.
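To make the location-aware detection statistics concrete, here is a minimal per-frame sketch. It assumes at most one instance per class in a frame, a hypothetical 20º spatial threshold, and made-up DOAs for the frame in the figure above:

```python
import numpy as np

def ang_dist_deg(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) DOAs in degrees."""
    az1, el1, az2, el2 = np.radians([a[0], a[1], b[0], b[1]])
    c = np.sin(el1) * np.sin(el2) + np.cos(el1) * np.cos(el2) * np.cos(az1 - az2)
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def location_aware_stats(pred, ref, threshold=20.0):
    """Per-frame intermediate statistics for location-aware detection.

    pred/ref: dicts mapping class name -> (azimuth, elevation) in degrees.
    threshold: spatial tolerance in degrees; the 20-degree default here is
    an assumption for illustration, not prescribed by the metric itself.
    """
    tp = fp = fn = 0
    for cls, doa in pred.items():
        if cls in ref and ang_dist_deg(doa, ref[cls]) <= threshold:
            tp += 1
        else:
            fp += 1   # wrong class, or right class but localized too far away
    for cls, doa in ref.items():
        if cls not in pred or ang_dist_deg(pred[cls], doa) > threshold:
            fn += 1   # reference event with no acceptable matching prediction
    return tp, fp, fn

# The single frame from the figure above: dog matches, cat is spurious,
# car horn and child are missed.
pred = {"dog": (30.0, 0.0), "cat": (120.0, 10.0)}
ref = {"dog": (25.0, 5.0), "car horn": (200.0, 0.0), "child": (300.0, -10.0)}
print(location_aware_stats(pred, ref))   # -> (1, 1, 2)
```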


Sound event localization and detection task at DCASE

The SELDT task discussed above was formulated as a research challenge at the annual Detection and Classification of Acoustic Scenes and Events (DCASE) workshop. In the first version of the SELD task, organized at DCASE 2019, the challenge was to develop methods that can jointly localize and detect stationary sound events in five different real-life rooms. For this task, the participants were provided with Ambisonic and tetrahedral-array format audio datasets, and their performance was evaluated using the stand-alone SED and DOA estimation metrics discussed above.

In the second version of the SELD task, organized at DCASE 2020, the challenge was to develop methods that can jointly localize, detect, and track sound events in a dynamic scene where the sound events are both stationary and moving at varying speeds. For this task, similar to the DCASE 2019 task, the participants were provided with Ambisonic and tetrahedral-array format audio datasets, recorded at 15 different indoor locations. These methods were evaluated using the new SELDT metrics that assess the performance of joint detection, localization, and tracking.

In the third version of the SELD task, in 2021, we extended the 2020 version of the SELD task dataset to make it more realistic. We added directional interferences, i.e., sound events that exist in the sound scene but have to be ignored during detection. Further, we added more overlapping sound events of the same class, to encourage participants to develop methods that support this.