Sound Event Localization and Tracking

Sound event localization, or acoustic source localization, is relative: most often you identify the location of a sound event with respect to the microphone that records the sound scene. To fully localize a sound source in three dimensions, you have to identify three components: the horizontal azimuth angle φ, the vertical elevation angle θ, and the distance r from the microphone. For example, in the figure above, the origin of the 3D space is where the microphone sits, and with respect to the microphone the bird is located at (distance, elevation, azimuth) of (r, θ, φ). Similarly, the man and the car are located at different distances, elevations and azimuths. However, most localization-based applications require only the azimuth and elevation angles; estimating just these two is referred to as direction of arrival (DOA) estimation.
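As a small illustration of the (r, θ, φ) description, the sketch below converts between spherical and Cartesian coordinates with the microphone at the origin. The axis convention is an assumption made for this example (azimuth measured in the horizontal plane from the x-axis, elevation measured upward from that plane), and the function names are hypothetical; the figure may use a different convention.

```python
import math

def spherical_to_cartesian(r, elevation_deg, azimuth_deg):
    """Convert (distance, elevation, azimuth) to (x, y, z), microphone at the origin.

    Assumed convention: azimuth is measured in the horizontal plane from the
    x-axis, elevation is measured upward from the horizontal plane.
    """
    theta = math.radians(elevation_deg)
    phi = math.radians(azimuth_deg)
    x = r * math.cos(theta) * math.cos(phi)
    y = r * math.cos(theta) * math.sin(phi)
    z = r * math.sin(theta)
    return x, y, z

def cartesian_to_spherical(x, y, z):
    """Inverse conversion; azimuth is returned in [0, 360) degrees."""
    r = math.sqrt(x * x + y * y + z * z)
    elevation = math.degrees(math.asin(z / r)) if r > 0 else 0.0
    azimuth = math.degrees(math.atan2(y, x)) % 360.0
    return r, elevation, azimuth

# A source 2 m away, 30 degrees above the horizontal plane, at 225 degrees azimuth.
position = spherical_to_cartesian(2.0, 30.0, 225.0)
recovered = cartesian_to_spherical(*position)
```

Dropping r and keeping only (elevation, azimuth) gives the direction of arrival discussed in the rest of this section.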

Point vs diffuse sources: A point source is a sound source whose spatial location can be identified as a single point in space; if the sound instead comes from a wider region in space, it is referred to as a diffuse source. Most sound events in real-life scenarios are point sources, such as a human speaker in front of you or a bird calling from a tree. However, imagine you are standing right in front of an idling car: the car becomes a diffuse source because the sound no longer comes from a single point but from a wide region in space. The same car, at a distance from you, can be treated as a point source. So this too is a relative notion that depends on the position of the microphone or the listener.

Direction of arrival (DOA) estimation

Some popular DOA estimators are based on time-difference-of-arrival (TDOA), steered-response-power (SRP), multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT). These methods vary in algorithmic complexity, compatibility with different microphone array structures, and the assumptions they make about the acoustic scenario. Subspace methods such as ESPRIT and MUSIC are generic to array structures and produce high-resolution DOA estimates. However, they require a good estimate of the number of active sources in order to estimate the corresponding DOAs, and this information is not always available. Furthermore, their performance in low-SNR and reverberant scenarios is poor. An example of DOA estimation using the MUSIC algorithm is visualized in the video above.
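To make the TDOA family concrete, here is a minimal sketch of one common TDOA estimator, the generalized cross-correlation with phase transform (GCC-PHAT), for a single pair of microphones. It is not the MUSIC method shown in the video. The function names, the far-field assumption, the 343 m/s speed of sound and the 0.2 m microphone spacing are all assumptions chosen for illustration.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref` with GCC-PHAT."""
    n = len(sig) + len(ref)              # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

def tdoa_to_angle(tau, mic_distance, c=343.0):
    """Far-field angle (degrees from broadside) for a two-microphone pair."""
    return float(np.degrees(np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0))))

# Synthetic check: the second channel is the first one delayed by 5 samples.
rng = np.random.default_rng(0)
fs = 16000
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref))
tau = gcc_phat_tdoa(sig, ref, fs)
angle = tdoa_to_angle(tau, mic_distance=0.2)  # hypothetical 20 cm spacing
```

A single pair only resolves one angle with a front-back ambiguity; practical TDOA systems combine several microphone pairs.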

To overcome some of the above drawbacks, more recent work has studied deep neural network (DNN) based DOA estimation. Implementing DOA estimation with deep learning additionally enables its integration into end-to-end sound analysis and detection systems. These DNN-based methods have been shown to perform as well as or better than the non-DNN-based methods mentioned above in reverberant scenarios. Further, a main advantage of DNN-based methods is their ability to learn the number of active sources directly from the data. On the other hand, DNN-based methods, unlike the non-DNN-based methods, require sufficient training data.

Deep learning (DNN) based DOA estimation

There are two broad approaches for DNN-based DOA estimation - classification and regression.

DOA estimation as a classification task: in this task you quantize the space around you, treat each quantized region as a separate class, and estimate the probability of an active source in each region. In the illustration above, the azimuth space is quantized into eight directions (0 to 7). The sound sources, the car and the man, are active at directions 5 and 7, which correspond to 225° and 315° respectively. The output for this scene is represented using one-hot style labels as shown (strictly, multi-hot, since several directions can be active at once), with ones for directions 5 and 7 and zeros for the remaining directions. Based on this approach, we proposed one of the earliest DNN-based methods, DOAnet, which could localize sound events over the complete azimuth and elevation range. As training data, we used raw magnitude and phase spectrograms extracted from each channel of the multichannel audio, and labels similar to the above illustration but at 10° resolution (resulting in 36 quantized regions in azimuth and 18 in elevation). As the DNN model we employed a convolutional recurrent neural network, trained as a multiclass multilabel classification task with a binary cross-entropy loss. The video below visualizes the performance of this model. The person in this video (not visible) walks around the microphone and announces his position relative to the microphone.
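The label format described above can be sketched as follows. This is only an illustration of the encoding and the loss, not the DOAnet code; the function names are hypothetical, and the choice that class k is centered at k x (360 / number of classes) degrees is an assumption that matches the eight-direction example (direction 5 at 225°, direction 7 at 315°).

```python
import numpy as np

def azimuths_to_multi_hot(azimuths_deg, num_classes=8):
    """Encode the active azimuths as a multi-hot vector over quantized directions.

    Assumption: class k is centered at k * (360 / num_classes) degrees, and each
    azimuth is assigned to the nearest class center.
    """
    step = 360.0 / num_classes
    labels = np.zeros(num_classes, dtype=np.float32)
    for az in azimuths_deg:
        idx = int(round((az % 360.0) / step)) % num_classes
        labels[idx] = 1.0
    return labels

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all output directions."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=np.float64)
    return float(-np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)))

# Car at 225 degrees and man at 315 degrees, as in the illustration.
labels = azimuths_to_multi_hot([225.0, 315.0])

# Hypothetical network output: per-direction probabilities from sigmoid units.
probs = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.8])
loss = binary_cross_entropy(probs, labels)
```

Each output is treated as an independent yes/no decision, which is why a binary cross-entropy is used rather than a softmax over directions.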

The classification-based DOA estimation approach has some drawbacks. First, the localization performance for DOAs not seen during training is unknown. Second, the DNN output dimension grows with resolution. For example, a 1° resolution in azimuth requires a 360-dimensional DNN output, and a 1° resolution in both azimuth and elevation requires a 360 x 180 = 64800-dimensional output. Training a DNN with such a large output dimension is challenging. Third, for the model to learn each location we need sufficient examples for each output location, so the dataset size increases rapidly with higher spatial resolution. Finally, the number of active sources at any given time might only be in the range of 0 to 5, so the number of negative examples in the output (of, say, 360 azimuth dimensions) is significantly larger than the number of positive examples, which leads to the challenges of an imbalanced dataset.
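The arithmetic behind these two drawbacks is easy to reproduce. The helper names below are hypothetical, and the sketch assumes a uniform grid over 360° of azimuth and 180° of elevation.

```python
def output_dimension(resolution_deg, with_elevation=False):
    """Number of classifier outputs for a uniform angular grid."""
    n_azimuth = round(360 / resolution_deg)
    n_elevation = round(180 / resolution_deg) if with_elevation else 1
    return n_azimuth * n_elevation

def positive_fraction(active_sources, num_outputs):
    """Fraction of output units that are positive in a single frame."""
    return active_sources / num_outputs

dims_azimuth_1deg = output_dimension(1)                      # azimuth only
dims_full_1deg = output_dimension(1, with_elevation=True)    # azimuth and elevation
dims_full_10deg = output_dimension(10, with_elevation=True)  # the 10 degree grid used earlier
imbalance = positive_fraction(5, dims_azimuth_1deg)          # at most 5 active sources
```

Even with five simultaneous sources, fewer than 2% of the 360 azimuth outputs are positive in a frame.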

DOA estimation as a regression task: in this task, you directly estimate the DOA values as shown in the illustration above. The acoustic model in the illustration estimates up to three azimuth values using three regressor outputs; this output dimension is chosen based on the maximum number of sources you want to localize at a given time. Further, since the number of sound events varies over time, you have to train the model to generate a default output ('def' in the illustration) when fewer sources are active than the maximum supported. This default value is generally a number on the same scale as your output but outside its valid range; for example, since the illustration only estimates azimuth angles, which range from 1° to 360°, you could choose a default value of 400°. Finally, you train this model using the mean squared error (MSE) loss between the reference and the prediction. One catch in this approach is that the regressor outputs can be shuffled with respect to the reference, and you have to handle this case. For example, in the same illustration, if the model estimates are [222, 318, def], they are still correct but are ordered differently from the reference. A couple of DOA estimation methods have been proposed as a regression task; however, they only studied the localization of one source at a time (paper 1, paper 2). More recently, we proposed the SELDnet method to estimate the DOAs of multiple sources using the regression approach. However, this method also predicts the sound event classes; we will discuss it further in the next section on sound event localization and detection and tracking.
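One possible way to handle shuffled regressor outputs is to evaluate the MSE under every ordering of the reference and keep the smallest value. The sketch below illustrates that idea under stated assumptions; it is not the loss used by SELDnet or by the cited papers. The function names, the reference ordering in the example and the 400° default are chosen for illustration, and angular wrap-around (for example 359° versus 1°) is deliberately ignored.

```python
from itertools import permutations

import numpy as np

DEFAULT_AZIMUTH = 400.0  # assumed "no source" value, outside the valid azimuth range

def pad_reference(azimuths_deg, max_sources=3, default=DEFAULT_AZIMUTH):
    """Pad the list of active azimuths with the default value up to max_sources."""
    if len(azimuths_deg) > max_sources:
        raise ValueError("more active sources than regressor outputs")
    padding = [default] * (max_sources - len(azimuths_deg))
    return np.array(list(azimuths_deg) + padding, dtype=float)

def plain_mse(prediction, reference):
    """Order-sensitive mean squared error."""
    prediction = np.asarray(prediction, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean((prediction - reference) ** 2))

def permutation_invariant_mse(prediction, reference):
    """Smallest MSE over all orderings of the reference (fine for a few outputs)."""
    reference = np.asarray(reference, dtype=float)
    return min(
        plain_mse(prediction, reference[list(order)])
        for order in permutations(range(len(reference)))
    )

# Two active sources; the reference happens to list them in the opposite order.
reference = pad_reference([315.0, 225.0])
prediction = [222.0, 318.0, DEFAULT_AZIMUTH]

naive_loss = plain_mse(prediction, reference)                    # penalizes the shuffle
matched_loss = permutation_invariant_mse(prediction, reference)  # ignores the ordering
```

The exhaustive search grows factorially with the number of outputs, so it is only practical for the small output sizes discussed here.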

The main advantage of a regression-based approach for DOA estimation is that you get high-resolution, continuous DOA estimates and can further estimate unseen DOAs during inference. Both advantages come with a smaller DNN output size than the classification approach discussed above. However, the maximum number of simultaneous DOAs you can estimate is limited to the number of regressors in the output layer. For example, in the above illustration, if there were four active sources, the method would only estimate three of them.

Sound event localization and tracking

Sound events in real life are not always stationary; they can move through space at varying speeds. Hence, in addition to localizing them at a given time instant, you also have to track their location over time. This is referred to as sound event localization and tracking, or simply sound event tracking. Kalman and particle filtering are the most popular methods for such tracking.

The general framework for tracking is to first estimate the DOAs of all active sources in each time frame using your favorite DOA estimation algorithm. These frame-wise DOA estimates are the input to the Kalman or particle filter, which then forms the association between DOAs in neighbouring frames and gives us the different sound event trajectories. The output of this tracking framework is visualized in the bottom-most subplot (MUSIC + Particle filter) of the figure below. The green lines are the reference sound event trajectories along the azimuth angle. The blue cross marks are the frame-wise DOA estimates of the MUSIC algorithm. For this example, the number of active sources required by MUSIC is taken directly from the dataset reference. These frame-wise DOAs are then processed by a particle filter to provide the final sound event trajectory estimates, shown in red.

The basic idea of these trackers is to use the current DOA position and the accumulated knowledge of previous DOA positions to estimate the future DOA position. An implementation of one such algorithm, which tracks multiple, unknown numbers of sound sources using a particle filter, is made publicly available here. Specifically, this repository shows how to use the Rao-Blackwellized particle filtering for tracking an unknown number of 2D targets proposed by Simo Särkkä et al.; the original documentation for the method can be read here.
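The predict-then-correct idea can be illustrated with a much simpler tracker than the Rao-Blackwellized particle filter: a constant-velocity Kalman filter that smooths the frame-wise azimuth of a single source. This is a minimal sketch under strong assumptions (one source that is always active, no data association, no wrap-around at 0°/360°); the function name and the noise variances are hypothetical values that would need tuning.

```python
import numpy as np

def kalman_track_azimuth(measurements_deg, dt=1.0, process_var=0.01, meas_var=25.0):
    """Smooth frame-wise azimuth estimates of one source with a constant-velocity model.

    State is [azimuth, angular velocity]. Assumes a single, always-active source
    and a trajectory that does not cross the 0/360 degree boundary.
    """
    z_all = np.asarray(measurements_deg, dtype=float)
    F = np.array([[1.0, dt], [0.0, 1.0]])           # state transition
    H = np.array([[1.0, 0.0]])                      # only the azimuth is observed
    Q = process_var * np.array([[dt**4 / 4, dt**3 / 2],
                                [dt**3 / 2, dt**2]])  # process noise
    R = meas_var                                    # measurement noise variance

    x = np.array([z_all[0], 0.0])                   # start at the first measurement
    P = np.diag([meas_var, 100.0])                  # velocity is initially uncertain
    track = [x[0]]
    for z in z_all[1:]:
        # Predict from accumulated knowledge of previous positions.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct with the current frame-wise DOA estimate.
        innovation = z - (H @ x)[0]
        S = (H @ P @ H.T)[0, 0] + R
        K = (P @ H.T)[:, 0] / S
        x = x + K * innovation
        P = (np.eye(2) - np.outer(K, H[0])) @ P
        track.append(x[0])
    return np.array(track)

# Synthetic example: a source moving at 2 degrees per frame, observed with noise.
rng = np.random.default_rng(0)
truth = 10.0 + 2.0 * np.arange(100)
noisy = truth + rng.normal(scale=5.0, size=truth.shape)
smoothed = kalman_track_azimuth(noisy)
```

Tracking several sources additionally requires deciding which measurement belongs to which track and when tracks start and end, which is the part the particle filter approach above addresses.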

Deep learning (DNN) based sound event tracking

The basic idea behind the Kalman and particle filters is identical to that of recurrent neural networks, i.e., the current DOA output is influenced by both current and previous DOA inputs. So recurrent layers such as gated recurrent units (GRU) or long short-term memory (LSTM) are an ideal DNN-based replacement for these traditional trackers. If you want to read a more theoretical paper on the relationship between the traditional trackers and RNNs you can refer to this article.

So ideally, any DNN-based DOA estimator that has recurrent layers as part of its architecture should be able to track sound events when trained with dynamic scene datasets (consisting of moving sources). For instance, we trained the regression-based DOA estimation method SELDnet, which includes recurrent layers, with a dynamic scene dataset, and the results are shown in the figure above. We can observe that the tracking performance of SELDnet (center subplot) is comparable to that of the traditional particle-filter-based tracking (bottom subplot). The green lines in the illustration are the reference trajectories, and the red lines are the estimated trajectories. The key difference, however, is that SELDnet infers the number of active sources directly from the input acoustic features and does not use the knowledge of the number of active sources that the MUSIC + particle filter approach relies on. The implementation of SELDnet and the corresponding dynamic scene datasets are made publicly available here.

Further, both the Kalman and particle filters require specific knowledge of the sound scene, such as the spatial distribution of sound events, their respective velocity ranges when active, and their probability of birth and death. Such concepts are not explicitly modeled in recurrent layers; rather, the layers learn equivalent information directly from the input acoustic features and the corresponding target outputs in the development dataset. In fact, recurrent layers have been shown to work as generic trackers that can learn temporal associations of the target source from any sequential input features. Unlike particle filters, which only work with conceptual representations such as frame-wise multiple DOAs, recurrent layers work seamlessly with both conceptual and acoustic features. Further, recurrent layers do not need the complicated task-specific tracker or feature engineering required by the traditional trackers.