Sound event detection (SED) is the task of recognizing the sound events in a recording along with their respective temporal start and end times. Sound events in real life do not always occur in isolation, but tend to overlap considerably with each other. Recognizing such overlapping sound events is referred to as polyphonic SED. Performing polyphonic SED using monochannel audio is a challenging task, and these overlapping sound events can potentially be recognized better with multichannel audio. This repository supports both single- and multichannel versions of polyphonic SED and is referred to as SEDnet hereafter.
This method was first proposed in 'Sound event detection using spatial features and convolutional recurrent neural network'. It also won the DCASE 2017 real-life sound event detection task. We are releasing a simple, vanilla version of the code here, without many frills.
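As a small illustration of the polyphonic SED output described above, the following sketch converts a binary frame-by-class activity matrix into a list of events with start and end times. It is not taken from the SEDnet code: the function name `activity_to_events`, the matrix layout (frames x classes), and the `hop_s` frame-hop parameter are assumptions made for this example only.

```python
import numpy as np


def activity_to_events(activity, hop_s):
    """Convert a binary (frames x classes) activity matrix into events.

    Hypothetical helper, not part of SEDnet. Returns a list of
    (class_index, onset_seconds, offset_seconds) tuples.
    """
    activity = np.asarray(activity).astype(int)
    n_frames, n_classes = activity.shape
    events = []
    for c in range(n_classes):
        # Pad with zeros so that events touching the first or last frame
        # still produce both an onset and an offset.
        padded = np.concatenate(([0], activity[:, c], [0]))
        change = np.diff(padded)
        onsets = np.flatnonzero(change == 1)    # 0 -> 1 transitions
        offsets = np.flatnonzero(change == -1)  # 1 -> 0 transitions
        for on, off in zip(onsets, offsets):
            events.append((c, float(on * hop_s), float(off * hop_s)))
    return events


# Two classes that overlap in frames 1 and 2 (a polyphonic segment).
activity = np.array([
    [1, 0],
    [1, 1],
    [1, 1],
    [0, 1],
    [0, 0],
])
events = activity_to_events(activity, hop_s=0.5)
print(events)
```

Because each class is tracked in its own column, overlapping events are reported independently, which is what distinguishes polyphonic from monophonic SED.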
Sound event localization, detection, and tracking (SELDT) is the combined task of identifying the temporal onset and offset of a sound event, tracking its spatial location while it is active, and associating a textual label that describes the sound event. We first presented the SELDnet for static scenes with spatially stationary sources here. Thereafter, we presented the performance of SELDnet on dynamic scenes with sources moving at different angular velocities here. We observed that the recurrent layers are crucial for tracking sources, and that SELDnet achieves tracking performance comparable to Bayesian trackers such as the RBMCDA particle filter. We are releasing a simple, vanilla version of the code, without many frills, and the related datasets here.
This repository shows how to use Rao-Blackwellized particle filtering for tracking an unknown number of 2D targets, as proposed by Simo Särkkä et al. The original documentation for the method can be read here. Specifically, this script is a modified version of the original script by Särkkä et al., adapted to the real-world example of tracking an unknown number of (possibly multiple) sound sources in the complete 2D space represented by azimuth and elevation angles, also referred to as direction of arrival (DOA) estimation.
This work was used as a baseline to compare the performance of a deep neural network (DNN) for tracking multiple moving sources.
Sound event localization and detection (SELD) is the combined task of identifying the temporal onset and offset of a sound event, tracking its spatial location while it is active, and associating a textual label that describes the sound event. As part of DCASE 2019, we organized the SELD task with a multi-room reverberant dataset synthesized using real-life impulse responses (IRs) collected in five different environments. This GitHub page shares the benchmark method, SELDnet, and the dataset for the task. The paper describing the SELDnet can be found on IEEE Xplore and on arXiv. The dataset, baseline method, and benchmark scores are described in the task paper available here.
When we developed the initial sound event localization and detection approach, no metric existed that jointly measured localization and detection performance, so we evaluated the two with separate metrics. This was not a fair evaluation of such a method, hence we proposed a set of new metrics for the joint measurement of localization and detection. This code repository implements the new metrics and compares them with the earlier ones on a test set of outputs from the original method.
We extended the SELD task in DCASE 2020 to dynamic scenes, with both moving and stationary sound events. The baseline method for this task differs from the original SELDnet method in the following ways.
Features: The original SELDnet employed naive phase and magnitude components of the spectrogram as the input feature for all input formats of audio. In this baseline method, we use separate features for first-order Ambisonic (FOA) and microphone array (MIC) datasets. As the interaural level difference feature, we employ the 64-band mel energies extracted from each channel of the input audio for both FOA and MIC. To encode the interaural time difference features, we employ intensity vector features for FOA, and generalized cross-correlation features for MIC.
Loss/Objective: The original SELDnet employed mean square error (MSE) as the DOA estimation loss, and this was computed irrespective of the presence or absence of the sound event. In the current baseline, we use a masked-MSE, which computes the MSE only when the sound event is active in the reference.
Evaluation metrics: The performance of the original SELDnet was evaluated with stand-alone metrics for detection and localization, mainly because there was no suitable metric that could jointly evaluate the performance of localization and detection. Since then, we have proposed a new metric that can jointly evaluate the performance, and we employ this new metric for evaluation here.
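The masked-MSE idea from the Loss/Objective item above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the baseline's training code: the function name `masked_mse`, the array shapes (frames x classes x 3 Cartesian coordinates), and the choice to normalize by the number of active entries are all assumptions made for this example.

```python
import numpy as np


def masked_mse(doa_pred, doa_ref, active_ref):
    """MSE over DOA outputs, counted only where the reference event is active.

    Hypothetical helper. Assumed shapes:
      doa_pred, doa_ref: (frames, classes, 3) Cartesian DOA vectors
      active_ref:        (frames, classes) binary reference activity
    """
    doa_pred = np.asarray(doa_pred, dtype=float)
    doa_ref = np.asarray(doa_ref, dtype=float)
    # Add a trailing axis so the mask broadcasts over the coordinates.
    mask = np.asarray(active_ref, dtype=float)[..., np.newaxis]
    squared_error = (doa_pred - doa_ref) ** 2 * mask
    # Normalize by the number of active coordinates (assumed convention);
    # guard against division by zero when nothing is active.
    n_terms = mask.sum() * doa_pred.shape[-1]
    return float(squared_error.sum() / max(n_terms, 1.0))


# Two frames, one class. The event is active only in the first frame,
# so the large error in the second frame does not contribute.
pred = np.array([[[1.0, 0.0, 0.0]], [[5.0, 5.0, 5.0]]])
ref = np.zeros((2, 1, 3))
active = np.array([[1], [0]])

loss_masked = masked_mse(pred, ref, active)
loss_plain = float(np.mean((pred - ref) ** 2))
print(loss_masked, loss_plain)
```

In this toy case the plain MSE is dominated by the frame in which the reference event is inactive, whereas the masked version ignores it, which is the motivation stated above.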
We extended the SELD task in DCASE 2021 to dynamic scenes with directional interference noise. In comparison to the DCASE 2020 baseline, which employed a multi-task output, we simplified the model architecture by employing the ACCDOA format, which encodes both SED and DOA information in a single output.
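To make the single-output idea concrete, the sketch below encodes each class as a 3-D Cartesian vector whose direction is the DOA and whose length is the activity, and then decodes it again. It is a minimal illustration rather than the baseline implementation: the function names, the azimuth/elevation axis convention, and the 0.5 activity threshold are assumptions for this example.

```python
import numpy as np


def encode_accdoa(active, azimuth_deg, elevation_deg):
    """Build per-class ACCDOA-style vectors: unit DOA scaled by activity.

    Hypothetical helper. All inputs have shape (classes,).
    Assumed convention: x = cos(el)cos(az), y = cos(el)sin(az), z = sin(el).
    """
    az = np.deg2rad(np.asarray(azimuth_deg, dtype=float))
    el = np.deg2rad(np.asarray(elevation_deg, dtype=float))
    unit = np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)],
        axis=-1,
    )
    return np.asarray(active, dtype=float)[..., np.newaxis] * unit


def decode_accdoa(accdoa, threshold=0.5):
    """Recover activity and angles from ACCDOA-style vectors.

    A class is taken as active when its vector length exceeds `threshold`
    (an assumed value). Angles of inactive classes are not meaningful.
    """
    accdoa = np.asarray(accdoa, dtype=float)
    active = np.linalg.norm(accdoa, axis=-1) > threshold
    x, y, z = accdoa[..., 0], accdoa[..., 1], accdoa[..., 2]
    azimuth = np.rad2deg(np.arctan2(y, x))
    elevation = np.rad2deg(np.arctan2(z, np.hypot(x, y)))
    return active, azimuth, elevation


# Two classes: the first is active at azimuth 90, elevation 30 degrees;
# the second is inactive and therefore maps to the zero vector.
vectors = encode_accdoa([1, 0], [90.0, 0.0], [30.0, 0.0])
active, azimuth, elevation = decode_accdoa(vectors)
print(vectors.shape, active, azimuth[0], elevation[0])
```

With this representation a model needs only one regression output per class, instead of separate SED and DOA branches.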
We extended the SELD task in DCASE 2022 to real-life recordings. We extended the DCASE 2021 baseline to use Multi-ACCDOA, which enables the detection of multiple instances of the same class. Additionally, we added support for SALSAlite features, which have been shown to improve SELD performance for microphone array input.