Sound event detection

Sound event detection (SED) is the joint task of recognizing the sound event classes present in an audio recording and identifying the start and end times of each instance of those events. For example, in the figure above, the top subplot shows the time-domain audio signal, and the bottom subplot shows the sound event activity present in the audio. We see three sound event classes (speech, car, and bird) active at different times in the audio.

In real life, sound scenes are quite complex, with different combinations of sound events overlapping with each other at the same time. We can observe this overlapping in 'frame t' of the figure above, where all three sound event classes are active at the same time. The task of SED in such complex sound scenes with overlapping sound events is referred to as polyphonic sound event detection.

Most recent methods for polyphonic SED employ deep-learning approaches that learn to perform SED as a supervised multiclass, multilabel classification task. This means that to train these methods we need datasets that provide both the audio and the corresponding sound event activity annotations, as shown in the figure above. The general framework for building SED methods is as follows. Given the single-channel audio from the dataset, we first extract relevant acoustic features of dimension T×F, where T is the number of time frames and F is the feature dimension. Most often, spectral features such as mel-band energies are employed as the acoustic features. A deep learning method then maps these features to a T×C output matrix of sound event activity, where C is the number of sound event classes and T is the number of input frames. Generally, to support multiclass, multilabel classification, the output layer of the network employs a sigmoid activation function, and the network is trained using the binary cross-entropy loss.
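
To make the shapes concrete, here is a minimal sketch of this feature-and-target setup using librosa. The file name, frame settings, and number of mel bands are illustrative assumptions, not the exact configuration used in any of the papers mentioned here.

```python
# Minimal sketch of the T x F features and T x C targets described above.
# File name and parameter values are illustrative assumptions.
import numpy as np
import librosa

audio, sr = librosa.load("example.wav", sr=44100, mono=True)

# T x F acoustic features: log mel-band energies
F = 40  # number of mel bands (assumed)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048, hop_length=1024, n_mels=F)
features = np.log(mel + 1e-10).T            # shape: (T, F)

# T x C target matrix of sound event activity
C = 3                                       # e.g. speech, car, bird
T = features.shape[0]
targets = np.zeros((T, C), dtype=np.float32)
# targets[t, c] = 1.0 wherever the annotation marks class c as active in frame t
```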

At the time of writing this post, convolutional recurrent neural networks (CRNNs) were the state-of-the-art recipe for polyphonic SED. The architecture consists of stacked convolutional layers followed by recurrent and fully-connected layers. The motivation for this architecture is that the initial convolutional layers extract shift-invariant features from the input acoustic features and reduce the feature dimension. These are then fed to recurrent layers, which specialize in learning temporal structure; in the SED scenario, the recurrent layers were shown to improve the detection of sound event onsets and offsets. Finally, a fully-connected layer is used as the classification layer. I have published the code for this method here.
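
The following is a rough PyTorch sketch of such a CRNN: convolutional blocks that pool only along frequency, a recurrent layer for temporal structure, and a sigmoid classification layer producing frame-wise, multilabel outputs. The layer sizes are illustrative assumptions and not the published configuration.

```python
# Rough CRNN sketch for polyphonic SED (layer sizes are assumptions).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=3, rnn_hidden=64):
        super().__init__()
        # Convolutional layers: shift-invariant features, pooling only along frequency
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 5)),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        freq_after_pooling = n_mels // (5 * 4 * 2)   # 40 -> 1
        # Recurrent layer: learns the temporal structure (onsets/offsets)
        self.rnn = nn.GRU(64 * freq_after_pooling, rnn_hidden,
                          batch_first=True, bidirectional=True)
        # Fully-connected classification layer; sigmoid for multilabel output
        self.fc = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, x):                        # x: (batch, T, F)
        x = x.unsqueeze(1)                       # (batch, 1, T, F)
        x = self.cnn(x)                          # (batch, 64, T, F')
        x = x.permute(0, 2, 1, 3).flatten(2)     # (batch, T, 64 * F')
        x, _ = self.rnn(x)                       # (batch, T, 2 * rnn_hidden)
        return torch.sigmoid(self.fc(x))         # (batch, T, C) frame-wise activities

model = CRNN()
loss_fn = nn.BCELoss()   # binary cross-entropy over the T x C outputs
```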

More recent methods have been trying to tackle sound event detection using weakly-labeled data. The motivation here is that collecting and annotating a real-life SED dataset with sound event classes and their corresponding onset-offset times is a tedious task. In comparison, it is much easier to simply listen to the audio and list the active sound event classes without their onset-offset times. Datasets with only this list of active sound event classes are referred to as weakly-labeled data, and the challenge is to develop methods that can learn to perform SED from them. One of the largest publicly available weakly-labeled datasets is Google's AudioSet, on which our proposed method for jointly learning to perform SED and sound event classification obtained state-of-the-art results at the time of publishing. This work was carried out during my internship at Facebook.
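
As a rough illustration of how weak-label training is commonly set up (a generic recipe, not necessarily the exact approach in our paper), the frame-wise predictions of an SED model can be pooled over time into a clip-level prediction, and the binary cross-entropy loss is then applied against the clip-level labels.

```python
# Hedged sketch of one common weak-label recipe: pool frame-wise predictions
# into a clip-level prediction and supervise only at the clip level.
import torch
import torch.nn as nn

frame_probs = torch.rand(8, 500, 3)                  # (batch, T, C), e.g. from the CRNN above
clip_labels = torch.randint(0, 2, (8, 3)).float()    # weak labels: which classes occur in each clip

# Max pooling over time: a clip contains a class if at least one frame does.
# Attention or linear-softmax pooling are common alternatives.
clip_probs = frame_probs.max(dim=1).values           # (batch, C)

loss = nn.BCELoss()(clip_probs, clip_labels)
# At inference time, the frame-wise probabilities still provide onset-offset estimates.
```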

One of the biggest challenges in SED is overlapping sound events. During my Ph.D., I proposed to overcome this by using multichannel audio for SED instead of the traditional single-channel audio. The motivation for this was the binaural hearing of the human auditory system, which can seamlessly detect multiple overlapping sound events. To study this, we first identified that the human auditory system employs the inter-aural intensity difference (IID), the inter-aural time delay (ITD), and perceptual cues to detect such overlapping sound events. Based on this, we proposed acoustic features for binaural audio that represent information similar to the IID, ITD, and perceptual cues. The results showed that using these spatial and perceptual features from binaural audio improves the detection of overlapping sound events compared to using single-channel audio features alone. The proposed binaural features and SED method achieved state-of-the-art results in the DCASE 2017 SED task, and the code for the winning method has been made publicly available here.
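
As a simplified illustration of such binaural cues (not the exact feature extraction from the thesis or paper), one can compute IID-like log-magnitude differences and ITD-like phase differences between the left and right channels, alongside per-channel mel-band energies as the perceptual features. The file name and frame settings below are assumptions.

```python
# Illustrative binaural cues in the spirit of IID and ITD (simplified sketch,
# not the exact published feature extraction).
import numpy as np
import librosa

stereo, sr = librosa.load("binaural_example.wav", sr=44100, mono=False)  # shape: (2, N)
left, right = stereo[0], stereo[1]

n_fft, hop = 2048, 1024
L = librosa.stft(left, n_fft=n_fft, hop_length=hop)     # (1 + n_fft/2, T), complex
R = librosa.stft(right, n_fft=n_fft, hop_length=hop)

# IID-like cue: log-magnitude ratio between the two ears, per time-frequency bin
iid = np.log(np.abs(L) + 1e-10) - np.log(np.abs(R) + 1e-10)

# ITD-like cue: inter-channel phase difference (a time delay appears as a phase slope)
ipd = np.angle(L * np.conj(R))

# Perceptual features: mel-band energies of each channel, as in the single-channel case
mel_left = librosa.feature.melspectrogram(y=left, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
mel_right = librosa.feature.melspectrogram(y=right, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
```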

Motivated by the improvement in detecting overlapping sound events with binaural audio, we were curious whether using more than two channels of audio would improve it further. To study this, we synthesized identical polyphonic sound scenes with single-channel, binaural, and four-channel audio. We employed the state-of-the-art CRNN-based SED method discussed above (code here) to learn the overlapping sound events. The results showed that overlapping sound events were recognized better with multichannel audio than with single-channel audio alone. In this study, we restricted ourselves to a maximum of four channels, since most commercially available 360° recording devices had only four channels.