Demos
This page collects all the demos I helped implement during my time in academia and industry.
Automatic dubbing and lip-sync
Language is a key barrier to content consumption by the masses. If we can build technologies that automatically translate content from one language to another, we can make high-quality educational and entertainment content accessible across the globe. One approach to do this is:
First, run a speech-to-text (ASR) model in the source language to generate a text transcript. You could use any of the publicly available ASR APIs here.
Next, use a source-to-target language translator. To improve the translation, preprocess the ASR output to segment it into phrases/sentences before translating. You could use any of the public translation APIs here.
Thereafter, use a prosodic text-to-speech (TTS) model to generate high-quality expressive speech. Alternatively, you could use any of the publicly available TTS APIs.
Finally, use lip-sync to match the speaker's lip movements to the target-language audio. This can be done with methods such as Wav2Lip.
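To make the pipeline concrete, here is a minimal sketch of how the four stages could be chained together. Every function is a hypothetical placeholder; swap in whichever ASR, translation, TTS, and lip-sync (e.g. Wav2Lip) implementation you prefer.

```python
# Minimal sketch of the four-stage dubbing pipeline described above.
# All functions are hypothetical placeholders for whichever ASR, MT,
# TTS, and lip-sync implementations you plug in.

def transcribe(audio_path: str, src_lang: str) -> str:
    """Stage 1: ASR -- transcribe the source-language speech to text."""
    raise NotImplementedError("plug in a public ASR API here")

def segment_sentences(text: str) -> list[str]:
    """Stage 2a: split the raw ASR output into sentences/phrases so the
    translator sees well-formed units (a very crude splitter)."""
    return [s.strip() for s in text.replace("?", ".").split(".") if s.strip()]

def translate(sentences: list[str], src_lang: str, tgt_lang: str) -> list[str]:
    """Stage 2b: translate each sentence into the target language."""
    raise NotImplementedError("plug in a public translation API here")

def synthesize(sentences: list[str], tgt_lang: str, out_path: str) -> str:
    """Stage 3: render the translated text as expressive speech (TTS)."""
    raise NotImplementedError("plug in a prosodic TTS model or API here")

def lip_sync(video_path: str, dubbed_audio: str, out_path: str) -> str:
    """Stage 4: re-time the speaker's lip movements to the new audio."""
    raise NotImplementedError("plug in a lip-sync model such as Wav2Lip")

def dub(video_path: str, audio_path: str, src: str, tgt: str) -> str:
    text = transcribe(audio_path, src)
    sentences = segment_sentences(text)
    translated = translate(sentences, src, tgt)
    dubbed_audio = synthesize(translated, tgt, "dubbed.wav")
    return lip_sync(video_path, dubbed_audio, "dubbed_lipsynced.mp4")
```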
Check out an example of automatic dubbing with lip-sync below.
Original content in English
Automatically dubbed to Hindi
Automatically dubbed to Hindi with lip-sync
Audio Captioning
Audio captioning is the task of summarizing the acoustic content of an audio recording in grammatical text. It was the brainchild of my colleague Konstantinos Drossos, with whom I formulated the task and proposed an initial approach at WASPAA 2017. The results of this approach are presented on this demo page. Thereafter, Kostas formalized the task as a research challenge in DCASE 2020.
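For a flavour of how such a system can be structured, below is a minimal sketch of a generic encoder-decoder captioner in PyTorch: a recurrent encoder summarizes log-mel frames and a recurrent decoder emits the caption one word at a time. This is illustrative only, not the exact architecture of our WASPAA 2017 approach.

```python
# Minimal sketch of a generic encoder-decoder audio captioner.
# Illustrative only; not the exact WASPAA 2017 architecture.
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, hidden=256, vocab_size=5000):
        super().__init__()
        # Encoder: summarize the sequence of log-mel frames.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # Decoder: generate the caption conditioned on the audio summary.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, mels, captions):
        # mels: (batch, time, n_mels); captions: (batch, words) token ids
        _, h = self.encoder(mels)       # h: (1, batch, hidden) audio summary
        emb = self.embed(captions)      # (batch, words, hidden)
        dec, _ = self.decoder(emb, h)   # decode, seeded with the summary
        return self.out(dec)            # (batch, words, vocab_size) logits
```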
Automatic Singing Voice Detection in Polyphonic Audio
During my time at SensiBol, where we were building a singing-evaluation platform, I developed an automatic singing voice detection module as a pre-processing step. The experience of using this module and its performance are demonstrated in this video.
Real-time voice effects
During my time at SensiBol, I implemented some basic PSOLA-, LPC-, and reverberation-based voice effects. Examples of these effects are given below and can also be heard on the SensiBol website; a minimal reverb sketch follows the audio examples. I further implemented real-time versions of these effects with a latency of under 20 ms and ported them into the MikeL app (it was available for both Android and Apple devices, but has since been taken down). A quick demo video of the app can be found here.
User singing recording without any effects
User singing with helium-gas effect (LPC based)
User singing with chipmunk effect (PSOLA based)
User singing with reverb effect
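Since the SensiBol implementations are not public, here is a minimal sketch of the reverb family of effects: a classic Schroeder reverberator, with parallel feedback comb filters followed by an allpass, assuming mono float audio. The delay times and gains are illustrative.

```python
# Minimal Schroeder-style reverb sketch (not SensiBol's implementation).
# Assumes x is mono float audio sampled at sr Hz.
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = np.copy(x)
    for n in range(delay, len(y)):
        y[n] += g * y[n - delay]
    return y

def allpass(x, delay, g):
    """Schroeder allpass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def reverb(x, sr=44100):
    # Mutually detuned comb delays (in seconds) spread the echo density.
    wet = sum(comb(x, int(sr * t), 0.75) for t in (0.0297, 0.0371, 0.0411, 0.0437))
    wet = allpass(wet / 4.0, int(sr * 0.005), 0.7)
    return 0.7 * x + 0.3 * wet  # dry/wet mix
```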
Singing correction
During my time at SensiBol, I implemented a singing-correction algorithm that automatically corrects the user's singing in both time and pitch; we referred to this effect as singing correction. It was implemented using the PSOLA algorithm. Time correction aligns the user's singing onsets with the ground-truth onsets. For pitch correction, snapping the pitch to the nearest musical note sounds very artificial; this kind of effect is also known as auto-tune. Correcting the pitch to the original singer's pitch contour instead sounds very natural. Check the examples below or on the SensiBol website; a small sketch of the two pitch-correction targets follows the examples.
Original singer
User recording: auto-tuned
User recording
User recording: singing correction
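To illustrate the difference between the two pitch-correction targets, here is a small sketch operating on a per-frame pitch track in Hz (0 marks unvoiced frames). This only computes the target contour; the actual resynthesis at the corrected pitch was done with PSOLA.

```python
# Sketch of the two pitch-correction targets described above.
# f0_hz: per-frame pitch estimates in Hz, 0 for unvoiced frames.
import numpy as np

def snap_to_semitone(f0_hz):
    """Auto-tune style: quantize each voiced frame to the nearest
    equal-tempered semitone (A4 = 440 Hz). Sounds artificial."""
    f0 = np.asarray(f0_hz, dtype=float)
    out = np.copy(f0)
    voiced = f0 > 0
    midi = 69 + 12 * np.log2(f0[voiced] / 440.0)
    out[voiced] = 440.0 * 2.0 ** ((np.round(midi) - 69) / 12.0)
    return out

def follow_reference(f0_hz, ref_f0_hz):
    """Natural-sounding variant: replace the user's contour with the
    original singer's time-aligned contour where both are voiced."""
    f0 = np.asarray(f0_hz, dtype=float)
    ref = np.asarray(ref_f0_hz, dtype=float)
    out = np.copy(f0)
    both = (f0 > 0) & (ref > 0)
    out[both] = ref[both]
    return out
```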
Automatic speech recognition on embedded devices in Hindi and English
During my time at SensiBol, I was mainly involved in developing speech solutions. As part of this work, I implemented a WFST-based GMM-HMM speech decoder from scratch, enabling the ASR engine to be ported to and optimized for embedded devices. The decoder was successfully tuned and used for the tasks listed below.
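For flavour, here is a heavily simplified sketch of the token-passing Viterbi beam search at the heart of such a decoder. It is not the actual implementation: a real decoder also handles non-emitting (epsilon) arcs, backtracking to recover the word sequence, and acoustic/graph score scaling.

```python
# Simplified token-passing Viterbi beam search over a decoding graph.
import math

def decode(graph, start, frames, beam=10.0):
    """graph: {state: [(next_state, graph_cost, pdf_id), ...]}, with all
    arcs emitting; frames: one {pdf_id: acoustic cost} dict per step."""
    tokens = {start: 0.0}                       # state -> best cost so far
    for am in frames:
        new_tokens = {}
        for state, cost in tokens.items():
            for nxt, w, pdf in graph.get(state, []):
                if pdf not in am:               # no acoustic score this frame
                    continue
                c = cost + w + am[pdf]          # graph cost + acoustic cost
                if c < new_tokens.get(nxt, math.inf):
                    new_tokens[nxt] = c
        if not new_tokens:
            return None                         # every hypothesis was pruned
        best = min(new_tokens.values())
        # Beam pruning: keep only tokens within `beam` of the current best.
        tokens = {s: c for s, c in new_tokens.items() if c <= best + beam}
    return min(tokens.items(), key=lambda kv: kv[1])
```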
Hindi wake-up-word and voice-command detection
The phone continuously listens for the wake-up word 'bol SensiBol'. As soon as it hears the wake-up word, it expects a voice command. In this video example, we control the playback of a music player using voice commands in Hindi.
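The control flow is roughly the following sketch, assuming a hypothetical listen() generator that yields decoded phrases from the decoder; the Hindi command words are illustrative, not the actual shipped grammar.

```python
# Sketch of the wake-up-word -> voice-command loop. `listen` is a
# hypothetical generator of decoded phrases; the Hindi command words
# are illustrative placeholders.
WAKE_WORD = "bol sensibol"
COMMANDS = {"chalao": "play", "roko": "pause", "agla": "next"}

def control_loop(listen, player):
    armed = False
    for phrase in listen():
        if not armed:
            # Stay passive until the wake-up word is heard.
            armed = (phrase.lower() == WAKE_WORD)
        else:
            # The utterance right after the wake-up word is the command.
            action = COMMANDS.get(phrase)
            if action is not None:
                getattr(player, action)()   # e.g. player.play()
            armed = False                   # re-arm for the next wake-up word
```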
Lyric-based scoring of singing
Rate the user's singing performance based on lyric pronunciation.
Spoken language evaluation
Rate children's pronunciation skills in a story-reciting app, with feedback at the word level.
Voice-based multiple choice questionnaire (MCQ)
An MCQ is presented to the user, and each answer is graded as correct, wrong, or out-of-vocabulary (if the user utters something outside the given choices).
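One simple way to implement this grading is sketched below. Here score() is a hypothetical function returning the decoder's likelihood of the utterance against a choice's word model, and a garbage (filler) model absorbs anything outside the given choices.

```python
# Sketch of MCQ answer grading. `score` is a hypothetical function
# returning a (log-)likelihood of the utterance under a given model;
# `garbage_model` is a filler model for out-of-vocabulary speech.

def grade(utterance, choices, correct_choice, score, garbage_model, margin=0.0):
    scores = {c: score(utterance, c) for c in choices}
    best = max(scores, key=scores.get)
    # If the garbage model beats every choice, the user said something
    # outside the given options.
    if score(utterance, garbage_model) > scores[best] + margin:
        return "out-of-vocabulary"
    return "correct" if best == correct_choice else "wrong"
```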
NOTE: The applications and interfaces for all the work done at SensiBol were built by my awesome colleagues. I only claim to have implemented the audio DSP modules behind them.