Demos

This page collects the demos I helped implement during my time in academia and industry.

Automatic dubbing and lip-sync

Language is a key barrier to content consumption. If we can build technologies that automatically translate content from one language to another, we can make high-quality educational and entertainment content accessible across the globe. One approach, sketched in code after this list, is to:

  • First, use a speech-to-text (ASR) model in the source language to generate a transcript. You could use any of the publicly available ASR APIs here.

  • Next, use a source-to-target language translator. To improve the translation, preprocess the ASR output to identify phrases/sentences before translating. You could use any of the public translation APIs here.

  • Thereafter, use a prosodic text-to-speech (TTS) model to generate high-quality expressive speech. Alternatively, you could use the publicly available TTS APIs.

  • Finally, use lip-sync to match the source speaker's lip movements to the target-language speech. This can be done using methods such as Wav2Lip.
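
A minimal sketch of this chain in Python is shown below. All four stage functions are assumptions, placeholders for whichever ASR, translation, TTS, and lip-sync (e.g. Wav2Lip) implementations you plug in; only the plumbing between them is shown.

    from typing import Callable, List

    def dub(video_path: str,
            asr: Callable[[str], str],                # video/audio -> source transcript
            split: Callable[[str], List[str]],        # transcript -> sentences/phrases
            translate: Callable[[str], str],          # source sentence -> target sentence
            tts: Callable[[str], str],                # target text -> dubbed audio path
            lip_sync: Callable[[str, str], str]) -> str:
        """Chain the four stages described above; each stage is a callable
        you supply (e.g. a cloud ASR API, a translation API, a prosodic
        TTS API, and Wav2Lip for the final lip-sync)."""
        transcript = asr(video_path)                             # step 1: ASR
        sentences = split(transcript)                            # step 2a: segmentation
        target_text = " ".join(translate(s) for s in sentences)  # step 2b: translation
        dubbed_audio = tts(target_text)                          # step 3: prosodic TTS
        return lip_sync(video_path, dubbed_audio)                # step 4: lip-sync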

Check out an example of automatic dubbing with lip-sync below.

Original content in English

Automatically dubbed to Hindi

Automatically dubbed to Hindi with lip-sync

Audio Captioning

Audio captioning is the task of summarizing the acoustic content of an audio recording using grammatical text. It was the brainchild of my colleague Konstantinos Drossos, with whom I formulated the task and proposed an initial approach at WASPAA 2017. The results of this approach are presented on this demo page. Thereafter, Kostas formalized the task as a research challenge in DCASE 2020.

Automatic Singing Voice Detection in Polyphonic Audio

During my time at SensiBol, where we were building a singing evaluation platform, I developed an automatic singing voice detection module as a pre-processing step. The experience of using this module and its performance are demonstrated in this video.

Real-time voice effects

During my time at SensiBol, I implemented some basic PSOLA-, LPC-, and reverberation-based voice effects. Examples of these effects are given below and can also be heard on the SensiBol website. I further implemented real-time versions of these effects with a latency of under 20 ms and ported them into the MikeL app (it was available for both Android and Apple devices; however, it has since been taken down). A quick demo video of the app can be found here.
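
As a flavour of the reverberation effect, here is a toy single comb-filter version in Python (a Schroeder-style feedback comb). This is an illustration, not the SensiBol implementation: the production effect was a richer filter chain, and the delay and gain values below are arbitrary.

    import numpy as np

    def comb_reverb(x: np.ndarray, sr: int, delay_ms: float = 50.0,
                    feedback: float = 0.6, wet: float = 0.4) -> np.ndarray:
        """Toy feedback comb filter: y[n] = x[n] + feedback * y[n - D].
        A usable reverb chains several of these plus all-pass filters."""
        d = int(sr * delay_ms / 1000.0)     # delay line length in samples
        y = x.astype(np.float64).copy()
        for n in range(d, len(y)):
            y[n] += feedback * y[n - d]     # recirculate the delayed output
        return (1.0 - wet) * x + wet * y    # mix dry and wet signals

A real-time version would process the audio in short blocks rather than whole files, to stay within the 20 ms latency budget mentioned above.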

Slide3_Effects_Plain.mp3

User singing recording without any effects

Slide3_Effects_helium.mp3

User singing with helium-gas effect (LPC-based)

Slide3_Effects_chipmunk.mp3

User singing with chipmunk effect (PSOLA-based)

Slide3_Effects_reverb.mp3

User singing with reverb effect

Singing correction

During my time at SensiBol, I implemented a singing correction algorithm that automatically corrects the user's singing in both time and pitch; we referred to this effect as singing correction. It was implemented using the PSOLA algorithm. Time correction aligns the user's singing onsets with the ground-truth onsets. For pitch correction, if we snap the pitch to the nearest singing note, the result sounds very artificial; this kind of effect is also referred to as auto-tune. However, if we correct the pitch to the original singer's pitch contour, the result sounds very natural. Check the examples below or on the SensiBol website.
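
A rough sketch of the pitch side of this idea, assuming per-frame pitch (f0) tracks have already been extracted for both recordings. The two functions contrast the contour-following and auto-tune variants described above; the psola_shift step referenced in the final comment is hypothetical and stands in for a real TD-PSOLA pitch shifter.

    import numpy as np

    def pitch_correction_factors(user_f0: np.ndarray,
                                 ref_f0: np.ndarray) -> np.ndarray:
        """Per-frame pitch-shift ratios that map the user's pitch contour
        onto the reference singer's contour (the natural-sounding variant).
        Frames where either track is unvoiced (f0 == 0) are left untouched."""
        factors = np.ones_like(user_f0, dtype=np.float64)
        voiced = (user_f0 > 0) & (ref_f0 > 0)
        factors[voiced] = ref_f0[voiced] / user_f0[voiced]
        return factors

    def autotune_factors(user_f0: np.ndarray) -> np.ndarray:
        """Auto-tune variant: snap each voiced frame to the nearest
        semitone, which is what makes the result sound artificial."""
        factors = np.ones_like(user_f0, dtype=np.float64)
        voiced = user_f0 > 0
        midi = 69 + 12 * np.log2(user_f0[voiced] / 440.0)     # Hz -> MIDI note
        snapped = 440.0 * 2 ** ((np.round(midi) - 69) / 12)   # nearest note -> Hz
        factors[voiced] = snapped / user_f0[voiced]
        return factors

    # Either set of factors would then drive a TD-PSOLA pitch shifter frame
    # by frame; psola_shift below is a hypothetical stand-in for that step:
    #   corrected = psola_shift(audio, factors)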

Slide3_SingingCorrection_Original.mp3

Original singer

Slide3_SingingCorrection_UserAutoTune.mp3

User recording: auto-tuned

Slide3_SingingCorrection_UserPlain.mp3

User recording

Slide3_SingingCorrection_UserSingingCorrection.mp3

User recording: singing correction

Automatic speech recognition on embedded devices in Hindi and English

During my time at SensiBol, I was mainly involved in developing speech solutions. As part of this work, I implemented a WFST-based GMM-HMM speech decoder from scratch to enable porting and optimization of the ASR engine for embedded devices. The decoder was successfully tuned and used for the tasks listed below.
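
Below is a heavily simplified skeleton of the token-passing Viterbi beam search at the core of such a decoder. It is an illustrative sketch, not the SensiBol decoder: epsilon arcs, output-label traceback, lattice generation, and the GMM likelihood computation are all omitted.

    import math
    from collections import defaultdict

    def beam_search(arcs, start, frames, score, beam=10.0):
        """Token-passing Viterbi beam search over a WFST.

        arcs:   dict mapping state -> list of (next_state, input_label, weight)
        start:  initial state
        frames: iterable of acoustic feature vectors
        score:  score(frame, input_label) -> negative log-likelihood
                (in a GMM-HMM decoder this comes from the acoustic model)
        Returns the best cost per active state after the last frame.
        """
        tokens = {start: 0.0}                      # state -> best cost so far
        for frame in frames:
            new_tokens = defaultdict(lambda: math.inf)
            best = math.inf
            for state, cost in tokens.items():
                for nxt, label, weight in arcs.get(state, []):
                    c = cost + weight + score(frame, label)
                    if c < new_tokens[nxt]:
                        new_tokens[nxt] = c
                        best = min(best, c)
            # Beam pruning: drop tokens far from the current best hypothesis.
            tokens = {s: c for s, c in new_tokens.items() if c <= best + beam}
        return tokens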

Hindi wake-up-word and voice-command detection

The phone actively listens for the wake-up-word 'bol SensiBol'. As soon as it hears the wake-up-word, it expects a voice-command. In this video example, we control the playback of a music player using voice-commands in Hindi.
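
The control flow is a simple two-state loop; the sketch below is a hypothetical reconstruction of it (the label stream and command table are assumptions, not the actual app code).

    def voice_control_loop(labels, commands):
        """Two-state wake-word protocol: ignore everything until the
        wake-up-word is heard, then accept exactly one voice-command.
        `labels` is the stream of recognized labels from the decoder and
        `commands` maps command labels to actions (both hypothetical)."""
        armed = False
        for label in labels:
            if not armed:
                armed = (label == "bol SensiBol")   # wait for the wake-up-word
            else:
                action = commands.get(label)
                if action is not None:
                    action()                        # e.g. play/pause the player
                armed = False                       # re-arm: wake word needed again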

Lyric-based scoring of singing

Rates the user's singing performance based on lyric pronunciation.

Spoken language evaluation

Rates children's pronunciation skills in a story-reciting app and gives feedback at the word level.

Voice-based multiple choice questionnaire (MCQ)

An MCQ is presented to the user, and the user's answer is graded as correct, wrong, or out-of-vocabulary (if the user utters something outside the given choices).
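
The grading rule itself is simple; here is a hypothetical sketch (the confidence threshold and its value are assumptions about how out-of-vocabulary rejection might be done, not the shipped logic).

    def grade_answer(hypothesis: str, confidence: float,
                     choices: set, correct: str,
                     reject_threshold: float = 0.5) -> str:
        """Grade a recognized answer: out-of-vocabulary if it falls outside
        the given choices (or the decoder is unsure), else correct/wrong."""
        if hypothesis not in choices or confidence < reject_threshold:
            return "out-of-vocabulary"
        return "correct" if hypothesis == correct else "wrong"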


NOTE: The applications and interfaces for all the work done at SensiBol were built by my awesome colleagues; I only claim to have implemented the audio DSP modules behind them.