Ankit Tatawat and Tarun Bisht present their research work at WISSAP 2023, IIT Kanpur

Ankit Tatawat (MTech student, IEOR) and Tarun Bisht (MS student, IEOR) presented their work, co-authored with Prof. Balamurugan Palaniappan, on "Direct Speech to Speech Translation and Voice Interpolation" at WISSAP 2023, held at IIT Kanpur from 18-23 December 2023.

Speech-to-speech translation (S2ST) systems translate speech in one language into speech in another. S2ST extends machine translation (MT): the input and output are speech rather than text. A cascaded, three-stage pipeline is very common for S2ST: the source audio is first transcribed into text, the text is then translated by a text-based MT system (usually a sequence-to-sequence, or seq2seq, model), and the translated text is finally converted back into speech by a text-to-speech system. Direct S2ST systems are seq2seq models that convert source speech sequences into target speech sequences without generating intermediate text. The current state-of-the-art direct S2ST system still relies on a spectrogram-based pipeline: raw speech audio is first converted into spectrograms, the source speech spectrograms are converted into target speech spectrograms, and a vocoder then converts the target spectrograms into speech. Our research focuses on converting raw source speech directly into target speech using an encoder-decoder architecture. Voice preservation, which aims to retain the source speaker's voice in the target speech, is also an active area of research. In our work, instead of preserving the voice, we dynamically change the characteristics of the source voice in the translated speech. We introduce a voice modulation system in latent space that conditions the generation of translated speech: interpolating in this latent space conditions the decoder to change the characteristics of the voice in the translated speech.
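
As a rough illustration of what interpolation in a latent space can look like, the sketch below linearly blends two voice latent vectors and produces a sequence of intermediate points. It is not the presented system: the function names `interpolate_voice` and `voice_path`, the use of plain linear interpolation, and the toy two-dimensional vectors are all assumptions made for this example. In a real model, each blended vector would be passed to the decoder as its conditioning input.

```python
from typing import List, Sequence


def interpolate_voice(z_a: Sequence[float], z_b: Sequence[float], alpha: float) -> List[float]:
    """Linearly blend two (hypothetical) voice latent vectors.

    alpha = 0.0 returns z_a, alpha = 1.0 returns z_b, and values in
    between give a mixture of the two voice characteristics.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def voice_path(z_a: Sequence[float], z_b: Sequence[float], steps: int) -> List[List[float]]:
    """Return `steps` evenly spaced latent vectors from z_a to z_b (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [interpolate_voice(z_a, z_b, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    # Toy latent vectors standing in for two different voices (assumed values).
    voice_a = [0.0, 2.0]
    voice_b = [4.0, 6.0]
    for z in voice_path(voice_a, voice_b, steps=3):
        # In a real system, z would condition the decoder that generates speech.
        print(z)
```

Moving `alpha` gradually from 0 to 1 is what lets the voice characteristics change smoothly rather than switching abruptly between two fixed voices.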
