Deep Bidirectional LSTM based RNN for Casual Speech to Clear Speech conversion

Singh, S.

Location: Graduate College of the University of Illinois at Chicago
PDF: shiwangisinghresearchprojectfinal-2.pdf

Caption: Male (top), Female (middle), Output (bottom) of the synthesized speech of the ARCTIC corpus
Credit: Shiwangi Singh, UIC/EVL

Clear Speech consists of crisp, loud, and slowly spoken sentences intended to increase intelligibility: the utterances are distinguishable and audible without much strain on the human ear. For individuals with auditory comprehension disabilities, clear speech would be easier to comprehend than casual or normal speech. In this effort, we try to automate the conversion of casual speech to clear speech using machine translation.

In the current work, we develop a Recurrent Neural Network (RNN) trained on parallel corpora: one corpus serves as the input, and the network predicts the output using parameters learned from the other corpus. Based on the literature, we use a Deep Bidirectional Long Short-Term Memory (LSTM) based RNN model because of its proven effectiveness in the speech translation domain. More generally, we convert voice A to voice B by passing voice A through multiple forward-backward LSTM layers. The network is trained via backpropagation through time to regress voice B, exploiting its ability to learn from both past and future time steps in voice A.
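The forward-backward idea described above can be illustrated with a minimal sketch of the forward pass only. This is not the project's implementation: the function names (`init_lstm`, `lstm_pass`, `bilstm_layer`, `deep_blstm`, `build_model`, `mse`), the layer sizes, the random untrained weights, and the mean-squared-error objective are all assumptions made for illustration, and training via backpropagation through time is omitted. Each frame of voice A is represented as a plain list of acoustic features.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_lstm(input_dim, hidden_dim, rng):
    """Random (untrained) weights for one LSTM direction; 4 gates stacked row-wise."""
    rows, cols = 4 * hidden_dim, input_dim + hidden_dim
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows

def lstm_pass(frames, params, hidden_dim, reverse=False):
    """Run one LSTM direction over the sequence; return one hidden vector per frame."""
    W, b = params
    H = hidden_dim
    h, c = [0.0] * H, [0.0] * H
    steps = range(len(frames))
    outputs = [None] * len(frames)
    for t in (reversed(steps) if reverse else steps):
        # Gate pre-activations from the current frame and the previous hidden state.
        z = [a + bias for a, bias in zip(matvec(W, frames[t] + h), b)]
        i = [sigmoid(v) for v in z[0:H]]           # input gate
        f = [sigmoid(v) for v in z[H:2 * H]]       # forget gate
        o = [sigmoid(v) for v in z[2 * H:3 * H]]   # output gate
        g = [math.tanh(v) for v in z[3 * H:]]      # candidate cell update
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        outputs[t] = h
    return outputs

def bilstm_layer(frames, fwd, bwd, hidden_dim):
    """Concatenate past-aware (forward) and future-aware (backward) states per frame."""
    hf = lstm_pass(frames, fwd, hidden_dim)
    hb = lstm_pass(frames, bwd, hidden_dim, reverse=True)
    return [past + future for past, future in zip(hf, hb)]

def build_model(feat_dim, hidden_dim, num_layers, seed=0):
    rng = random.Random(seed)
    layers, in_dim = [], feat_dim
    for _ in range(num_layers):
        layers.append((init_lstm(in_dim, hidden_dim, rng),
                       init_lstm(in_dim, hidden_dim, rng)))
        in_dim = 2 * hidden_dim  # next layer sees both directions
    W_out = [[rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_dim)]
             for _ in range(feat_dim)]
    return layers, (W_out, [0.0] * feat_dim)

def deep_blstm(frames, layers, out_params, hidden_dim):
    """Stack bidirectional layers, then linearly regress one output frame per input frame."""
    x = frames
    for fwd, bwd in layers:
        x = bilstm_layer(x, fwd, bwd, hidden_dim)
    W_out, b_out = out_params
    return [[a + bias for a, bias in zip(matvec(W_out, h), b_out)] for h in x]

def mse(pred, target):
    """Assumed regression objective: mean squared error over all frame features."""
    n = sum(len(frame) for frame in pred)
    return sum((p - q) ** 2
               for a, b in zip(pred, target) for p, q in zip(a, b)) / n

# Hypothetical toy data: 8 time-aligned frames of 5 features for voice A and voice B.
data_rng = random.Random(1)
voice_a = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]
voice_b = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]
layers, out_params = build_model(feat_dim=5, hidden_dim=4, num_layers=2)
predicted = deep_blstm(voice_a, layers, out_params, hidden_dim=4)
print(len(predicted), len(predicted[0]), mse(predicted, voice_b))
```

Because the backward pass reads the sequence from the end, the prediction for the first frame already depends on the last frame of voice A, which is the property the bidirectional model relies on. In practice a deep learning framework would supply the LSTM layers and the gradient computation.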

We performed our preliminary experiments on a male-to-female speech conversion task, after which we were able to extend the developed pipeline to casual-to-clear speech conversion.