Google has announced Translatotron, an “experimental new system” that it says will translate speech directly into speech, removing the need for any text.
“Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language,” a Google AI blog post on Wednesday said.
Google said today’s translation systems typically work in three stages: automatic speech recognition, which transcribes speech as text; machine translation, which translates that text into another language; and text-to-speech synthesis, which uses the translated text to generate speech.
Cascading these steps led to services like Google Translate, but the tech giant now says it will use a single model without the need for text.
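The cascade Google describes can be pictured as three functions chained together, with text passed between them. The following is only a minimal sketch of that idea: the stage functions `toy_asr`, `toy_mt` and `toy_tts` are hypothetical stand-ins invented for illustration, not Google's models or any real API.

```python
from typing import Callable, List

Audio = List[float]  # assumption: audio represented as a plain list of samples


def make_cascade(
    asr: Callable[[Audio], str],
    mt: Callable[[str], str],
    tts: Callable[[str], Audio],
) -> Callable[[Audio], Audio]:
    """Chain the three stages; text is the hand-off format between them."""
    def translate_speech(audio: Audio) -> Audio:
        source_text = asr(audio)       # stage 1: speech -> text
        target_text = mt(source_text)  # stage 2: text -> translated text
        return tts(target_text)        # stage 3: translated text -> speech
    return translate_speech


# Hypothetical toy stages, used only to make the sketch runnable.
def toy_asr(audio: Audio) -> str:
    return "hola"  # pretend every clip is recognised as "hola"


def toy_mt(text: str) -> str:
    return {"hola": "hello"}.get(text, text)


def toy_tts(text: str) -> Audio:
    return [float(ord(ch)) for ch in text]  # fake "waveform" from characters


translate_speech = make_cascade(toy_asr, toy_mt, toy_tts)
result = translate_speech([0.0, 0.1, -0.1])
```

Because each stage consumes the previous stage's output, a mistake made early on is carried through to the end. A direct model would, in effect, replace the three calls with a single speech-to-speech function.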
“Dubbed Translatotron, this system avoids dividing the task into separate stages,” the blog post by Google AI software engineers Ye Jia and Ron Weiss said.
This will mean faster translation and fewer compounding errors, according to Google.
The system takes spectrograms as input and generates spectrograms as output. It also relies on a neural vocoder and a speaker encoder, meaning the translated speech retains the speaker’s vocal characteristics.
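A spectrogram is a picture of how much energy a sound has at each frequency over time. As a rough illustration, the sketch below computes a plain magnitude spectrogram using a naive short-time Fourier transform in pure Python. The function name, frame size and hop size are assumptions chosen for the example, and this is not necessarily the exact feature representation Google's system uses.

```python
import cmath
import math
from typing import List


def spectrogram(signal: List[float], frame_size: int = 64, hop: int = 32) -> List[List[float]]:
    """Magnitude spectrogram via a naive short-time DFT (illustrative and slow)."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # Apply a Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size))
            for n, x in enumerate(frame)
        ]
        # Keep only the non-negative frequency bins (0 .. frame_size / 2).
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# A pure tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

Each row of `spec` is one time slice and each column is one frequency bin, so the test tone shows up as a peak in the same bin in every row. A vocoder performs the reverse step, turning a generated spectrogram back into an audible waveform.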