Google's AudioPaLM can translate speech in your own voice


With AudioPaLM, Google is adding audio capabilities to its large language model PaLM 2. This enables spoken translations in the original speaker's voice.

With AudioPaLM, Google combines PaLM 2, the large language model introduced in May, with its generative audio model AudioLM in a single multimodal architecture. The system can process and generate both text and speech, and can be used for speech recognition or to generate spoken translations that preserve the original voice.
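The core idea of this architecture is a single decoder that operates over one shared vocabulary containing both text tokens and discrete audio tokens. A minimal sketch of that vocabulary mapping, with illustrative sizes that are assumptions rather than Google's actual configuration:

```python
# Sketch of AudioPaLM's central idea: one Transformer decoder over a
# single vocabulary that concatenates text tokens and audio tokens.
# Vocabulary sizes here are illustrative assumptions.

TEXT_VOCAB_SIZE = 32_000   # e.g. a SentencePiece text vocabulary
AUDIO_VOCAB_SIZE = 1_024   # discrete codes from an audio tokenizer

def audio_token_id(code: int) -> int:
    """Map a discrete audio code into the shared vocabulary,
    offset past all text token ids."""
    assert 0 <= code < AUDIO_VOCAB_SIZE
    return TEXT_VOCAB_SIZE + code

def is_audio_token(token_id: int) -> bool:
    """True if a shared-vocabulary id falls in the audio range."""
    return token_id >= TEXT_VOCAB_SIZE

# A mixed sequence the model can both read and emit:
# text ids stay as-is, audio codes are shifted into their own range.
mixed = [17, 205, audio_token_id(0), audio_token_id(513)]
print([is_audio_token(t) for t in mixed])  # [False, False, True, True]
```

Because text and audio share one token space, the same model can consume speech, emit text, or emit speech, depending only on what sequence it is asked to continue.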

AudioPaLM’s architecture. | Picture: Google

Babelfish gets closer

The latter feature is particularly noteworthy, as it allows the same voice to speak multiple languages, as the following demo shows.

Conditioning on the original voice requires only a three-second sample, passed to the model as audio tokens (SoundStream tokens). If the audio clip is shorter, it is repeated until it reaches three seconds.
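The repeat-until-three-seconds step above can be sketched as a simple tiling of the waveform. The sample rate is an assumption (the article does not state one), and this operates on the raw audio before tokenization:

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed sample rate; not stated in the article
TARGET_SECONDS = 3

def tile_to_three_seconds(audio: np.ndarray, sr: int = SAMPLE_RATE) -> np.ndarray:
    """Repeat a shorter clip until it covers three seconds, then trim."""
    target = sr * TARGET_SECONDS
    if len(audio) == 0:
        raise ValueError("empty audio clip")
    reps = -(-target // len(audio))   # ceiling division
    return np.tile(audio, reps)[:target]

# A one-second clip becomes a three-second conditioning prompt.
one_second = np.zeros(SAMPLE_RATE, dtype=np.float32)
prompt = tile_to_three_seconds(one_second)
print(len(prompt) / SAMPLE_RATE)  # 3.0
```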


AudioPaLM demo. | Video: Google

By integrating AudioLM, AudioPaLM can produce high-quality audio with long-term consistency. This includes the ability to produce semantically plausible speech continuations while preserving speaker identity and prosody for speakers not seen during training.

The model can also perform zero-shot speech-to-text translations for many languages, including speech combinations not encountered during training. This capability can be important for real-world applications such as real-time multilingual communication.

AudioPaLM can also preserve paralinguistic information such as speaker identity and intonation, which is often lost in traditional speech-to-text translation systems. Based on automatic and human evaluations, Google reports that the system outperforms existing solutions in speech quality.

In addition to generating speech, AudioPaLM can produce transcripts, either in the original language or directly as a translation. The model achieves top results on speech translation benchmarks and competitive performance on speech recognition tasks.


More examples are available on the project page on GitHub.
