Microsoft’s New AI Bot Is Set to Replicate Your Voice in Seconds
Microsoft’s new AI bot can listen to a voice for 3 seconds, then mimic that voice
Researchers at tech giant Microsoft have developed a text-to-speech AI model called VALL-E. Once trained, Microsoft’s new AI bot can closely imitate a person’s voice, and the most interesting part is that it needs only a three-second audio sample of that person to do so.
It is the newest of many AI algorithms that can take a recording of a person’s voice and make it say words and sentences that person never spoke, and it is remarkable for how small a scrap of audio it needs to extrapolate an entire human voice. Where 2017’s Lyrebird algorithm from the University of Montreal, for example, required a full minute of speech to analyze, Microsoft’s new AI bot needs just a three-second audio snippet. Additionally, the researchers claim that once the AI bot has been given a particular voice, it can create audio of that person saying anything while attempting to preserve the speaker’s emotional tone and the context in which they are speaking.
The creators of VALL-E suggest it could be combined with other generative AI models such as GPT-3 to produce content, and could be used for high-quality text-to-speech applications and for speech editing, in which a recording of a person’s voice is edited and altered from a text transcript. VALL-E is built on EnCodec, a neural audio codec that Meta unveiled in October 2022. In contrast to traditional text-to-speech systems, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts.
After analyzing how a person sounds, VALL-E uses EnCodec to break that information into discrete components called tokens. It then draws on its training data to match what it “knows” about how that voice would sound if it spoke phrases beyond the three-second sample.
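To make the idea of turning sound into discrete tokens more concrete, here is a minimal Python sketch. It is not EnCodec and not VALL-E's actual code: the codebooks `COARSE` and `FINE`, the sample values, and the function names are all made up for illustration. It only shows the general idea of residual quantization, where a continuous value is replaced by a short list of codebook indices.

```python
# Toy residual quantization. This is NOT EnCodec; the codebooks and
# values below are hypothetical and exist only to illustrate how a
# continuous signal can be turned into discrete codes ("tokens").

def nearest(codebook, x):
    """Return the index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def encode(values, codebooks):
    """Give each value one code per codebook, quantizing the leftover each time."""
    tokens = []
    for x in values:
        residual = x
        codes = []
        for cb in codebooks:
            idx = nearest(cb, residual)
            codes.append(idx)
            residual -= cb[idx]  # the next codebook refines what is left
        tokens.append(tuple(codes))
    return tokens

def decode(tokens, codebooks):
    """Rebuild approximate values by summing the chosen codebook entries."""
    return [sum(cb[i] for cb, i in zip(codebooks, codes)) for codes in tokens]

COARSE = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical first-stage codebook
FINE = [-0.2, -0.1, 0.0, 0.1, 0.2]     # hypothetical second-stage codebook

samples = [0.62, -0.38, -0.13]
tokens = encode(samples, [COARSE, FINE])
approx = decode(tokens, [COARSE, FINE])
print(tokens)
print(approx)
```

A real neural codec learns its codebooks from data and works on frames of audio rather than single numbers, but the output has the same shape: a sequence of small integers that a language-model-style system can predict.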
The AI has been trained on some 60,000 hours of English speech, mainly, it seems, from audiobook narrators, and the researchers have presented a large set of samples in which VALL-E attempts to puppeteer a range of human voices. Some capture the essence of the speech so well, creating new lines that sound authentic, that you may find it difficult to tell the real voice from the synthesis. In other cases, the only red flag is the AI emphasizing words in unexpected places in the phrase.
VALL-E can replicate the “acoustic environment” of the sample audio in addition to keeping the vocal timbre and emotional tone of a speaker.
For instance, if the sample came from a phone call, the synthesized output will imitate the acoustic and frequency characteristics of a phone call, which is another way of saying that it will sound like a phone call too. In addition, Microsoft’s samples (included in the “Synthesis of Diversity” section) show how VALL-E can produce a variety of voice tones by changing the random seed used during generation. Microsoft AI Research is building artificial intelligence systems that support human reasoning, with the aim of enhancing and broadening our knowledge and skills.
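The role of the random seed mentioned above can be illustrated with a small Python sketch. This is an assumption-laden toy, not VALL-E's sampling code: the token names, the weights, and the `sample_tokens` function are hypothetical. It shows only the general behavior of seeded sampling, where the same seed reproduces the same output and different seeds give different draws from the same distribution.

```python
import random

# Hypothetical "tone" tokens and probabilities; a real model would
# predict a fresh distribution over codec codes at every step.
TOKENS = ["flat", "rising", "falling", "emphatic"]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]

def sample_tokens(seed, length=12):
    """Draw a token sequence from a fixed distribution using the given seed."""
    rng = random.Random(seed)  # private generator, so runs are reproducible
    return rng.choices(TOKENS, weights=WEIGHTS, k=length)

first = sample_tokens(seed=0)
repeat = sample_tokens(seed=0)
variants = {tuple(sample_tokens(seed=s)) for s in range(20)}

print(first == repeat)   # the same seed reproduces the same sequence
print(len(variants))     # how many distinct sequences 20 seeds produced
```

The text and the voice prompt stay the same in every run; only the seed changes, which is why the resulting samples can differ in delivery while remaining the same speaker saying the same words.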