Creepy New AI Can Simulate Your Voice Perfectly After Hearing It for 3 Seconds

Modern technology has revolutionized the way we get things done. Even the most basic smartphones in most people's pockets and the smart home devices in our living rooms have an impressive number of capabilities, especially when you consider that you can control them simply by talking, thanks to artificial intelligence (AI). But even as computers have progressed to make our lives easier, they're also entering new territory as they become able to mimic human behavior and even think for themselves. And now, one creepy new form of AI can simulate your voice perfectly after hearing it for just three seconds. Read on to learn more about the groundbreaking technology.

Microsoft has developed a new type of AI that can flawlessly simulate your voice.

We've all relied on machines to make our daily lives easier in one way or another. But what if a computer could step in and mimic the way you speak without others even noticing?

Last week, researchers at Microsoft announced they had developed a new text-to-speech AI model they've dubbed VALL-E, Ars Technica reports. The technology can simulate a person's voice from a three-second audio clip, even preserving the original speaker's emotional tone and the acoustics of the environment in which the clip was recorded. The team says the model could be handy for automatically generating speech from text, even though it comes with the risk of highly sophisticated dupes similar to deepfake videos.

The company says the new tech is based on a "neural codec language model."

In its paper discussing the new tech, Microsoft dubs VALL-E a "neural codec language model." What this means is that while traditional text-to-speech (TTS) software typically synthesizes speech by manipulating waveforms, VALL-E generates discrete audio codec codes from the text and a short audio prompt. That lets it pick up subtle elements of a voice and create a convincing recreation of that person speaking any sentence that's fed to it, according to the website Interesting Engineering.

"To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively," the team explains in their paper. "Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."
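
The data flow described in that quote can be sketched in a few lines of Python. This is a toy illustration only, not Microsoft's code: the function names (`encode_prompt`, `generate_acoustic_tokens`, `decode_tokens`), the codebook size of 1,024, and the random stand-in for the model are all assumptions made for the sake of the example. A real system would use a trained neural audio codec and a trained language model in their place.

```python
import random
from typing import List

CODEBOOK_SIZE = 1024  # assumed size of the codec's discrete vocabulary


def encode_prompt(samples: List[float]) -> List[int]:
    """Hypothetical stand-in for a neural codec encoder.

    Maps each audio sample in [-1.0, 1.0] to a discrete token ID.
    """
    tokens = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        tokens.append(int((clipped + 1.0) / 2.0 * (CODEBOOK_SIZE - 1)))
    return tokens


def generate_acoustic_tokens(
    phonemes: List[str], prompt_tokens: List[int], tokens_per_phoneme: int = 3
) -> List[int]:
    """Hypothetical stand-in for the language model.

    Emits acoustic tokens whose values depend on both the phoneme prompt
    (content) and the enrolled recording's tokens (speaker). A seeded
    random generator plays the role of the trained model here.
    """
    rng = random.Random(repr((phonemes, prompt_tokens)))
    length = len(phonemes) * tokens_per_phoneme
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(length)]


def decode_tokens(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for the codec decoder: tokens back to samples."""
    return [t / (CODEBOOK_SIZE - 1) * 2.0 - 1.0 for t in tokens]


# Pretend inputs: a tiny "enrolled recording" and a phoneme prompt.
prompt_audio = [0.0, 0.25, -0.5, 1.0]
phonemes = ["HH", "AH", "L", "OW"]

prompt_tokens = encode_prompt(prompt_audio)                          # speaker information
acoustic_tokens = generate_acoustic_tokens(phonemes, prompt_tokens)  # content + speaker
waveform = decode_tokens(acoustic_tokens)                            # final audio samples
```

The point of the sketch is the order of operations: the short recording and the text are both turned into tokens, the model generates new tokens conditioned on both, and only at the end are those tokens decoded back into audio.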

The team used about 60,000 hours of recorded speech to train the new AI.

To develop the new model, the team says it used about 60,000 hours of recorded English speech from more than 7,000 individual speakers, drawn from LibriLight, an audio library assembled by Meta. Most of the recordings come from readings of public-domain audiobooks stored on LibriVox, Ars Technica reports. In its trials, the team found that the voice in the three-second sample needs to closely resemble one of the voices in the training data for VALL-E to produce a convincing result.

The team is now showcasing its work by posting examples of the software in action on a GitHub page. Each example provides a three-second clip of a speaker's voice reading random text and a "ground truth," which is a recording of the same speaker reading a sentence, to be used for comparison. The page then provides a "baseline" recording to show how typical TTS software would generate the spoken audio and a "VALL-E" version to compare against the previous two.

While the results aren't perfect, they include some very convincing examples in which the machine-generated speech sounds shockingly human. The researchers add that besides mimicking inflection and emotion, the software can replicate the environment in which the base audio was recorded, for example making it sound like someone is speaking outdoors, in an echoing room, or on a phone call.

So far, Microsoft hasn't released the program for others to test or experiment with.

The research team concludes its paper by saying it plans to increase the amount of training data to help the model improve its speaking styles and become better at mimicking human voices. For the time being, Microsoft has held back from making the new software available for developers or the general public to test, potentially because of its ability to trick people or be used for nefarious purposes.

"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker," the authors wrote in their conclusion. "To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."

Zachary Mack
Zach is a freelance writer specializing in beer, wine, food, spirits, and travel. He is based in Manhattan.