Microsoft has created a cutting-edge text-to-speech AI model called VALL-E that can replicate a speaker's voice from a 3-second audio sample and even mimic the speaker's tone!
The model was trained on 60,000 hours of English-language speech from over 7,000 speakers and builds on EnCodec, Meta's AI-powered neural audio compression codec. Microsoft has made a large number of audio samples available on GitHub to demonstrate what the tool can do.
While the company plans to further improve the model's performance by adding more training data, it has chosen not to release the code as open source in order to prevent potential misuse, such as spoofing voice identification or impersonating a specific speaker.