{"id":12756,"date":"2023-01-09T21:17:13","date_gmt":"2023-01-09T21:17:13","guid":{"rendered":"https:\/\/blog.bwgamespot.com\/index.php\/2023\/01\/09\/this-new-ai-can-mimic-human-voices-with-only-3-seconds-of-training\/"},"modified":"2023-01-09T21:17:13","modified_gmt":"2023-01-09T21:17:13","slug":"this-new-ai-can-mimic-human-voices-with-only-3-seconds-of-training","status":"publish","type":"post","link":"https:\/\/blog.bwgamespot.com\/index.php\/2023\/01\/09\/this-new-ai-can-mimic-human-voices-with-only-3-seconds-of-training\/","title":{"rendered":"This new AI can mimic human voices with only 3 seconds of training"},"content":{"rendered":"<p>Humanity has taken yet another step toward the inevitable war against the machines (which we will lose) with the creation of Vall-E, an AI developed by a team of researchers at Microsoft that can produce high-quality human voice replications with only a few seconds of audio training.<\/p>\n<p>Vall-E isn&#8217;t the first AI-powered voice tool\u2014<a href=\"https:\/\/www.pcgamer.com\/this-fan-made-skyrim-trailer-uses-entirely-ai-synthesized-voice-acting\/\" target=\"_blank\" rel=\"noopener\">xVASynth<\/a>, for instance, has been kicking around for a couple of years now\u2014but it promises to exceed them all in terms of pure capability. In a paper available on <a href=\"https:\/\/arxiv.org\/abs\/2301.02111\" target=\"_blank\" rel=\"noopener\">arXiv<\/a> (via <a href=\"https:\/\/www.windowscentral.com\/software-apps\/microsofts-vall-e-can-imitate-any-voice-with-just-a-three-second-sample\" target=\"_blank\" rel=\"noopener\">Windows Central<\/a>), the Vall-E researchers say that most current text-to-speech systems are limited by their reliance on &#8220;high-quality clean data&#8221; to accurately synthesize high-quality speech.<\/p>\n<p>&#8220;Large-scale data crawled from the Internet cannot meet the requirement, and always lead to performance degradation,&#8221; the paper states. 
&#8220;Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario.&#8221;<\/p>\n<p>(&#8220;<a href=\"https:\/\/towardsdatascience.com\/understanding-zero-shot-learning-making-ml-more-human-4653ac35ccab\" target=\"_blank\" rel=\"noopener\">Zero-shot scenario<\/a>&#8221; in this case essentially means the ability of the AI to recreate voices without being specifically trained on them.)<\/p>\n<p>Vall-E, on the other hand, is trained on a much larger and more diverse data set: 60,000 hours of English-language speech drawn from more than 7,000 unique speakers, all of it transcribed by speech recognition software. The data fed to the AI contains &#8220;more noisy speech and inaccurate transcriptions&#8221; than that used by other text-to-speech systems, but the researchers believe the sheer scale and diversity of the input make it much more flexible, adaptable, and\u2014this is the big one\u2014natural than its predecessors.<\/p>\n<p>&#8220;Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity,&#8221; states the paper, which is filled with numbers, equations, diagrams, and other such complexities. &#8220;In addition, we find VALL-E could preserve the speaker\u2019s emotion and acoustic environment of the acoustic prompt in synthesis.&#8221;<\/p>\n<p><span class=\"credit\">(Image credit: Vall-E)<\/span><\/p>\n<p>You can actually hear Vall-E in action on <a href=\"https:\/\/valle-demo.github.io\/\" target=\"_blank\" rel=\"noopener\">GitHub<\/a>, where the research team has shared a brief breakdown of how it all works, along with dozens of samples of inputs and outputs. 
The quality varies: Some of the voices are notably robotic, while others sound quite human. But as a first-pass tech demo, it&#8217;s impressive. Imagine where this technology will be in a year, or two, or five, as systems improve and the voice training dataset expands even further.<\/p>\n<p>Which is of course why it&#8217;s a problem. Dall-E, the AI art generator, is facing pushback over <a href=\"https:\/\/www.pcgamer.com\/ai-generated-images-face-getty-ban-as-privacy-and-ownership-concerns-grow\/\" target=\"_blank\" rel=\"noopener\">privacy and ownership concerns<\/a>, and the ChatGPT bot is convincing enough that it was recently <a href=\"https:\/\/www.pcgamer.com\/nyc-public-schools-have-banned-chatgpt-over-cheating-concerns-while-the-bot-itself-insists-it-is-anti-plagiarism\/\" target=\"_blank\" rel=\"noopener\">banned by the New York City Department of Education<\/a>. Vall-E has the potential to be even more worrying, because it could be used in scam marketing calls or to reinforce deepfake videos. That may sound a bit hand-wringy, but as our executive editor Tyler Wilde said at the start of the year, this stuff <a href=\"https:\/\/www.pcgamer.com\/ai-art-isnt-going-away\/\" target=\"_blank\" rel=\"noopener\">isn&#8217;t going away<\/a>, and it&#8217;s vital that we recognize the issues and regulate the creation and use of AI systems before potential problems turn into real (and real big) ones.<\/p>\n<p>The Vall-E research team addressed those &#8220;broader impacts&#8221; in the conclusion of its paper. &#8220;Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,&#8221; the team wrote. &#8220;To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. 
We will also put <a href=\"https:\/\/www.microsoft.com\/en-us\/ai\/responsible-ai?rtc=1&amp;activetab=pivot1%3Aprimaryr6\" target=\"_blank\" rel=\"noopener\">Microsoft AI Principles<\/a> into practice when further developing the models.&#8221;<\/p>\n<p>In case you need further evidence that on-the-fly voice mimicry leads to bad places:<\/p>\n<div class=\"youtube-video\">\n<div class=\"video-aspect-box\"><\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>[#item_image]This new AI can mimic human voices with only 3 seconds of training<\/p>\n","protected":false},"author":0,"featured_media":12757,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[20],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/posts\/12756"}],"collection":[{"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/comments?post=12756"}],"version-history":[{"count":0,"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/posts\/12756\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/media\/12757"}],"wp:attachment":[{"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/media?parent=12756"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/categories?post=12756"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.bwgamespot.com\/index.php\/wp-json\/wp\/v2\/tags?post=12756"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}