Tortoise, an AI model that can take up to a minute to synthesize a single sentence, has arrived. It may seem slow at first sight, but it works and it is available, which is not the case for Vall-E, Microsoft's voice synthesis model.
Ever since Le Monde Informatique started covering the rise of various AI applications, such as image generation (notably with Stable Horde), code repositories on GitHub and links on Reddit have been teeming with AI models. Some of them are found on commercial sites, which develop their own algorithms or adapt others that have been published as open source. A good example of an existing audio AI site is Uberduck.ai, which offers literally hundreds of pre-programmed voice models. Just enter text in the field provided, and an Elon Musk, a Bill Gates, a Daffy Duck, or even a Siri will read your lines aloud.
To train an AI to reproduce speech, you need to upload clear voice samples. The AI learns how the speaker combines sounds, then refines these relationships and mimics the results. Normally, putting together a good voice model takes a fair amount of training, with long samples that show how a particular person speaks. In recent days, however, something new has appeared: Microsoft's Vall-E, described in a research paper (with real-life examples), is a voice synthesis model that requires only a few seconds of source audio to generate a fully programmable voice. Naturally, researchers and other AI enthusiasts wanted to know whether the Vall-E model had been made available to the public. The answer is no. In the meantime, you can play with another model, called Tortoise. (Its author specifies that it is called Tortoise because it is slow, which is true, but it works.)
The overview of VALL-E. (Credit: VALL-E / Microsoft)
Train your own AI voice with Tortoise
What makes Tortoise interesting is that anyone can train the model on the voice of their choice simply by uploading a few audio clips. The GitHub page for the project says it takes a few clips of about a dozen seconds each. They must then be saved as .WAV files at a specific quality. How does it work? Thanks to a little-known cloud service: Google Colab (or “Colaboratory”). It lets you write and run Python code in your browser with no configuration required, with free access to GPUs and easy sharing. The code you (or someone else) write can be stored in a notebook, which can then be shared with anyone who has a standard Google account. The shared Tortoise notebook is here.
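Before uploading clips, it can help to check their length and sample rate locally. The sketch below is only an illustration using Python's standard `wave` module. The article does not state the required quality, so the target values (`TARGET_SAMPLE_RATE`, `MIN_SECONDS`, `MAX_SECONDS`) are assumptions to be replaced with whatever the Tortoise GitHub page asks for, and the helper names are hypothetical. Note that the `wave` module reads only integer PCM files, so it will raise an error on floating-point .WAV files.

```python
import wave

# Assumed targets, NOT taken from the article: replace them with the
# values given on the Tortoise GitHub page.
TARGET_SAMPLE_RATE = 22050  # Hz (assumption)
MIN_SECONDS = 6.0           # assumption for "about a dozen seconds"
MAX_SECONDS = 15.0          # assumption


def inspect_clip(path):
    """Return (sample_rate, duration_in_seconds) for an integer PCM .WAV file."""
    with wave.open(path, "rb") as clip:
        rate = clip.getframerate()
        duration = clip.getnframes() / rate
    return rate, duration


def clip_problems(path):
    """Return a list of human-readable problems; an empty list means the clip passes."""
    rate, duration = inspect_clip(path)
    problems = []
    if rate != TARGET_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_SAMPLE_RATE} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"duration is {duration:.1f} s, expected {MIN_SECONDS}-{MAX_SECONDS} s"
        )
    return problems
```

Calling `clip_problems("my_clip.wav")` (a hypothetical file name) on each recording returns an empty list when the clip matches the assumed targets, and a short description of each mismatch otherwise.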
The interface looks intimidating, but it’s not that bad. You must be logged in to a Google account, then click on “Connect” in the upper right corner. While this notebook does not upload anything to your Google Drive, other notebooks may. The audio files it generates, on the other hand, are stored in the browser but can be downloaded to your PC. One important caveat: when you run code written by someone else, you may get error messages, either because of bad input or because Google has a problem behind the scenes, such as having no GPU available. This is all a bit experimental.
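One way to tell a missing GPU apart from other errors is to check for one before running the heavier blocks. The following is a minimal sketch, assuming the runtime exposes NVIDIA GPUs through the `nvidia-smi` command-line tool; the `gpu_available` helper is hypothetical and is not part of the Tortoise notebook.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    # Assumption: GPUs on this runtime are NVIDIA cards visible to nvidia-smi.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_available())
```

If this prints `False`, the error is more likely on the runtime side than in the text or audio you entered.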
The Tortoise Colab. Click the “Connect” button to get started, then click the little “play” icon next to each block of code in turn. (Credit: Mark Hachman/IDG)
Each block of code has a small “play” icon that appears when you hover your mouse over it. Click “play” on each block in turn, waiting for it to finish before moving on to the next one.
Without going into detail, note that the red text can be edited by the user, such as the suggested text that you want the model to speak. About seven blocks down, you have the option to train the model: name it, then upload the audio files. Once this is done, select the new voice model in the fourth block, run the code, and then set the text in the third block. Finally, run that block of code. If all goes as expected, the result is a short audio clip in the sampled voice. It works rather well, and the result is strikingly lifelike.