Getting Started with Azure's Text-to-Speech API -- Microsoft Certified Professional Magazine Online

Getting Started with Azure's Text-to-Speech API

Azure Cognitive Services lets developers create natural-sounding speech without much expertise in machine learning. Here's how.

By Adam Bertram
06/18/2019

Traditionally, when a computer converted text to speech, any listener could tell the result was computer-generated rather than read by a person.

Artificial intelligence (AI) is changing that nowadays with the power of neural networks. AI can now learn what a human is supposed to sound like -- and the results are phenomenal. Just listen to some of the examples built on the Microsoft Azure Cognitive Services text-to-speech feature.

One way to create natural-sounding speech from text is to use the Azure Cognitive Services text-to-speech API. This is a service that developers and admins can use without knowing the ins and outs of machine learning. They just need to know how to call an API method.

Getting started with text-to-speech is easy. You don't even need an Azure account. The text-to-speech service comes with a free seven-day trial. After that, a free Azure account is required to continue using the service at no cost.

When you sign up, you'll receive an API key. Protect your API key! This key allows you to authenticate to Azure and obtain an access token that you use throughout your session, whether you're using one of the supported language SDKs or the REST API.
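To make the key-for-token exchange concrete, here is a minimal sketch in Python (chosen only as a neutral language for showing the REST call; it is not the article's PowerShell module). It assumes the token endpoint pattern and the Ocp-Apim-Subscription-Key header described in Microsoft's Speech service REST documentation. The region value and the function names are hypothetical.

```python
import urllib.request


def token_request_parts(region, api_key):
    """Return the URL and headers for a token request. Nothing is sent here."""
    # Assumption: endpoint pattern taken from the Speech service REST docs.
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    headers = {"Ocp-Apim-Subscription-Key": api_key}
    return url, headers


def fetch_token(region, api_key):
    """POST an empty body and return the access token (plain text response)."""
    url, headers = token_request_parts(region, api_key)
    request = urllib.request.Request(url, data=b"", headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling fetch_token("westus", "<your key>") would return a short-lived token to send as a Bearer token on later requests. Keep the key itself out of source control, for example by reading it from an environment variable.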

If you're a developer who needs to integrate text-to-speech into an application, you're probably going to use one of the language SDKs. As a non-developer trying to learn text-to-speech, I've built a PowerShell module that helps you understand the REST API and what kinds of voices you can use.

If you'd like to follow along with the rest of the article, you'll first need to create an Azure Cognitive Services account. You'll also need to be on Windows and have a PowerShell console set up as administrator. Then, you'll need to install the AzTextToSpeech module using Install-Module AzTextToSpeech.

Once you've got the PowerShell module installed, open up the module's configuration.json file (C:Files<VERSION>.json) and enter your specific Azure Cognitive Services account attributes. Head over to GitHub for a quick Getting Started guide.

Once you've got the AzTextToSpeech module set up, you're ready to begin testing the text-to-speech API. First, to ensure your key is correct, run Get-VoiceAgent. If a list of available voices is returned, you're good to go! If not, check that the API key is correct in your configuration.json file.
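For a sense of what a voice listing might look like at the REST level, here is a rough Python sketch. How Get-VoiceAgent works internally is not shown in this article, so this is not its implementation. The sketch assumes the voices/list endpoint and the ShortName and Locale fields from Microsoft's Speech service REST documentation. The function names and region are hypothetical.

```python
import json
import urllib.request


def voices_request_parts(region, api_key):
    """Return the URL and headers for a voice list request. Nothing is sent here."""
    # Assumption: endpoint pattern taken from the Speech service REST docs.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
    headers = {"Ocp-Apim-Subscription-Key": api_key}
    return url, headers


def short_names(voices, locale=None):
    """Pick the ShortName of each voice, optionally filtered by Locale."""
    return [
        voice["ShortName"]
        for voice in voices
        if locale is None or voice.get("Locale") == locale
    ]


def fetch_voices(region, api_key):
    """GET the voice list and parse the JSON array in the response."""
    url, headers = voices_request_parts(region, api_key)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

An HTTP 401 response from fetch_voices would point to the same problem the article describes: a wrong or missing API key.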

Next, you'll use the ConvertTo-Speech command. This command takes care of nearly all of the hard parts of working with text-to-speech, like getting an access token and crafting the SSML (Speech Synthesis Markup Language) document that describes what to say and which voice to use.
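To give a feel for the SSML that gets crafted, here is a minimal Python sketch that wraps text in a speak element and a voice element. It is an illustration, not the module's actual code. The build_ssml name is hypothetical, and the voice name en-US-GuyNeural is an assumed service-side identifier for the Guy neural voice.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice_name, lang="en-US"):
    """Build a one-voice SSML document, escaping XML special characters."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )
```

Escaping matters because characters such as & or < in the input text would otherwise produce invalid XML, which the service would reject.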

To quickly test out the service, you only need to run ConvertTo-Speech once with a few parameters. The example below converts the text in the Text parameter to speech using the GuyNeural voice and saves the result as an .MP3 file in the specified audio format. Once complete, it opens the .MP3 file in your default audio player and plays it.

$parameters = @{
    Text        = 'Hello, I am Guy. I am an option for a neural voice.'
    VoiceAgent  = 'GuyNeural'
    AudioOutput = 'audio-16khz-128kbitrate-mono-mp3'
    OutputFile  = 'C:\convertedfromtext.mp3'
    PassThru    = $true
}
ConvertTo-Speech @parameters | Invoke-Item

This is a bare-bones example that allows you to test out various voices. To pick a different voice or audio output, use PowerShell's tab completion: type the -VoiceAgent or -AudioOutput parameter, press the space bar, and then press the Tab key repeatedly. This will cycle through the available options and allow you to hear what each voice sounds like.
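For comparison, the REST request behind a conversion like this can be sketched in Python as follows. This is an approximation built from Microsoft's Speech service REST documentation (endpoint pattern, the X-Microsoft-OutputFormat header and the application/ssml+xml content type), not a description of what ConvertTo-Speech does internally. The function names, region and User-Agent value are hypothetical.

```python
import urllib.request


def synthesis_request_parts(region, token, ssml, audio_format):
    """Return URL, headers, and body for a synthesis request. Nothing is sent here."""
    # Assumption: endpoint pattern taken from the Speech service REST docs.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/ssml+xml",
        # Same kind of value as the AudioOutput parameter in the example above.
        "X-Microsoft-OutputFormat": audio_format,
        "User-Agent": "tts-article-sketch",  # hypothetical client name
    }
    return url, headers, ssml.encode("utf-8")


def synthesize_to_file(region, token, ssml, audio_format, output_path):
    """POST the SSML and write the returned audio bytes to output_path."""
    url, headers, body = synthesis_request_parts(region, token, ssml, audio_format)
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(output_path, "wb") as audio_file:
        audio_file.write(audio)
    return output_path
```

The token argument is the short-lived access token obtained with your API key, and the ssml argument is a complete SSML document naming the voice to use.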

If you'd like a complete soup-to-nuts walkthrough of the Azure Cognitive Services text-to-speech feature, I encourage you to check out my Pluralsight course, which covers everything you need to know about getting up and running with text-to-speech.

About the Author

Adam Bertram is a 20-year veteran of IT. He's an automation engineer, blogger, consultant, freelance writer, Pluralsight course author and content marketing advisor to multiple technology companies. Adam also founded the popular TechSnips e-learning platform. He mainly focuses on DevOps, system management and automation technologies, as well as various cloud platforms mostly in the Microsoft space. He is a Microsoft Cloud and Datacenter Management MVP who absorbs knowledge from the IT field and explains it in an easy-to-understand fashion. Catch up on Adam's articles at adamtheautomator.com, connect on LinkedIn or follow him on Twitter at @adbertram or the TechSnips Twitter account @techsnips_io.