Converting Text to Speech with Azure Cognitive Services and PowerShell -- Microsoft Certified Professional Magazine Online

Converting Text to Speech with Azure Cognitive Services and PowerShell

Thanks to the AI features in Microsoft's cloud, text-to-speech technology can render more realistic results than ever.

By Adam Bertram
05/23/2019

Imagine being able to have any text read to you like a book. No, not by your mama, but by artificial intelligence (AI).

Text-to-speech technology has been around for a long time, but it's always sounded robotic and it was blatantly obvious that a computer was translating the text to speech. Now, with the Text to Speech API in Microsoft's Azure Cognitive Services API suite, we're able to get text read in a way that's nearly indistinguishable from a human. How so? I'm glad you asked!

Prerequisites
To get started, you'll need an Azure subscription and an Azure Cognitive Services account. Technically, you could get a trial API key, but I always prefer to test with the real thing. Besides, we can still make it happen for free using the F0 SKU.

Also, if you haven't already, be sure to check out my other article on setting up your Azure Cognitive Services account with PowerShell. I'll be assuming that you've already authenticated to Azure in your PowerShell session.

There are many ways to interact with the text-to-speech API using various SDKs. To keep this article as generic as possible, we'll work directly with the REST API and use PowerShell as the language of choice.

Getting a Token
Once you've got a PowerShell session open, authenticated to your Azure subscription and set up an Azure Cognitive Services account, you should be able to retrieve your API keys. You'll need one of these to authenticate to Azure Cognitive Services. We do this by running Get-AzCognitiveServicesAccountKey and providing the resource group name and name of the Azure Cognitive Services account we set up earlier.

PS> $keys = Get-AzCognitiveServicesAccountKey -ResourceGroupName 'CognitiveServices' -Name 'Testing'
PS> $keys

Key1                        Key2
----                        ----
XXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXX

You'll only need the first key to authenticate. Once you've got the key, you need to exchange it for a token. The token is only valid for 10 minutes, so if you suddenly begin to get errors relating to authentication failures, be sure to request another one. The code to do so is below:

$headers = @{
    'Ocp-Apim-Subscription-Key' = $keys.Key1
    'Content-Length'            = '0'
}

$params = @{
    'Uri'         = 'https://eastus.api.cognitive.microsoft.com/sts/v1.0/issuetoken'
    'ContentType' = 'application/x-www-form-urlencoded'
    'Headers'     = $headers
    'Method'      = 'POST'
}

$token = Invoke-RestMethod @params

The only item that you may not immediately know is the endpoint URI. In this article, mine uses the East US region because this is where my subscription is located. For a full list, take a look at the Microsoft documentation.
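
Because the token expires after 10 minutes, it can help to record when it was issued and request a new one shortly before the deadline. The following is a minimal sketch rather than part of the walkthrough above: the function names Get-IssueTokenUri and Test-TokenFresh and the nine-minute safety margin are assumptions made for illustration, and the URI pattern simply generalizes the East US address used earlier.

```powershell
# Hypothetical helper: build the token endpoint for a given region,
# following the pattern of the East US URI used above.
function Get-IssueTokenUri {
    param([Parameter(Mandatory)][string]$Region)
    'https://{0}.api.cognitive.microsoft.com/sts/v1.0/issuetoken' -f $Region
}

# Hypothetical helper: returns $true while a token issued at $IssuedAt is
# younger than $MaxAgeMinutes (9 by default, leaving a one-minute margin).
function Test-TokenFresh {
    param(
        [Parameter(Mandatory)][datetime]$IssuedAt,
        [int]$MaxAgeMinutes = 9,
        [datetime]$Now = (Get-Date)
    )
    ($Now - $IssuedAt).TotalMinutes -lt $MaxAgeMinutes
}
```

One way to use this is to store Get-Date in a variable right after the token request returns, then call Test-TokenFresh before each API call and repeat the token request whenever it returns $false.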

Querying the API
There are essentially two endpoints you can use with the text-to-speech API: one for retrieving information, such as the list of voices, and one for doing the text-to-speech conversion. The endpoint to get information is https://<region>.tts.speech.microsoft.com/cognitiveservices/, where <region> is the region of your account (eastus in my case).

We can query this endpoint URI by appending a path such as voices/list to find all of the available voices in our region. Notice that I must use the Authorization header and specify Bearer <MyToken> to authenticate:

$headers = @{
    'Authorization' = "Bearer $token"
}

$params = @{
    Headers     = $headers
    Method      = 'GET'
    Uri         = "https://eastus.tts.speech.microsoft.com/cognitiveservices/voices/list"
}
Invoke-RestMethod @params
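
The list of voices can be long, so it's often handy to narrow it down before picking one. Here is a small sketch of that filtering. It assumes each returned entry carries ShortName, Gender and Locale properties; the sample objects below are made up to stand in for the real response so the snippet runs offline.

```powershell
# Sample data standing in for the output of Invoke-RestMethod @params.
# The property names are an assumption about the shape of the response.
$voices = @(
    [pscustomobject]@{ ShortName = 'en-US-JessaNeural'; Gender = 'Female'; Locale = 'en-US' }
    [pscustomobject]@{ ShortName = 'en-US-GuyNeural';   Gender = 'Male';   Locale = 'en-US' }
    [pscustomobject]@{ ShortName = 'de-DE-KatjaNeural'; Gender = 'Female'; Locale = 'de-DE' }
)

# Keep only the US English neural voices and show the key columns.
$usNeural = $voices | Where-Object { $_.Locale -eq 'en-US' -and $_.ShortName -like '*Neural' }
$usNeural | Select-Object ShortName, Gender
```

Against the live API, you would assign the result of Invoke-RestMethod @params to $voices instead of building the sample array.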

Converting Text to Speech
Now that we have all of the boring stuff out of the way, we can do some actual conversion. We have a ton of options here, but to demonstrate one, I'll convert the text "Hello, you're now reading MCPMag." to speech using the JessaNeural voice agent and return the audio as audio-16khz-64kbitrate-mono-mp3. I'll also need to write the Speech Synthesis Markup Language (SSML) document that's passed via the HTTP body; it defines the voice, the gender of the voice and the text to be spoken.

To do a simple conversion:

$audioOutput = 'audio-16khz-64kbitrate-mono-mp3'
$voiceAgent = 'JessaNeural'
$outputFile = 'C:\MCPMagTest.mp3'
$gender = 'Female'
$text = "Hello, you're now reading MCPMag."

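# {0} = gender, {1} = voice, {2} = text. The XML wrapper follows the standard
# SSML speak/voice layout; the en-US-<voice> name format is an assumption.
$ssml = @'
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
    <voice xml:lang='en-US' xml:gender='{0}' name='en-US-{1}'>
        {2}
    </voice>
</speak>
'@ -f $gender, $voiceAgent, $text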
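
Since SSML is XML, text containing characters such as & or < would produce an invalid request body if it were inserted as-is. A minimal sketch of one way to guard against that, using the .NET SecurityElement.Escape method; the function name ConvertTo-SsmlText is hypothetical:

```powershell
# Hypothetical helper: escape the characters that are special in XML
# (&, <, >, " and ') so arbitrary text can sit safely inside SSML.
function ConvertTo-SsmlText {
    param([Parameter(Mandatory)][string]$Text)
    [System.Security.SecurityElement]::Escape($Text)
}

ConvertTo-SsmlText -Text 'Q&A: is 3 < 5?'   # → Q&amp;A: is 3 &lt; 5?
```

To use it, pass (ConvertTo-SsmlText -Text $text) as the third argument to -f in place of $text.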

Once I set up all of the required input, I then need to craft the REST method specifying the authorization header key along with the X-Microsoft-OutputFormat header to define the type of output:

$params = @{
    'Headers'     = @{
        'X-Microsoft-OutputFormat' = $audioOutput
        'Authorization'            = "Bearer $token"
    }
    'ContentType' = 'application/ssml+xml'
    'Body'        = $ssml
    'Method'      = 'POST'
    # Replace eastus with the region of your own account
    'Uri'         = 'https://eastus.tts.speech.microsoft.com/cognitiveservices/v1'
    'OutFile'     = $outputFile
}
Invoke-RestMethod @params

You should now have an .MP3 file saved on your computer with Jessa telling you that you're now reading MCPMag.
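
If you'd rather confirm the result from the same session than go looking for the file, a quick check along these lines can help. This is only a sketch: the function name Test-AudioFile is hypothetical, and a non-empty file is a rough sanity check, not proof that the audio is valid.

```powershell
# Hypothetical helper: confirm that the file exists and is not empty.
# -and short-circuits, so Get-Item is never called on a missing path.
function Test-AudioFile {
    param([Parameter(Mandatory)][string]$Path)
    (Test-Path -LiteralPath $Path -PathType Leaf) -and
        ((Get-Item -LiteralPath $Path).Length -gt 0)
}
```

Calling Test-AudioFile -Path $outputFile after the conversion should return $true if the request wrote any audio data.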

About the Author

Adam Bertram is a 20-year veteran of IT. He's an automation engineer, blogger, consultant, freelance writer, Pluralsight course author and content marketing advisor to multiple technology companies. Adam also founded the popular TechSnips e-learning platform. He mainly focuses on DevOps, system management and automation technologies, as well as various cloud platforms mostly in the Microsoft space. He is a Microsoft Cloud and Datacenter Management MVP who absorbs knowledge from the IT field and explains it in an easy-to-understand fashion. Catch up on Adam's articles at adamtheautomator.com, connect on LinkedIn or follow him on Twitter at @adbertram or the TechSnips Twitter account @techsnips_io.