How Speech Synthesis is Helping Alexa Grow Up

By ICS Development Team Tuesday, April 24, 2018

Do you pay attention to every word someone else says? Or even every word your home device says? Probably not. As humans, we listen for contextual clues to tell us when we need to pay closer attention. But how does your chatbot know to do that?

Think about it. Alexa is still a young girl, er chatbot, without the sophisticated communication skills of an adult. She’s relying on us — developers — to offer the guidance and direction she needs to elevate her listening, comprehension and speaking skills. For instance, she needs a little help understanding verbal emphasis and a speaker’s pauses.

If you want to give Alexa a more conversational tone with custom speech interactions, here are some ways you could customize your voice application.

The Power of SSML

You can use Speech Synthesis Markup Language (SSML), which the Alexa Skills Kit supports, to help Alexa take her interactions to the next level, for instance customizing her speech to make it “more friendly.” SSML allows you to make Alexa sound more human, in part by emphasizing what is important to your interaction. The Alexa Skills Kit supports a subset of the tags defined in the SSML specification, as well as Amazon-specific tags such as “amazon:effect”.
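To make this concrete, here is a minimal Python sketch of how SSML might travel back to Alexa from a custom skill. It assumes the JSON response shape with an "outputSpeech" object of type "SSML"; the function name build_ssml_response is my own invention for illustration, not part of any SDK.

```python
import json


def build_ssml_response(ssml_body, end_session=True):
    """Wrap SSML markup in the JSON shape a custom skill returns (assumed shape)."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                # The other documented type is "PlainText", which uses a "text" field.
                "type": "SSML",
                "ssml": f"<speak>{ssml_body}</speak>",
            },
            "shouldEndSession": end_session,
        },
    }


response = build_ssml_response('Jenny <break time="1s"/> I got your number')
print(json.dumps(response, indent=2))
```

If you use one of the Alexa SDKs, its response builder will likely do this wrapping for you, so treat the sketch as a picture of the payload rather than code you need to ship.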

Here are some examples of how to structure conversational elements via Amazon’s interpretation of SSML (breaking up the text, adding timing and emphasis), using that '80s classic 867-5309/Jenny.

Code Examples

Break
Break is great when you need a pause in the conversation, either to give your user time to think about something you just said or to get them ready to receive important choices. For instance:

<speak>
Jenny who can I turn to<break time="1s"/>
You give me something I can hold on to<break time="1s"/>
Now you think I'm like the others before
</speak>
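If you generate responses in code, a small helper can keep pause lengths consistent. The sketch below is one possible approach in Python; the names pause and with_pauses are hypothetical, and the 10-second ceiling is my understanding of Alexa's limit, so check it against the current reference.

```python
MAX_BREAK_MS = 10_000  # assumed upper limit for a single break; verify in the Alexa docs


def pause(milliseconds):
    """Return a <break> tag for the given pause length in milliseconds."""
    if not 0 <= milliseconds <= MAX_BREAK_MS:
        raise ValueError(f"pause must be between 0 and {MAX_BREAK_MS} ms")
    return f'<break time="{milliseconds}ms"/>'


def with_pauses(lines, milliseconds=1000):
    """Join lines with the same pause between each pair, wrapped in <speak>."""
    return "<speak>" + pause(milliseconds).join(lines) + "</speak>"


lyrics = ["Jenny who can I turn to", "You give me something I can hold on to"]
print(with_pauses(lyrics))
```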

Sentence

Sentence is another way to add natural pauses to your conversation. It feels more natural to use sentence than to add breaks, but you’ll need to let Alexa’s defaults control the timing of your interaction. You can wrap your text in “<s>” or you can end with a period. Same thing. As a developer I prefer to see the formatting of the “<s>” wrapper to clearly show the distinction, though either approach works.

<speak>
    <s>Who saw your name and number on the wall</s>
    <s>Jenny I got your number</s> 
    I need to make you mine.
</speak>
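When the text comes from data rather than from a hand-written string, it is worth escaping it before wrapping it in tags, since a stray ampersand makes the markup invalid XML. Here is a rough Python sketch; the helper name sentences and the second sample line are made up for the example.

```python
from xml.sax.saxutils import escape


def sentences(lines):
    """Wrap each line in <s> tags, escaping &, < and > in the text."""
    body = "".join(f"<s>{escape(line)}</s>" for line in lines)
    return f"<speak>{body}</speak>"


# The second line is an invented sample that contains a character needing escape.
print(sentences(["Jenny I got your number", "Tommy & Jenny"]))
```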

Say-As

Say-as gives your text a format like a telephone number, fraction or spell-out (Y-E-S, you can do that). This way, you don’t have to write out “eight-six-seven-five-three-oh-nine.”

<speak>
   Jenny don't change your number.
   <say-as interpret-as="telephone">86753</say-as>oh <say-as interpret-as="telephone">9</say-as>
</speak>

Note that I broke out the zero since the singer says “oh” rather than “zero.” When dealing with real data it's important to use the appropriate data format so it can be processed correctly. See the Alexa reference for more information on say-as.
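The "oh" trick above can be automated if you ever need it for more than one number. The following Python sketch is only an illustration: say_as and sung_phone_number are hypothetical helpers, and SAY_AS_TYPES lists just a few interpret-as values I am confident about, not the full set Alexa accepts.

```python
import re

# A partial list of interpret-as values; the Alexa reference has the full set.
SAY_AS_TYPES = {"telephone", "digits", "cardinal", "ordinal", "fraction", "spell-out", "characters"}


def say_as(text, interpret_as):
    """Wrap text in a <say-as> tag after checking the interpret-as value."""
    if interpret_as not in SAY_AS_TYPES:
        raise ValueError(f"unknown interpret-as value: {interpret_as}")
    return f'<say-as interpret-as="{interpret_as}">{text}</say-as>'


def sung_phone_number(number):
    """Render digit runs as telephone chunks, speaking each 0 as the word 'oh'."""
    parts = []
    for chunk in re.split(r"(0)", number):
        if chunk == "0":
            parts.append("oh")
        elif chunk:
            parts.append(say_as(chunk, "telephone"))
    return " ".join(parts)


print(sung_phone_number("8675309"))
```

For real phone numbers you would normally pass the whole number to a single telephone say-as and let Alexa decide how to read it; the split is only there to match the song.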

Prosody

Prosody modifies the rate, pitch, and volume of the tagged speech.

<speak>
    <p>
        I got it <break time=".5s"/>  <prosody volume="x-loud" rate="slow">I got it. </prosody>I got it.
        I got your number on the wall.
        I got it <break time=".5s"/>  <prosody volume="x-loud" rate="slow">I got it. </prosody>I got it.
    </p>
</speak>
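Because prosody takes three independent attributes, it is easy to mistype a value. One way to guard against that is a helper that only accepts the named presets. This is a sketch under my own assumptions: the helper name prosody is hypothetical, and the sets below cover only the named values (Alexa also accepts percentages and decibel offsets, which this sketch ignores).

```python
RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}
PITCHES = {"x-low", "low", "medium", "high", "x-high"}
VOLUMES = {"silent", "x-soft", "soft", "medium", "loud", "x-loud"}


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in <prosody>, emitting only the attributes that were given."""
    attrs = []
    for name, value, allowed in (
        ("rate", rate, RATES),
        ("pitch", pitch, PITCHES),
        ("volume", volume, VOLUMES),
    ):
        if value is None:
            continue
        if value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
        attrs.append(f'{name}="{value}"')
    if not attrs:
        return text  # nothing to change, so leave the text unwrapped
    return f"<prosody {' '.join(attrs)}>{text}</prosody>"


print(prosody("I got it.", rate="slow", volume="x-loud"))
```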

Paragraph

When my son was young he wrote everything as one long paragraph. Great for James Joyce. Not great for a conversation with a chatbot. Using paragraphs to add spacing makes chatbot conversation more realistic. (Our song doesn't have any long paragraphs so I’m using a few lines like a paragraph for this example.)

<speak>
        <p>Jenny Jenny <break time=".5s"/> Who can I turn to? <break time=".5s"/><say-as interpret-as="telephone">86753</say-as>oh <say-as interpret-as="telephone">9</say-as></p>       
        <p>For the price of a dime I can always turn to you <break time=".5s"/><say-as interpret-as="telephone">86753</say-as>oh <say-as interpret-as="telephone">9</say-as></p>
</speak>
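Once you start nesting breaks and say-as tags inside paragraphs, it is easy to leave a tag unclosed. A quick sanity check is to parse the markup as XML before sending it. The Python sketch below uses the standard library's ElementTree; check_ssml is a hypothetical name, and the check only covers well-formedness, not whether Alexa supports each tag.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml):
    """Parse SSML as XML and return the tag names in document order."""
    root = ET.fromstring(ssml)  # raises ET.ParseError on malformed markup
    if root.tag != "speak":
        raise ValueError("SSML must have <speak> as its root element")
    return [element.tag for element in root.iter()]


ssml = (
    "<speak>"
    '<p>Jenny Jenny <break time=".5s"/> Who can I turn to?</p>'
    "<p>For the price of a dime I can always turn to you</p>"
    "</speak>"
)
print(check_ssml(ssml))
```

One caveat: prefixed tags such as amazon:effect are not declared as XML namespaces in typical SSML strings, so a plain XML parser will likely reject them without extra handling.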

Emphasis

Emphasis alerts the user that it is important to pay attention to specific words. You can amplify or minimize the level of emphasis by choosing strong, moderate or reduced. I prefer moderate, which is the default.

<speak>
    You don't know me, but you make me so <emphasis level="moderate">happy</emphasis>
    I tried to call you before but I lost the nerve
    I tried my <emphasis level="moderate">imagination</emphasis> but I was disturbed
</speak>
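If you know ahead of time which words matter (a product name, a deadline), you could tag them automatically instead of by hand. Here is one possible Python sketch using a regular expression; emphasize is a hypothetical helper, and it assumes the input text contains no markup yet.

```python
import re

LEVELS = {"strong", "moderate", "reduced"}


def emphasize(text, words, level="moderate"):
    """Wrap each whole-word occurrence of the given words in <emphasis>."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {sorted(LEVELS)}")
    if not words:
        return text
    pattern = r"\b(" + "|".join(re.escape(word) for word in words) + r")\b"
    return re.sub(pattern, rf'<emphasis level="{level}">\1</emphasis>', text)


print(emphasize("you make me so happy", ["happy"]))
```

Use it sparingly. If every other word is emphasized, nothing is.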


W (yes, the letter W as in George)

While not applicable to our song (you're still singing it in your head, aren't you?), we do need to know how to differentiate between pronunciations of words that are spelled the same. For example, how does Alexa deal with the word lead? Is it the heavy metal or someone in charge, like the lead actor in the latest zombie apocalypse flick? Similar to say-as, W allows you to specify the word's part of speech:

Interpret the word as a noun (person in charge) /lēd/
Interpret the word as a past participle (led)
Interpret the word as a noun (the metal) /led/

Alexa does not care what part of speech it is. She just wants to know how to correctly pronounce it. We as humans understand the part of speech by the context, but she doesn't have this luxury.
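As a rough illustration of what the W tag looks like in practice, here is a Python sketch that builds it. The role values below are the ones I understand the current Alexa reference to list (older material may show a different prefix), the friendly keys and the helper name word are my own, and the sample sentence is invented. Which pronunciation each role yields for "lead" is something to confirm by ear in a simulator rather than take from this sketch.

```python
# Role values as I understand the Alexa SSML reference; verify before relying on them.
WORD_ROLES = {
    "verb": "amazon:VB",
    "past_participle": "amazon:VBD",
    "noun": "amazon:NN",
    "alternate_sense": "amazon:SENSE_1",  # the non-default meaning of a word
}


def word(text, role):
    """Wrap a word in a <w> tag with the role for the given friendly name."""
    return f'<w role="{WORD_ROLES[role]}">{text}</w>'


ssml = f"<speak>She took the {word('lead', 'noun')} in the film.</speak>"
print(ssml)
```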

Dialect

You can customize by location so that, for instance, users in London hear Alexa speak with a British accent and not some American twang.
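The accent itself comes from the locale your skill and the user's device are set to, but you may also want the wording to match. Below is a minimal Python sketch that picks a phrase by locale, assuming the incoming request JSON carries a "locale" field inside its "request" object. The greetings and the name greeting_for are made up for illustration.

```python
# Hypothetical phrasing per locale; the keys are standard locale codes.
GREETINGS = {
    "en-GB": "Fancy a song?",
    "en-US": "Want to hear a song?",
}


def greeting_for(request_envelope, default_locale="en-US"):
    """Pick locale-appropriate wording from the request's locale field."""
    locale = request_envelope.get("request", {}).get("locale", default_locale)
    text = GREETINGS.get(locale, GREETINGS[default_locale])
    return f"<speak>{text}</speak>"


print(greeting_for({"request": {"locale": "en-GB"}}))
```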

Additional Customization Options

While you can create many of the same interactions and customizations described in this blog using Apple and Google, Amazon offers some additional features that let you further customize Alexa. For instance, check out Amazon's speechcons, special words and phrases that Alexa pronounces more expressively. Amazon also offers a host of interesting sound libraries.
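For a feel of how these extras fit into a response, here is a small Python sketch. As I understand it, speechcons are written with say-as and an "interjection" interpretation, and sound clips use the audio tag. The helper names are hypothetical, the clip URL is a placeholder rather than a real sound library entry, and "abracadabra" is assumed to be on the speechcon list for your locale.

```python
from xml.sax.saxutils import quoteattr


def speechcon(phrase):
    """Mark a phrase as a speechcon so Alexa reads it expressively."""
    return f'<say-as interpret-as="interjection">{phrase}</say-as>'


def audio(src):
    """Return an <audio> tag with the source URL safely quoted."""
    return f"<audio src={quoteattr(src)}/>"


CLIP_URL = "https://example.com/audio/phone-ring.mp3"  # placeholder, not a real clip
ssml = f"<speak>{audio(CLIP_URL)} {speechcon('abracadabra')}! I got your number.</speak>"
print(ssml)
```

Alexa places requirements on audio files (hosting, format, length), so check the reference before pointing at your own clips.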

It’s an exciting time to work with voice apps. As they mature and more development tools become available, your opportunities for enhancing your chatbots — making them even more lifelike — are accelerating. As you help guide Alexa from adolescence to adulthood, remember this: it’s our job as developers to overcome the technology challenges and deliver code that best represents a client’s brand.

Read more about Alexa and voice app development here.

External Links:

Link to Amazon’s Developer Portal Reference

Link to SSML W3C recommendations