Grabbing the Attention of Siri and Alexa is Tougher Than You Think

By ICS Development Team Wednesday, February 28, 2018

Voice will soon be in every gadget. At least, that’s the vision a lot of industry watchers share. And, they’re probably right. Already, Alexa, Siri, Cortana and Google Assistant will happily listen to and act upon your every word. Who can resist that?

Certainly not consumers, and brands have noticed. That’s one reason a number of heavyweights, among them Starbucks and Nestlé, have developed their own brand-specific voice assistants and voice-skill apps. The goal is to ensure that the information served up by Alexa and her cronies includes them. It’s a clever marketing strategy, since the algorithms of most current-generation voice assistants offer only one or two product suggestions for every query.

To stay in the game, Nestlé created the GoodNes skill for Alexa, which offers consumers recipes and step-by-step cooking instructions. Starbucks developed My Starbucks Barista, which lets customers order and pay for their caffeine fix via voice commands using a chat-based protocol.

But for voice to reach its potential — to truly deliver great (even transformative) user experiences — companies developing these apps and voice-enabled devices need to tackle some tough technical challenges.

Yes, voice technology is mature enough to build amazing experiences. Still, it’s not that easy. Here’s why.

Voice Isn’t Just One Thing

It’s natural language processing. It’s machine learning. It’s APIs. And it’s myriad different platforms for voice interaction, which are constantly evolving.

When building a voice app, you have to understand all these technologies. Plus, you need to address the bigger picture. Does your app have to interact with your CRM? How about with your sensitive medical device or your in-vehicle infotainment system?

What else makes a great voice user experience difficult to create? Complex interactions. If you have only two models of a product to sell, voice interactions are pretty straightforward. But, what happens when you have several products and require an open-ended interaction? It’s a given that most of your customers will need personalized help geared toward their individual desires. That means the voice agent has to be designed to interact around a series of intentions, entities and interactions.

Voice might seem easy initially, but once you’re into the weeds the complexity becomes apparent.

Building a Voice Agent

To be effective, the agent must maintain the context of the conversation all the way through the interaction — much like a human interaction. To achieve this, the developer must carefully architect both intentions and entities to make sure the user feels like the agent is helping him or her specifically.

To illustrate this complexity, let’s build a voice app for a hypothetical product. Let’s say I run a mail-logistics company. I sell a tool that sorts your mail, recycling your junk mail and shredding pieces that carry your name or other identifying information. This tool attaches to your mail slot, saving you the time you’d otherwise spend sorting the mail by hand. (You’d buy that, wouldn’t you?)

The voice app is used as the customer service agent. You can ask about different products and the agent will help you identify and buy the right one — hopefully my mail sorter — and offer you help when you need it.

We will call our agent “Agent 99.” It’s Agent 99’s job to listen to you so that it may:

Understand your intentions
Collect your entities to better understand those intentions
Fulfill your requests
Learn for future interactions

One agent can be used through multiple platforms, such as voice, Slack or a website. Google, Apple and Amazon all sell devices that listen to you. They convert your spoken words to text and process the language using an agent.

Intentions capture what the user wants to do, and each one is tied to a specific response. For example, you might say, “I’m looking for a cheap junk mail shredder.” Or, “What’s an inexpensive shredder for junk mail?” From this, Agent 99 knows many things about your intentions.

For instance, it knows you:

Want to buy a junk mail shredder
Are not looking for repair services
Want to hear about low-cost options
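
Here is a minimal sketch of how an agent might map an utterance to an intention. It is a toy, not how any particular platform works: commercial assistants rely on trained language models, while this version simply compares word overlap (Jaccard similarity) against a few sample phrases. The intent names, the sample phrases for repairs and the 0.3 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical intents, each with a few sample phrases (assumed for this sketch).
TRAINING_PHRASES = {
    "buy_shredder": [
        "I'm looking for a cheap junk mail shredder",
        "What's an inexpensive shredder for junk mail?",
    ],
    "repair_shredder": [
        "My shredder is broken",
        "I need my mail sorter repaired",
    ],
}


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def detect_intent(utterance, min_score=0.3):
    """Return the best-matching intent name, or None if nothing is close enough."""
    words = tokenize(utterance)
    best_intent, best_score = None, 0.0
    for intent, phrases in TRAINING_PHRASES.items():
        for phrase in phrases:
            sample = tokenize(phrase)
            # Jaccard similarity: shared words divided by all distinct words.
            score = len(words & sample) / len(words | sample)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= min_score else None
```

Calling detect_intent("Is there a cheap shredder for junk mail?") picks the purchase intention rather than the repair one, while an unrelated question such as "What time is it?" falls below the threshold and returns None.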

Agent 99’s logical move is to reply enthusiastically, “We have many affordable options” (notice we changed the wording) and to offer you a few low-to-medium-cost shredders. But for a better customer experience, the agent needs to probe further. “What type of usage is the shredder for? Office or home?” Agent 99 must be engaging and helpful but not pushy.
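
Probing further only works if the agent remembers what it has already been told. The sketch below shows one way that conversational memory might look, assuming the agent wants two details (cost and usage) before it recommends anything. The Conversation class, the slot names and the price-range question are hypothetical; only the usage question comes from the example above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical details Agent 99 wants before recommending a product,
# each paired with the follow-up question that asks for it.
FOLLOW_UPS = {
    "cost": "What price range do you have in mind?",
    "usage": "What type of usage is the shredder for? Office or home?",
}


@dataclass
class Conversation:
    """Remembers what the customer has already said across turns."""

    slots: dict = field(default_factory=dict)

    def remember(self, **details: str) -> None:
        """Store details picked up from the latest utterance."""
        self.slots.update(details)

    def next_question(self) -> Optional[str]:
        """Return the next follow-up question, or None when nothing is missing."""
        for slot, question in FOLLOW_UPS.items():
            if slot not in self.slots:
                return question
        return None
```

After chat = Conversation() and chat.remember(cost="low"), the agent skips the price question and asks about usage. Once usage is known, next_question() returns None and the agent can move on to a recommendation without asking the same thing twice.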

An intention is mapped to an action and is processed for fulfillment. The best AIs use machine learning to get better at evaluating a situation. They use data they already know about you, as well as data they know about others like you, to help you.
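
The mapping from intention to action can be pictured as a simple lookup table. This is a rough sketch under stated assumptions: the intent names, handler functions and reply wording are invented for illustration, and a production agent would call out to inventory, payment or CRM systems rather than return canned strings.

```python
def recommend_shredders(params):
    """Hypothetical handler for a purchase intention."""
    if params.get("cost") == "low":
        return "We have many affordable options."
    return "Here are our most popular shredders."


def book_repair(params):
    """Hypothetical handler for a repair intention."""
    return "Sorry to hear that. Let's set up a repair."


def fallback(params):
    """Used when no intention was recognized."""
    return "Sorry, I didn't catch that. Could you rephrase?"


# Each intention name maps to the action that fulfills it.
ACTIONS = {
    "buy_shredder": recommend_shredders,
    "repair_shredder": book_repair,
}


def fulfill(intent, params):
    """Look up the action for an intention and run it with the collected details."""
    handler = ACTIONS.get(intent, fallback)
    return handler(params)
```

For example, fulfill("buy_shredder", {"cost": "low"}) returns the "affordable options" reply, and an unrecognized intention falls through to the fallback. Keeping the table separate from the handlers means new intentions can be added without touching the dispatch logic.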

They also use entities, which are like containers in that they hold similar types of data. They allow developers to add synonyms that help improve the customer experience. For instance, our junk mail sorter has the following entities:

Cost: “low cost,” “cheap,” “affordable,” “inexpensive”

Usage: “industrial,” “office,” “business”

Color: “green,” “bright,” “modern chic,” “retro”

Features: “super quiet,” “long life,” “cross blade”
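
To make the idea of entities concrete, here is a small sketch that pulls entity values out of an utterance using synonym lists like the ones above. It covers only a subset of the entities, and it treats the four cost phrases as synonyms for a single canonical value, "low". That grouping, along with the names ENTITIES and extract_entities, is an assumption for illustration rather than something a specific platform prescribes.

```python
import re

# Hypothetical entity catalog: entity -> canonical value -> phrases that mean it.
ENTITIES = {
    "cost": {"low": ["low cost", "cheap", "affordable", "inexpensive"]},
    "usage": {
        "industrial": ["industrial"],
        "office": ["office"],
        "business": ["business"],
    },
    "features": {
        "super quiet": ["super quiet"],
        "long life": ["long life"],
        "cross blade": ["cross blade"],
    },
}


def extract_entities(utterance):
    """Return {entity: canonical value} for the first match found per entity."""
    text = utterance.lower()
    found = {}
    for entity, values in ENTITIES.items():
        for canonical, phrases in values.items():
            # Whole-word match, so "office" does not fire inside "officer".
            if any(re.search(rf"\b{re.escape(p)}\b", text) for p in phrases):
                found[entity] = canonical
                break
    return found
```

With this in place, "cheap" and "inexpensive" both resolve to the same cost value, so the agent can treat the two example requests from earlier identically. Note that this sketch keeps only one value per entity; a customer who asks for both "super quiet" and "long life" would need a list instead.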

Training the Agent

To be helpful, Agent 99 needs to know the right thing to say. That takes training. Machine learning allows agents to continuously get “smarter” by learning from each conversation. The best platforms have a training mode that captures all the intents for which the agent wasn’t sure how to respond and uses those to teach. Default responses are helpful for conversation starters until the agent has amassed enough information to truly help the user.
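
A training mode like this can be approximated with a log of everything the agent could not answer. The sketch below is a simplified assumption of how that might work: it matches utterances by exact text, replies with a default response when it finds nothing, and keeps a count of the misses so a person can review the most common ones first. The class name, function names and default wording are hypothetical.

```python
from collections import Counter

# Hypothetical default reply used until the agent has learned a better one.
DEFAULT_RESPONSE = "I'm still learning. Could you tell me a bit more about what you need?"


class TrainingLog:
    """Collects utterances the agent could not handle, for later review."""

    def __init__(self):
        self.unmatched = Counter()

    def record(self, utterance):
        """Count one more occurrence of an unhandled utterance."""
        self.unmatched[utterance.strip().lower()] += 1

    def review_queue(self, limit=10):
        """Return the most frequent unhandled utterances, most common first."""
        return [text for text, _ in self.unmatched.most_common(limit)]


def respond(utterance, known_responses, log):
    """Answer from known responses, or log the miss and use the default reply."""
    key = utterance.strip().lower()
    if key in known_responses:
        return known_responses[key]
    log.record(utterance)
    return DEFAULT_RESPONSE
```

Whoever trains the agent can then work through review_queue(), attach each captured phrase to the right intention, and shrink the number of conversations that end in the default response.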

Some agents use small talk to make the application seem more human. The agent can be programmed to provide humorous responses to specific questions. (If you ask Siri if she’s a robot, she’ll respond with a little attitude: “I’m not sure what you’ve heard, but virtual assistants have feelings too!”)

While setting up agents is easy, the agents themselves can get quite complicated. That’s why implementing a tool to capture information from all of your users is essential, though it requires a good understanding of interactive architecture.

What’s the Takeaway?

Creating a great user experience with voice is complicated but essential. Take time to meticulously plan out your user journey and the underlying architecture of your data, and invest in your application before you get started so it can grow with you.

ICS develops voice applications with a strong focus on creating an exceptional user experience. If you’re thinking about building one and need an assist, get in touch.
