Coding an Amazon Alexa skill: Tips & tricks for game development

Amazon’s Alexa is celebrating her second birthday, and there are over 8 million Alexa devices now in American homes. The UK and Germany are joining this market and Alexa is leaving Apple’s Siri and Microsoft’s Cortana behind. Now Amazon is looking for developers: there are 7,000 skills available in the Alexa skill store but most of them are simple gimmicks. There is nothing that sells a device better than games, so that’s what Amazon wants now.

For indie game developers, this is wonderful. Voice-based gaming is a whole new user interface metaphor, which presents a fantastic challenge. Which types of games will work, and which won’t? What are the limits of Alexa as a gaming platform? Is Alexa’s voice recognition accurate enough to offer a good gaming experience? In this article, I’ll share a few important things I’ve learned developing an adventure game engine for Alexa.

1. Your maximum response length is 90 seconds

If you’re doing an audio adventure similar to a radio play, there’s a hard time limit for every bit of dialogue you want to present. Everything you have Alexa say must fit within 90 seconds for each response. This includes all text that Alexa must speak, pauses, and embedded audio MP3 files. Anything more will be cut off. The developers of The Baker Street Experience found this out the hard way and ended up restructuring all their dialogue – after the voice actors had already done their job.

For a voice-based game platform, it seems 90 seconds is a reasonable length for a response. Players will not want to listen to 4 to 5 minutes of dialogue without any interaction; it would be the same as listening to an audio book. If you are designing a game with lots of dialogue, bear this limit in mind and cut your scenes up into smaller bits, and require the player to interact with your game to move between the scenes. You’ll end up adding more choices to your game, but that’s probably a good thing.
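One way to cut a long script into scenes is to estimate how long each piece takes to speak and start a new scene before the budget runs out. The following is only a rough sketch: the function names and the speaking rate of 2.5 words per second are my own assumptions, not Alexa figures, so time the real output on a device before trusting the numbers.

```python
import re

WORDS_PER_SECOND = 2.5  # assumed speaking rate, not an official Alexa value
MAX_SECONDS = 90        # per-response limit described in this section


def estimate_seconds(text):
    """Rough speaking-time estimate based on the word count alone."""
    return len(text.split()) / WORDS_PER_SECOND


def split_into_scenes(script, max_seconds=MAX_SECONDS):
    """Group whole sentences into scenes that each fit the time budget."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    scenes, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and estimate_seconds(candidate) > max_seconds:
            # Adding this sentence would overrun, so close the scene here.
            scenes.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        scenes.append(" ".join(current))
    return scenes
```

A single sentence that is longer than the budget still ends up as its own scene, and the estimate ignores pauses and embedded audio, so leave yourself a generous margin.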

2. After 16 seconds of silence, your game dies

After Alexa has finished her response text and other audio, the player is required to answer within 8 seconds. After that, Alexa will remind the player that she’s waiting for a response by speaking a reprompt (an additional bit of text you can specify in your code). Alexa will then wait another 8 seconds. If she receives no response within that time, she’ll end the skill without a message (except a soft “ding” sound).
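The reprompt is set in the JSON your skill returns, next to the main speech. Here is a minimal sketch of such a response built by hand; the helper name `build_response` and the sample texts are hypothetical, and the field names follow the custom skill response format as I understand it, so check them against the current reference.

```python
import json


def build_response(speech_ssml, reprompt_ssml, session_attributes=None):
    """Build a response that keeps the session open and sets one reprompt."""
    return {
        "version": "1.0",
        "sessionAttributes": session_attributes or {},
        "response": {
            # Spoken immediately.
            "outputSpeech": {"type": "SSML", "ssml": speech_ssml},
            # Spoken once if the player stays silent after the first wait.
            "reprompt": {
                "outputSpeech": {"type": "SSML", "ssml": reprompt_ssml}
            },
            # False keeps Alexa listening for the player's answer.
            "shouldEndSession": False,
        },
    }


if __name__ == "__main__":
    response = build_response(
        "<speak>You are in a dark cellar. Where do you go?</speak>",
        "<speak>You can go north or south.</speak>",
        {"scene": "cellar"},
    )
    print(json.dumps(response, indent=2))
```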

This is hard-coded and intentional. According to Amazon, Alexa works this way to prevent skills from listening in on device owners for extended periods of time, which could be a security issue. You’re going to have to develop your game to account for this fact. It is also worth noting that Alexa will not listen for the player’s response while she is still speaking. We did an experiment where we would provide a short, 5-second response, then add 60 seconds of silence to see if we could stretch the time available to the player to respond, but with no success. As long as Alexa is still outputting her response, whether it contains silence or not, she won’t accept any voice input. It is also not possible to have Alexa reprompt more than once, because this would open up another security flaw: the additional reprompts could be silences, and Alexa would still be listening in on users for an extended time.

It’s best to prepare your game for this behavior by making its startup text very short. Players will need to invoke your skill again every time it ends prematurely, and it would be boring to have to sit through a long introductory text each time. An even better solution is to store the game state in a database after every interaction. Whenever your skill is restarted, you can look in the database to see if there is a saved position and restore it immediately, simply prefixing “Let’s continue on…” or similar to whatever Alexa speaks next. That way, players can reinvoke the skill and pick up where they left off without losing any time.
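The save-and-restore idea can be sketched in a few lines. This example uses SQLite from the Python standard library purely as a stand-in so that it runs anywhere; a deployed skill would more likely use a hosted database. The table layout, the function names, and the `scene` key are all hypothetical.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the save-game table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saves ("
        "user_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_state(conn, user_id, state):
    """Overwrite the player's saved position after every interaction."""
    conn.execute(
        "INSERT OR REPLACE INTO saves (user_id, state) VALUES (?, ?)",
        (user_id, json.dumps(state)),
    )
    conn.commit()


def load_state(conn, user_id):
    """Return the saved state, or None for a first-time player."""
    row = conn.execute(
        "SELECT state FROM saves WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def opening_line(conn, user_id, intro_text, scene_texts):
    """Pick what Alexa says on launch: the intro, or a resumed scene."""
    state = load_state(conn, user_id)
    if state is None:
        return intro_text
    return "Let's continue on. " + scene_texts[state["scene"]]
```

Calling `save_state` at the end of every request handler means a timed-out session costs the player nothing more than saying the invocation phrase again.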

3. Alexa does not pronounce everything correctly

While Alexa does a very good job of converting your text to speech, she won’t be able to pronounce everything correctly, especially rarely used words and foreign words. It is important that you listen to all game text as spoken by Alexa and make corrections. If you find a word she doesn’t pronounce well, you can spell out its pronunciation in SSML using the phoneme tag:

<speak> You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. 
I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>. </speak>

Alexa also has a habit of leaving too short a pause between words. You can fix this by inserting an artificial pause:

<speak>Hello <break time="50ms"/> world.</speak>
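If your game text lives in code, small helpers can generate these tags so you don't have to type the markup by hand each time. This is just a sketch; the helper names `phoneme`, `pause`, and `speak` are my own and not part of any Alexa library.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word, ipa):
    """Wrap a word in a phoneme tag carrying its IPA pronunciation."""
    return "<phoneme alphabet=\"ipa\" ph=%s>%s</phoneme>" % (
        quoteattr(ipa),
        escape(word),
    )


def pause(milliseconds):
    """Insert an artificial pause of the given length."""
    return '<break time="%dms"/>' % milliseconds


def speak(*parts):
    """Join already-formatted parts into one SSML document."""
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    print(speak("Hello", pause(50), "world."))
    print(speak("You say,", phoneme("pecan", "pɪˈkɑːn") + "."))
```

Note that `speak` expects its parts to be valid SSML already, so plain text containing characters such as `&` or `<` should be passed through `escape` first.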

4. You cannot alter Alexa’s voice

Alexa comes with one voice only, and that’s something you can never change. She’ll never speak with a man’s voice, for one thing. Interestingly, she does use different accents for the US market and the UK market, and she obviously speaks German for the German market. She’ll speak more languages yet as Amazon expands into more language-specific markets.

Still, it would be nice if you could alter other aspects of Alexa’s voice, like speaking faster or slower, higher or lower, or with an echo. All of this, sadly, is impossible at this point. If you want specific voices, the only thing you can do is add audio samples to Alexa’s responses, or replace her responses with audio samples entirely. Radio play style games tend to do this. As a developer, you’ll have to hire voice actors for this, but it adds a new dimension to your game. There are downsides to this approach as well. With prerecorded answers, it will be difficult (or very expensive) to add many different or varying responses to your game, which is something you can fairly easily do with text converted to speech by Alexa. With voice actors, therefore, your game will likely be limited to a short story with up to 100 responses or so, whereas with text, the sky is the limit.

Note, also, that any audio files you want Alexa to play must be hosted by you on an SSL-enabled server (Amazon does not host them for you, although hosting them yourself on S3 is possible), encoded at a bitrate of 48 kbps and a sample rate of 16000 Hz. The resulting audio files are of less-than-excellent quality for music, but OK for voices.
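Audio samples are embedded in a response with the SSML audio tag. The sketch below builds that tag and rejects URLs that would obviously break the hosting rules mentioned above; the helper name `audio_tag` and the example URL are hypothetical, and the check cannot verify the bitrate or sample rate of the file itself.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import quoteattr


def audio_tag(url):
    """Return an SSML audio tag for an MP3 hosted on an HTTPS server."""
    parts = urlparse(url)
    if parts.scheme != "https":
        raise ValueError("audio must be served over HTTPS: %s" % url)
    # Assumption taken from this article: embedded audio is MP3.
    if not parts.path.lower().endswith(".mp3"):
        raise ValueError("audio file should be an MP3: %s" % url)
    return "<audio src=%s/>" % quoteattr(url)


if __name__ == "__main__":
    clip = audio_tag("https://example.com/audio/door-creak.mp3")
    print("<speak>You push the door open. " + clip + "</speak>")
```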

What you can do, and I do so in one game, is manipulate the actual text that Alexa speaks using a couple of gimmicks. Alexa has a robotic voice, so why not use this to your advantage? If Alexa is telling the story as a robot, you could have her stumble over some words and repeat them (as if a couple of her routines had backfired), or insert random short pauses in her text. It adds an interesting dimension.
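The robot gimmick is easy to automate. Here is a minimal sketch that randomly repeats words or adds short pauses to a line of plain text; the function name `glitch`, the probabilities, and the pause lengths are arbitrary choices of mine, so tune them by ear.

```python
import random
from xml.sax.saxutils import escape


def glitch(text, repeat_chance=0.1, pause_chance=0.1, seed=None):
    """Make plain text sound like a malfunctioning robot, as SSML."""
    rng = random.Random(seed)  # pass a seed to get repeatable output
    words = []
    for word in escape(text).split():
        roll = rng.random()
        if roll < repeat_chance:
            # Stumble: say the word, hesitate, say it again.
            words.append(word + ' <break time="200ms"/> ' + word)
        elif roll < repeat_chance + pause_chance:
            # Hesitate briefly after the word.
            words.append(word + ' <break time="300ms"/>')
        else:
            words.append(word)
    return "<speak>" + " ".join(words) + "</speak>"


if __name__ == "__main__":
    print(glitch("All systems are functioning within normal parameters."))
```

Keep the probabilities low: an occasional stumble is charming, but if every sentence glitches the text becomes hard to follow and eats into your 90 seconds.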