Editor’s Note: This is a guest piece by Ahmed Bouzid, founder and CEO of Witlingo and former Head of Product at Amazon Alexa
I remember a rush of pure adrenaline upon my first encounter with the Amazon Echo in the fall of 2014, when one of the executives interviewing me for a job with the Amazon Alexa team demoed it for me in a small, sparse, typically frugal Amazonian private room, with two sofa seats and one small low table between them. The black Echo tower stood majestically on that small table, and the executive started interacting with it. As he invoked it and asked it to play some jazz (it played something by Wynton Marsalis), Arthur C. Clarke’s hackneyed quote, “Any sufficiently advanced technology is indistinguishable from magic,” came to mind.
Stunned by what I was witnessing, I quickly chased that thought away, feeling that the quote was not appropriate for the moment: there was nothing hackneyed or mundane about what I was seeing. This was a serious step forward, not just for voice but in the history of technology, I remember thinking. I also remember thinking that this executive in front of me, who had been pulled into the Alexa team from some other organization within Amazon just a few months earlier and who had zero background in speech or human language technology, probably had no clue about the significance of what his team had managed to create. But that thought, too, I quickly chased away as I started to talk to the thing, having a conversation that went back and forth over several turns. (I remember asking for some Mozart, then the weather, and then a couple of questions.)
Only someone like me (and there are many, many of us), who had labored and suffered for so long in the “Speech Technology” space, could truly appreciate that moment. Amazon had accomplished five feats in one fell swoop, and I wanted to stand up and just start applauding. But knowing that doing something so emotional and dramatic was just not the Amazonian way (being cold-blooded and unemotional is the Amazonian zeitgeist), I controlled myself: after all, I was interviewing for a job and, now, having seen the thing, I wanted in very badly.
Now, several years later, fully invested in the rapidly growing world of conversational voice first, with my work at Witlingo, my company, as well as my engagement in the Open Voice Network, I want to take a moment and give credit where credit is due. And I want to do this not only because it is the right thing to do but also because I want to make sure that when I express criticism of the Echo and the Alexa platform, my expressions are understood as coming from an admirer of what Amazon has accomplished, and not from a reactionary detractor.
Here are the five main feats that I believe the Amazon Echo was able to pull off and pull off majestically.
Conquering Far Field Speech

Speech recognition (taking spoken sound and converting it accurately to text) is a hard enough problem in itself. In fact, it is one of the hardest problems in AI, harder than, say, vision. Far Field speech is a whole new beast, an order of magnitude more difficult. Add to that three additional levels of complexity. (a) The conversational dimension: this is not dictation or command and control, but back-and-forth conversation. (I remember almost melting with admiration at the thought of whoever had decided that the Echo would not just answer questions and go quiet, which would have been staggeringly impressive in and of itself, but would also, in some instances, keep engaging with follow-ups.) (b) The ambient noise dimension: the thing was remarkably robust to background noise, which, for someone like me, who came from the world of telephony IVR, was impressive. And (c) the latency dimension: this is the part that just floored me. It responded almost instantly, a capability that to this day, five years and change on, I find miraculous.
Voice as an Interface and not a Feature
Before the Amazon Echo, my prior magical Voice First moment of bliss was the launch of Siri on October 4th, 2011, with the release of the iPhone 4S. Up to that point, the only way to use speech on your smartphone was through an app. The Siri app was such an app, and I had found it impressive in what it could do. But now, with Apple pulling the app (actually a small portion of the app) into the iPhone, a mainstream product that was going places, speech had finally arrived!
Yes, speech had finally arrived, and it was being welcomed into the house of the iPhone, where it was given a spacious room to lodge in, alongside the other rooms: the camera, the phone, the texting, and so on. That was a step up. Voice had been pulled into the house, so to speak; it had come in from the great outdoors, where it had been camping in its proverbial App Store tent. Voice was now respected: it was a full-on feature.
Then came the Echo with its black tower, where the only input from the user is mainly what the user says (plus the ability to mute the device by pressing a button), and the output is audio and a blue ring that lights up when it starts listening and goes off when it stops, with no screens to look at or touch. In this stark, spartan form factor, voice was no longer a mere feature: it was an interface. And not just an interface, but an interface that took voice seriously. Voice is about eyes-free, hands-free interaction: you use your voice when you can’t or don’t want to use your eyes or hands. Voice is about being able to ask, while potting your plant on a Saturday, what time Home Depot opens that morning; or asking who had the most home runs in the American League while in the middle of a conversation with your son; or asking for Miles Davis at the exact moment the mood hits you, when you are in the middle of writing a sentence and settled in a comfortable mood that you don’t want to disturb by hunting for your smartphone and risking distraction by, say, some beckoning Twitter badges that you know you wouldn’t resist delving into if you were to catch sight of them.
The Emasculation of Nuance
The arrival of the Amazon Echo meant the end of an era: the era of a bully company called Nuance wantonly abusing the many small fish that swam within the small pond of speech technology in which it lurked like a shark. Nuance (which, if you look under the hood, is really ScanSoft, a company that in 2005 acquired a great company called Nuance, with SRI roots dating back to 1994, jettisoned that company’s elegant, cutting-edge technology, but donned its great name; whoever came up with that name should be added to the branding hall of fame) was infamous for three things: (a) It wanted to dominate the speech space by any means possible: not only by assiduously working to deliver a better product (which it never did do assiduously), but also by buying out companies (Voicebox 2018, Varolii 2013, Ditech Networks 2012, Vlingo 2011, Loquendo 2011, PerSay 2010, Spinvox 2009, BeVocal 2007, to name a few) and by suing startups that it couldn’t buy or didn’t want to buy (M*Modal 2017, Vlingo 2008, TellMe 2006), with its goal being, of course, the ability to charge a premium for the ports it was selling. (b) As a result, clients had very few choices when it came to purchasing business-grade speech recognition and voice browsing (VoiceXML browsers) for telephony IVR. I personally remember being part of negotiations in early August 2011, on behalf of a company I worked at (which was itself later acquired by contact center giant Genesys), to purchase 1,000 ports, a million-dollar deal at the very least; the finalists for that bid were Nuance and Loquendo.
The negotiations were proceeding well enough on both fronts, with Nuance and with Loquendo, until suddenly Loquendo stopped answering emails and phone calls. They stopped answering a potential buyer who was seriously considering spending at least $1 million with them, and probably a lot more in the years ahead. Strange! But not so strange when, a couple of weeks later, the announcement dropped that Loquendo had been acquired by Nuance for $75.5 million. Facing deadlines and project and product schedules, we, the buyers, had no choice but to sign on the dotted line, violated and abused as we felt doing so. (c) It competed against the very clients to whom it sold its technology: that’s right, Nuance sold speech technology (speech recognition and text to speech, as well as a VoiceXML browser) to its partners, who used these technologies to deliver solutions (for instance, building an IVR for the IRS to help offload some of the questions coming in at peak times), but it also itself bid on contracts to be a direct solution provider (for instance, serving the IRS directly). And since it controlled the pricing of the software it sold, there was no contract it could not win if price was a determining factor (and it often, but not always, was).
The arrival of the Amazon Echo, and later Google Home, has not put a complete stop to all of that destructive behavior, but it has stemmed it and created some breathing space for companies to build speech solutions without constantly looking over their shoulders, or refraining from building in the first place. Nuance can’t afford to take Amazon to court, and it can’t afford to buy it. And so it needs to co-exist and compete. (That the speech division of Nuance is still in existence, to be perfectly honest, baffles me. Their technology is just not up to par. A lot of money is being left on the table, there for the taking by an energetic and ambitious salesperson.)
Pulling in the Humanities
By making the delivery of voice-based conversational experiences a reality, the Amazon Echo has opened up a whole new world of possibilities for people in disciplines that had no role in the product delivery stack prior to its arrival. Now English majors, linguists, musicologists, performance artists, anthropologists, and neuroethicists, among others from the arts, the humanities, and the social sciences, need to be pulled in to help innovate. Current UX designers and product managers with no experience in voice simply do not have what it takes to begin thinking about how to design voice-based experiences. They have lived all of their lives in the visual-tactile world, and the delta between a voice-conversational product and a visual-tactile product is significant. People who understand language and how humans engage with other humans are needed, and needed badly, if we are going to move beyond the loathsome, highly ineffectual IVRs that continue to populate our world today.
Kicking Off the Privacy Conversation
Last, and we are only in the early phases of this development, the emergence of the Amazon Echo, with its bold, daring, but honestly open and on-the-record declaration that in order to do its job the device needs to be listening all the time, has helped spark a debate about privacy that thankfully remains unresolved and has not tipped the balance permanently in favor of convenience. Most people have accepted the Echo into their homes and use it daily (many having bought several and populated their homes with them), having long ago decided that the jig was up and that they were surrendering themselves to the delights of convenience, come what may (que sera, sera!) as far as privacy was concerned in our brave new world. But thankfully, a non-trivial portion of the population has reacted to the Echo with a healthy: “What? You want me to bring a speaker into my house and do what? Let it spy on me and listen to what I am saying all day long? Are you nuts?”
I say this is healthy and a positive contribution to the debate because the obvious retort to that snorting reaction is: your smartphone is also listening to you all the time in exactly the same way — and worse, in fact, since it is with you all the time: it is listening to you in the home (including the bathroom), in the car, in the office, at the doctor’s office, when you are consulting with your lawyer, and so forth. If you want to worry about the Echo and other smart speakers, that is a good thing. But let’s expand our worrying to all of these devices, and let’s talk not just about your loss of privacy through what the device is listening to, but also about how you are making yourself vulnerable in the many, many non-voice ways that a smartphone gives you to share your information.
And now, on to prodding the Echo and its many other incarnations to solve the many problems that they have yet to solve. I will delve into some of that in a future article, tantalizingly titled: “The Seven Scandals of Conversational Voice First.”