Benedict Evans wrote a great piece about voice interfaces in which he argues that voice isn’t quite the future of computer interfaces that we think it is.
He mentions a few constraints.
Given that you cannot answer any question, there is a second scaling problem – does the user know what they can ask? I suspect that the ideal number of functions for a voice UI actually follows a U-shaped curve: one command is great and ten is probably OK, but 50 or 100 is terrible, because you still can’t ask anything but also can’t remember what you can ask. The other end of the curve comes as you get closer and closer to a system that really can answer anything, but, again, that would be ‘general AI’.
The interesting implication here is that though with enough money and enough developers you might be able to build a system that can answer hundreds or thousands of different queries, this could actually be counterproductive.
Here he’s basically saying that the axe has a blade for a handle: the more work you get done (by adding more and more commands), the more damage you do to your hands (because nobody will be able to remember all those commands).
There’s a set of contradictions here, I think. Voice UIs look, conceptually, like much more unrestricted and general purpose interfaces than a smartphone, but they’re actually narrower and more single-purpose. They look like less friction than pulling out your phone, unlocking it, loading an app and so on, and they are – but only if you’ve shifted your mental model.
This is where I think he’s wrong. It’s not a matter of “if” we shift our mental models; it’s a matter of when.
As he talks about elsewhere in the piece, Google and others are working on populating their systems with hundreds or thousands of the most common commands, and ensuring that the responses to these are rewarding and useful.
That’s all that needs to happen. Alexa has a small number of commands it can do, but it can do them really well. And when it fails, it fails gracefully.
Siri is the opposite. It makes you feel like you can say anything, but half the time it doesn’t understand you, and the other half it does something stupid even when it does understand. This hurts people’s confidence in the voice interface, and keeps it from becoming their default.
Minimum viable confidence
There is a confidence number—let’s call it 90%—at which people will use voice for most things by default. Alexa is nailing the execution, but lacks the depth. Siri is lacking in both. So let’s say we’re somewhere around 60% right now on this confidence score.
All it will take to hit that magic 90% is for one or more companies to use the Alexa approach and competently solve the top n most common queries. And as Benedict said, that’s already being worked on.
Humans are relatively static animals, so the number of scenarios we’re talking about is relatively small. I have no idea what the actual number is, and I’m not sure anyone does, but I’m guessing it’s several hundred to a few thousand.
Once we hit that number—with Alexa-level quality—we will also hit the 90% confidence level that people have with voice interfaces, and it will then become a default for most daily tasks.
It’s true that some tasks don’t work well without a visual component that allows you to quickly scan lots of data and make a selection. Benedict gives the excellent example here of booking a flight. But I have a few counters for this point.
- Most day-to-day activities that you could use your assistant for aren’t in this category.
- There will be workarounds, such as detecting whether you have a display available and falling back to it gracefully when you do.
- Machine learning can make intelligent guesses about what you would have chosen if you’d had the visual interface.
- A hybrid voice/display interaction, available whenever a display is present, will make voice even more attractive. Gesture and eye-tracking tech will enhance this even further.
It’s only day zero
Finally, I think the most important thing he’s missing is the absolute Day Zero nature of our current offerings. Alexa wasn’t in homes three years ago. Five years ago, facial recognition, Go, and poker were untouchable by computers, and it was assumed that this would last for decades (or forever).
That was basically yesterday.
So it seems extraordinarily likely to me that mapping the top n daily human task requests—and reaching the “minimum viable confidence level” in voice interfaces—will happen within the next five years.
That is to say that people with Alexa-like devices (and perhaps their mobile devices as well) will have made the transition to voice-first as their default method of interacting with those systems, and that pushing and poking will be considered a fallback position when in a home setting.
We don’t need to solve every problem. We only need good enough. It’s a one or a zero: either we’ve hit the magic confidence requirement or we haven’t. And if we had Alexa-level responses to most everything we need during the day, that would get us there.
I take all Benedict’s points, but I think we’re so early in the game that we’ll find ways to address them using various techniques. I think voice becomes a natural and default interface for home-based computers within five years, and that mobile will come soon after.
- There is an interesting facet of this that will keep mobile push-and-poke around far longer: solitary mobile browsing probably makes up some massive percentage of total computing time. People sitting on a subway train aren’t going to be using voice. They’ll be pushing and poking just like they have been. So it’ll be interesting to see how people manage both paradigms in their minds, i.e., voice for issuing commands at home and while alone on mobile, but still manually swiping and typing when looking at Facebook, Reddit, and the like at work, in line at the market, etc.
- I’m also not sure the Uncanny Valley analogy works well here. To me that would apply if the responses returned by a voice interface were almost right, but not quite, and the effect produced an uncomfortable sensation in the user. An example might be giving back a perfect set of words, but with the wrong tone. Or saying something formal while using informal language. It’s basically a near-perfect execution that makes things worse because it’s so close but not quite there. So I don’t think the issue of not remembering what commands to use would apply.