I read and think a lot about how humans interact with computers, and what that interaction will look like at various points in the future.
I was going to call this a hierarchy of human-to-computer interfaces, but quickly realized that it’s not a hierarchy at all. To see what I mean, let’s explore them:
Input is what most people think of when they think “interface”, i.e., how you interact with the computer.
- Manual Physical Interaction: original key-based keyboards, physical switches, etc.
- Manual Touchscreen Interaction: smartphones, tablets, etc.
- Natural Speech (Voice): Siri, Google Assistant, Alexa
- Natural Speech (Text): messaging, chatbots, etc.
- Neural: You think, it happens. No mainstream examples yet exist.
A key part of that interaction, however, is output: how the computer returns content or additional prompts to the human, which then leads to additional inputs.
- Physical or Projected 2D Display: standard computer monitor, LCD/LED display, projectors, etc.
- Physical or Projected 3D Display: augmentation of vision using glasses, or projection effects that emulate three dimensions.
- Audible: The computer tells you its output.
- Neural Sensory: You “see” or “hear” what’s being returned, but it skips your natural hardware of eyes and ears.
- Neural Direct: You receive the understanding of having seen or heard that content, but without having to parse the content itself (NOTE: I’m not sure if this is even possible).
Technology limitations vs. medium limitations
Given our current technology levels, we’re still working with Manual Touchscreen Interaction and Display output for the most part, and we’re just starting to get into Voice input and output.
But as I mentioned above, this isn’t a linear progression. Voice isn’t always better than a visual display for presenting information to humans, or even for humans giving input to computers.
Benedict Evans has a great example:
@Rotero try choosing a flight on the phone
— Benedict Evans (@BenedictEvans) February 5, 2017
My favorite example is Excel. Imagine working with a massive dataset like so:
Read row one-thousand forty-three, column M…
…and your dataset has 300 thousand rows and 48 columns. Seeing matters in this case, and voice might be able to help in some way, but it won’t replace the visual. It simply can’t because of bandwidth limitations. When you look at a 30″ monitor with massive amounts of data on it you can see trends, anomalies, etc.
And that doesn’t even include the concept of visuals like graphs and images that can convey massive amounts of information very quickly to the human brain. Voice isn’t ever going to compete with that in terms of efficiency, and that’s not a limitation of technology. It’s just how the brain works.
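To put a rough number on the bandwidth point, here is a back-of-the-envelope sketch for the spreadsheet example. The row and column counts come from the example above; the speaking rate of one cell per second is purely my assumption, and a generous one.

```python
# Back-of-the-envelope: how long would it take to hear a whole dataset?
ROWS = 300_000          # from the spreadsheet example
COLUMNS = 48            # from the spreadsheet example
SECONDS_PER_CELL = 1.0  # assumption: one cell read aloud per second


def hours_to_read_aloud(rows: int, columns: int, seconds_per_cell: float) -> float:
    """Return the hours of nonstop speech needed to read every cell once."""
    return rows * columns * seconds_per_cell / 3600


cells = ROWS * COLUMNS
hours = hours_to_read_aloud(ROWS, COLUMNS, SECONDS_PER_CELL)
print(f"{cells:,} cells, about {hours:,.0f} hours ({hours / 24:,.0f} days) of nonstop speech")
# prints: 14,400,000 cells, about 4,000 hours (167 days) of nonstop speech
```

Change the assumed rate however you like; the gap between that and a glance at a screen stays enormous.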
Hybrids mapped to use cases
The obvious answer is that various human tasks are associated with ideal input and output methods.
- Voice input is great if you’re driving.
- Text input is great if you’re in a library.
- Voice input and output are great if you’re giving your computer basic commands at home.
- Visual output is ideal if you need to see lots of data at once, or if the content itself is visual.
- Neural interfaces are basically hardware shortcuts to all of these, and it’s too early to even talk about them much.
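One way to picture the list above is as a simple lookup from situation to preferred input and output. This is only an illustrative sketch: the context labels, the `Preference` type, and the `pick` helper are names I made up, and `None` marks a slot the list doesn't say anything about.

```python
from typing import NamedTuple, Optional


class Preference(NamedTuple):
    input: Optional[str]   # None: the list above doesn't specify an input
    output: Optional[str]  # None: the list above doesn't specify an output


# Hypothetical context labels, filled in from the bullets above.
PREFERENCES = {
    "driving": Preference(input="voice", output=None),
    "in a library": Preference(input="text", output=None),
    "basic commands at home": Preference(input="voice", output="voice"),
    "viewing lots of data": Preference(input=None, output="visual"),
}

# Assumed fallback: touchscreen in, display out, which is where we mostly are today.
DEFAULT = Preference(input="touchscreen", output="visual")


def pick(context: str) -> Preference:
    """Return the preferred modalities for a context, or the default."""
    return PREFERENCES.get(context, DEFAULT)


print(pick("driving").input)               # voice
print(pick("viewing lots of data").output) # visual
print(pick("on a train"))                  # falls back to DEFAULT
```

The point of keeping it a table rather than a ranking is that nothing in it is "better" in general; each entry is only right for its row.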
Voice vs. text
One way I see voice and text that I’ve not heard anywhere else is to imagine them as different forms of the same thing, i.e., natural language. You’re using mostly natural language to convey ideas or desires.
Show me this. I’ll be right there. Tell him to pick me up. I can’t talk now, I’m in a meeting. Order me three of those.
These are all things that you could do vocally or via text. There are of course conventions that are used in text that aren’t used in vocal speech, but they largely overlap. Text, in other words, is a technological way of speaking naturally. You’re not sending computer commands; you’re emulating the same speech we had 100,000 years ago around the campfire.
Common reasons to use text vs. voice include lower social friction, the ability to do it without being as disruptive to others around you, etc. But again, they’re very similar, and in terms of human to computer interface I think we can see them as identical save for implementation details. In both cases the computer has to be good at interpreting natural human speech.
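Here is a minimal sketch of what "identical save for implementation details" could look like. Everything in it is hypothetical: the `Utterance` type, the intent names, and the toy prefix matching stand in for real speech recognition and language understanding, and the voice example assumes a transcript has already been produced upstream.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    channel: str  # "voice" or "text"
    words: str    # for voice, assumed to be a transcript produced upstream


def normalize(utterance: Utterance) -> str:
    """Lowercase and strip basic punctuation so both channels look alike."""
    cleaned = utterance.words.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def interpret(utterance: Utterance) -> str:
    """Toy intent matcher; note that it never looks at the channel."""
    text = normalize(utterance)
    if text.startswith("order me"):
        return "order"
    if text.startswith("show me"):
        return "show"
    if text.startswith("tell "):
        return "relay_message"
    return "unknown"


spoken = Utterance(channel="voice", words="order me three of those")
typed = Utterance(channel="text", words="Order me three of those.")
print(interpret(spoken), interpret(typed))  # order order
```

Once the words are in hand, the channel they arrived on stops mattering to the part that has to understand them.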
The key is being able to determine the ideal input and output options for any given human task, and to continue to re-evaluate those options as the technologies for each continue to evolve.
- There are many ways for humans to send input to, and receive output from, computers.
- These methods are not hierarchical, meaning voice is not always better than text, and audible is not always better than visual.
- Voice and text are different forms of “natural language” that computers need to be able to parse and respond to correctly.
- Human tasks will map to one or more ideal input/output methods, and those will evolve along with available technology.