Confession: I basically upgraded from the iPhone 4 to the 4S just to mess around with Siri. ((That, and the new cameras. The iPhone is the only camera I use. With two tiny kids, the camera comes out a lot.)) While the experience has been magically delicious in nearly all respects, one can’t help but continually bump into what feel like arbitrary walls. Siri can apply a relationship to a person (“Joel is my brother”), but she can’t change his birthday or move him to the top of my favorites list or perform thousands of other seemingly trivial actions. Like many others, I’m delighted by what Siri can do yet frustrated by the current limitations.
Apple has cracked open a door of possibility with the introduction of Siri. It’s not the first interface to accept voice as an input, but it might be the first to do it in a way that’s both accessible to the casual user and popular enough to matter.
Those who are quick to dismiss Siri as a gimmick cite the aforementioned functional limitations, the awkwardness of speaking aloud in public places, and the latency and artificiality as compared to science fiction’s portrayal ((See TNG, among others.)). These are all true. Many features of the iPhone are unavailable via Siri. It would be weird for someone in an office or on a bus to start talking to his or her phone. (Weirder than the Bluetooth headsets people already use?) Needing to wait for Siri to transmit and fetch data from a distant server, enunciating with excruciating precision, and finding oneself at the beck and call of those chipper beeps can be disenchanting. Yet what are these but the pains of an infant technology cutting its teeth in a world of mature graphical user interfaces? Should we reject voice-driven user interfaces a priori, scorning the possibility of hardware and software improvement?
We have, right now, a useful tool and a tantalizing glimpse at what is possible. That’s more than enough for me.
The immediate future looks clear. Apple will continue to refine the Siri experience by removing obstacles and adding features. The foundation appears to be in place for long-term growth. I have had few issues with Siri understanding my speech, and that seems to be the common experience.
What we all want to know is: how soon can Apple open the app floodgate? It’s a bewitching notion. The iPhone before apps was revolutionary. The iPhone after apps, indispensable. Can the same be true of Siri?
A Hypothetical Scenario for Siri-alizing Apps.
First, let’s give Siri the ability to open apps, something that it can’t do right now. ((Application launching is something of a middle ground for me. While I believe Apple is most interested (and ought to be) in unleashing speaking and listening as a peer experience to looking and touching rather than voice as simply an alternative for your finger, I expect them to make small compromises in that direction. Essentially, there’s no reason voice shouldn’t make the whole experience richer rather than living in a one-dimensional ghetto.)) “Siri! Launch Tweetbot.” ((Where by “Siri!” I mean, “press and hold the home button until Siri launches.”)) Tweetbot appears on the screen. Because we’re smitten with this Siri thing, we want the ability to perform actions in our current context.
Consider what happens next. Since Tweetbot saves my state automatically, I’m looking at my “Sports” Twitter list. From this screen alone, I can: change the list I am viewing, open the compose tweet screen, refresh the list, search the list, switch accounts, select a tweet as the target for additional actions, switch to my mentions, direct messages, starred tweets or profile, or view replies to a tweet. That’s one screen, and I probably didn’t even provide a comprehensive inventory of available actions.
“Refresh tweets” might be a perfectly adequate synonym for the pull-to-refresh mechanic we’ve become accustomed to, but what if I want to interact with a specific tweet? Should a “cursor” appear on the screen indicating the currently active tweet? Of course not. Tweetbot, like every other native iOS app, has been designed with touch as the foremost interaction method. ((Apple’s incredible accessibility achievement with the iPhone notwithstanding.)) By attempting to force voice input into our current graphical conventions, we’re in jeopardy of the same errors game developers have routinely made in attempting to port joystick-based games to the touch environment. What was developed for one input, especially if the input was properly understood, is inappropriate to varying degrees for use in another. Furthermore, within this scenario, we have both created for ourselves the non-trivial job of replicating all screen functionality as voice functionality and restricted what we can do with voice to what we can see on the screen.
What’s the Alternative?
As much as I would like to see Siri become a tool for users willing to spend the time necessary to learn the interface, ((Like Quicksilver or Enso.)) Apple appears to be determined to create something else, something that hasn’t really been done before: a conversational user interface. You state a command, Siri complies (if possible) and provides feedback. It’s a much longer, more tedious process, but it might be the only one that can actually work without extensive training.
So what should Apple do to truly embrace voice-driven user interfaces? First, abandon the traditional concept of applications. In the world of Siri, applications are incidental. Data sources matter, commands matter, natural language parsing matters; applications are the occasional byproduct of asking Siri to perform a task and having that request fulfilled. The appropriate paradigm is services. ((Incidentally, services are the one thing I want more than anything else on the iPhone today. Developers have hacked around this with custom URL structures, but it’s no substitute for the real thing.)) Instead of registering applications, developers would register a Siri service with Apple. The end user would navigate to a special section of the App Store that housed only VUI services. It’s Newsstand for Siri!
Maybe Tapbots wants to make a Siri service. Services (unlike applications) are able to be used instantly (within Siri) by simply stating the service name plus the desired action. There is no launching a service. “Use Tweetbot to read me my tweets.” Siri answers, “I am loading your latest tweets.” ((While it’s important to be generous in what Siri can accept, certain components are essential to accomplishing the desired task. At minimum, we need to include the name of the service (Tweetbot, “subject”), the intended action (read, “verb”) and the object of the action (tweets, “direct object”). Other modifiers can also be supported.))
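To make the subject, verb, direct object idea concrete, here is a rough sketch in Python of how a phrase like “Use Tweetbot to read me my tweets” could be split into its three essential parts. Everything here is hypothetical: the `Command` type, the `parse_command` function and the single rigid sentence pattern are my own illustration, not anything Apple has published, and a real natural language parser would be far more generous about what it accepts.

```python
import re
from typing import NamedTuple, Optional


class Command(NamedTuple):
    service: str  # the "subject", e.g. tweetbot
    verb: str     # the intended action, e.g. read
    obj: str      # the direct object, e.g. tweets


# Hypothetical pattern: "Use <service> to <verb> [me] [my|the] <object>"
_PATTERN = re.compile(
    r"^use\s+(?P<service>\w+)\s+to\s+(?P<verb>\w+)\s+"
    r"(?:me\s+)?(?:(?:my|the)\s+)?(?P<obj>.+?)[.!?]?$",
    re.IGNORECASE,
)


def parse_command(utterance: str) -> Optional[Command]:
    """Return the three essential parts, or None if the phrase does not fit."""
    match = _PATTERN.match(utterance.strip())
    if match is None:
        return None
    return Command(
        service=match["service"].lower(),
        verb=match["verb"].lower(),
        obj=match["obj"].lower(),
    )


if __name__ == "__main__":
    print(parse_command("Use Tweetbot to read me my tweets."))
```

A phrase that omits the service name simply returns `None` here, which is the point at which a real system would need to fall back on context.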
Once Siri begins reading the tweets, we should expect her to pause after each tweet to allow us the opportunity to respond. Unfortunately, today that means pressing the microphone button on the screen. If Siri is to achieve its true potential, we’re going to need to be able to invoke it by just saying “Siri!” and, nearly as importantly, we need to be able to interrupt it. ((This is no small challenge. Our phones would need to be constantly listening for this keyword, which is a battery killer. At this point, we’re basically talking about including all the computational power of Apple’s data center in a hand-held device. We’re not even close.))
At this point we might say something like: “That’s funny. Let’s star that tweet.” Behind the scenes, Siri is magically parsing my cryptic human language. As we’re in the Tweetbot context, Siri knows to interpret these commands against the Tweetbot provided options. “Star” plus possibly a dozen other words can perform the same action. It might also accept “like”, “favorite”, “heart”, “save”, and more. It’s also going to need to understand the word “that”. For Siri, “that” can mean a lot of different things. Here it’s critical it means “the thing we were just talking about”. It also needs to ignore “that’s funny.”
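Here is a minimal sketch of that interpretation step, assuming a service could hand Siri a table of synonyms for each action. The `SYNONYMS` table, the `Context` class and the `interpret` function are all invented for illustration. The sketch ignores everything spoken before the verb (so “that’s funny” falls away) and resolves “that” to whatever item the context last mentioned.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical synonym table a service might register for one action.
SYNONYMS = {
    "star": {"star", "like", "favorite", "heart", "save"},
}

# Words that point back at "the thing we were just talking about".
REFERENTS = {"that", "this", "it"}


@dataclass
class Context:
    service: str
    last_item: Optional[str] = None  # e.g. the tweet Siri just read aloud


def interpret(utterance: str, context: Context) -> Optional[Tuple[str, str]]:
    """Return (action, target) or None when nothing actionable is found."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for index, word in enumerate(words):
        for action, variants in SYNONYMS.items():
            if word not in variants:
                continue
            # Only words after the verb can name the target; chatter
            # before the verb is ignored.
            following = words[index + 1:]
            has_referent = any(w in REFERENTS for w in following)
            if has_referent and context.last_item is not None:
                return action, context.last_item
    return None


if __name__ == "__main__":
    ctx = Context(service="tweetbot", last_item="tweet-42")
    print(interpret("That's funny. Let's star that tweet.", ctx))
```

Matching on single words is obviously naive (“I like pizza” contains a synonym for “star”), which is exactly why being inside the Tweetbot context matters so much.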
What happens if Siri doesn’t understand? Well, at first Siri should probably break out of context to see if there are any alternative means of fulfilling the query. If not, Siri already has error handling: she says, “I’m sorry, I don’t understand”, or some such euphemism.
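That order of operations (current context first, everything else second, apology last) might be sketched like this. The handler functions and the `respond` dispatcher are hypothetical names of my own; each handler returns a reply if it can fulfill the request and `None` if it cannot.

```python
from typing import Callable, Iterable, Optional

# A handler returns a spoken reply, or None if it cannot help.
Handler = Callable[[str], Optional[str]]

FALLBACK_MESSAGE = "I'm sorry, I don't understand."


def respond(
    utterance: str,
    context_handler: Optional[Handler],
    global_handlers: Iterable[Handler],
) -> str:
    """Try the active service first, then break out of context, then give up."""
    if context_handler is not None:
        reply = context_handler(utterance)
        if reply is not None:
            return reply
    for handler in global_handlers:
        reply = handler(utterance)
        if reply is not None:
            return reply
    return FALLBACK_MESSAGE


# Two toy handlers, purely for illustration.
def tweetbot_handler(utterance: str) -> Optional[str]:
    return "Starred." if "star" in utterance.lower() else None


def weather_handler(utterance: str) -> Optional[str]:
    return "Here is the forecast." if "weather" in utterance.lower() else None


if __name__ == "__main__":
    print(respond("What's the weather?", tweetbot_handler, [weather_handler]))
```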
Back in the narrative, we’ve starred the tweet. Siri either continues to read the tweets automatically or needs to be re-engaged by us. Let’s be explicit, “Siri, resume reading the tweets.” “Resume” or “continue” should always restart the previous task. Siri moves on to the next tweet, but by this time we’re bored. We say, “Read tweets from my sports list.” The keyword “list” needs to be interpreted as a Tweetbot command. The name of the list needs to be processed, but at this point, we’re right back where we started. Even a slight variation, however, could have radically different results. What if we said instead, “Read tweets about sports”? In that case, Tweetbot might query the Twitter API for the tag “sports” or it might even have a dictionary of sports-related terms if the data were pre-structured.
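To show how small the gap is between those two requests, here is a toy classifier in Python. The patterns and the intent names (`resume`, `read_list`, `search_topic`) are assumptions made up for this example; a real service would need something much more forgiving than three regular expressions.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for the three phrasings discussed above.
_RESUME = re.compile(r"^(?:siri,?\s+)?(?:resume|continue)\b", re.IGNORECASE)
_LIST = re.compile(
    r"^read tweets from (?:my |the )?(?P<name>.+?) list[.!?]?$", re.IGNORECASE
)
_TOPIC = re.compile(r"^read tweets about (?P<topic>.+?)[.!?]?$", re.IGNORECASE)


def classify(utterance: str) -> Tuple[str, Optional[str]]:
    """Map a spoken phrase to an (intent, argument) pair."""
    text = utterance.strip()
    if _RESUME.match(text):
        return ("resume", None)  # restart whatever task came before
    match = _LIST.match(text)
    if match:
        return ("read_list", match["name"].lower())  # a named Twitter list
    match = _TOPIC.match(text)
    if match:
        return ("search_topic", match["topic"].lower())  # a search query
    return ("unknown", None)


if __name__ == "__main__":
    print(classify("Read tweets from my sports list."))
    print(classify("Read tweets about sports"))
```

One preposition and one trailing keyword separate opening a list from running a search, which is why the parsing has to be owned by the service that knows what “list” means.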
Until recently, voice-driven user interfaces were fantasy, or science fiction at best. Now, we have one that works reasonably well within a narrow enough context. Even better, Siri is available on the computer we carry with us all the time rather than the one sitting on a desk. Yet, for now, the magic actually takes place not on this pocketable device but instead on battalions of servers in a distant data center. The delays we experience while using Siri are crucial. Audio files of the sounds recorded by Siri as we uhm and uhh our way through asking her to do us a favor need to be shipped across the Internet, processed into her best guess at the words we intended to communicate, submitted to her vast database for comparison with all possible ways we could have asked her assistance, and, eventually, offered back to us as a discrete action she is able to take on our behalf.
That Siri works at all is a tribute to modern advancements in processing strength, power consumption, and network speed and ubiquity ((Or have we now moved from ubiquity to invisibility?)). That Siri is not yet the omnipresent, omniscient, omnicapable Computer of Star Trek is in all likelihood a difference in scale, not kind. It is not unthinkable to imagine a future only a few years from now in which a device the size of the iPhone can remove the quirks and sources of friction we currently experience. With better batteries, more storage, faster processors, smarter algorithms, and speedier connections, it may not be guaranteed to happen, but who will deny the realistic possibility?
This is a revolutionary interface. We’re not going to get by using our hard-earned graphical instincts. The Herculean task facing Apple is educating developers on how to write a Siri service. Making Siri work with Apple’s internal services was no doubt difficult, as evidenced by the frequent downtime and the relatively few available features. Enforcing this level of conceptual change on external developers is almost unimaginably hard. It may not even be possible. Apple may decide to keep Siri in-house indefinitely, slowly expanding the available services. I could live with that. It already makes my life much easier in many ways. But I know we’re all just dying to see the full potential realized. For that to happen, Apple needs to unleash this force by enabling third-party development. The only way this works, however, is to conceive of it as a completely separate interface not handicapped (or propped up) by the existing iOS interface paradigms of a home screen, little icons representing applications, gestures and the rest. The new interface is the Siri voice and what can be shown within the Siri application. Applications are now simply services of Siri. And Apple is going to need to drill the concepts of VUI into developers who have never dreamed of such a thing. Remember the HIG? That’s going to be big again. Just like the release of the Macintosh required developers to learn and accept GUI principles, Siri redefines what it means to use a computer, and that means grokking VUI from the ground up. ((I have chosen to focus on what I believe Apple may have in store for Siri and, also, what the perfect voice user interface looks like. It’s entirely possible that many good or at least interesting VUIs could be designed to supplement the traditional graphical user interface. Unfortunately, companies can generally only really go in one public direction at any given time. Perhaps Google, Microsoft, RIM, and HP can take up the gauntlet for bringing innovative voice features about in other ways.))