Rethinking the User Interface of Tomorrow: Voice and Emotion Recognition as Auxiliary Channels
In this article, I would like to introduce my concept of a brand-new type of user interface for web, mobile, and desktop. Creating something new and innovative isn’t easy on the web. Most types of interaction have already been defined and tested. However, I believe this approach will work if things keep going the way they are going now.
My idea is to use voice and video input as an auxiliary, optional channel in the interaction between the user and the system. Systems such as Siri or Alexa are built on a dialogue concept in which the voice channel is the primary one. In my concept, the system uses voice input as a background channel and relies on the traditional UI for context.
For example, if you exclaim “Why?!” when something happens, the system instantly shows an icon (bouncing?) in a designated area, one click away from an explanation of why it happened.
In this concept, the voice recognition component is optional. Do you remember the voice recognition elevator video? That is what happens when the voice channel is primary. If Siri or Alexa understands you on the first try, you are a king in other people’s eyes. If you have to repeat yourself several times, with different accents and in different ways, it just looks silly. In my concept, the voice channel comes into play only when it succeeds in extracting useful data from your voice. Otherwise, it ignores everything you say. No need to repeat. No need to wait for an action. You use the reliable channels first, as the primary ones: your mouse, your keyboard, and your display.
In my concept, the system constantly listens to your voice, recognizes your words, and tries to improve the navigation using the information it receives from the voice channel. If it fails (the room is noisy, the voice is too quiet, the words mean nothing to it, etc.), nothing happens. If it succeeds, the recommended navigation items move closer to the user. Certainly, the system won’t put red shoes among green ones if the user has explicitly set the filter to “green”. Making this interaction smooth is a UI design task. For example, the suggestions could appear as clickable pop-up tooltips in a designated area.
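To make the "succeed or stay silent" rule more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Recognition` record, the `MIN_CONFIDENCE` threshold, the phrase table, and the function names are hypothetical and do not belong to any real speech API. A real prototype would receive the transcript and confidence from whatever recognizer it uses.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values, chosen only for illustration.
MIN_CONFIDENCE = 0.8
KNOWN_PHRASES = {
    "why": "show_explanation",
    "too expensive": "suggest_cheaper",
    "i need a hat": "suggest_hats",
}


@dataclass
class Recognition:
    transcript: str
    confidence: float  # 0.0 .. 1.0, as reported by some recognizer


def hints_from_voice(rec: Optional[Recognition]) -> list[str]:
    """Return UI hints, or an empty list if nothing useful was heard."""
    if rec is None or rec.confidence < MIN_CONFIDENCE:
        return []  # noisy, too quiet, or uncertain: nothing happens
    # Keep letters and spaces only, so "Why?!" becomes "why".
    cleaned = "".join(
        c for c in rec.transcript.lower() if c.isalpha() or c.isspace()
    )
    padded = f" {' '.join(cleaned.split())} "
    return [
        hint for phrase, hint in KNOWN_PHRASES.items()
        if f" {phrase} " in padded
    ]


def rank_products(products, explicit_color=None, voice_color=None):
    """The explicit filter always wins; the voice hint only reorders."""
    visible = [p for p in products if explicit_color in (None, p["color"])]
    # sorted() is stable: items matching the voice hint move to the
    # front, everything else keeps its original order.
    return sorted(visible, key=lambda p: p["color"] != voice_color)
```

With this sketch, `hints_from_voice(Recognition("Why?!", 0.95))` yields the explanation hint, while a low-confidence or meaningless transcript yields an empty list and the UI stays as it is. Likewise, `rank_products` never shows a red shoe when the explicit filter is "green", no matter what the voice channel heard.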
For example, you open a shoe e-shop to find new shoes. You find something, open the product card, and from the larger image and the description you see that the shoes are made of real leather. “Poor cows! I don’t like real leather!” you exclaim, and you return to the homepage. The system re-sorts and rebuilds the product carousels on the homepage. It also notifies you that leather shoes are now hidden from the main page. In the product listings, such products are marked with a “leather” icon. You can click “revert” if you want to back out. You can also accompany your navigation with phrases such as “Like it”, “Don’t like it”, “Too expensive”, “I need a hat… where is it?”, “Interesting, let’s come back tomorrow”, and so on. All these phrases will be interpreted and taken into account.
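The leather scenario can be sketched in a few lines of Python as well. Again, this is only an illustration under my own assumptions: the `SessionPreferences` class, the `KNOWN_MATERIALS` set, and the simple regular expression are hypothetical stand-ins for a real language-understanding component.

```python
import re

# Hypothetical list of materials the shop knows about.
KNOWN_MATERIALS = {"leather", "suede", "canvas"}


class SessionPreferences:
    """Preferences picked up from the voice channel, with undo."""

    def __init__(self):
        self.disliked = set()
        self._history = []  # materials in the order they were hidden

    def hear(self, phrase):
        """Return the newly hidden material, or None if nothing changed."""
        text = phrase.lower().replace("’", "'")
        match = re.search(r"don'?t like (?:real )?(\w+)", text)
        if not match:
            return None
        material = match.group(1)
        if material not in KNOWN_MATERIALS or material in self.disliked:
            return None  # e.g. "don't like it": too vague, ignore
        self.disliked.add(material)
        self._history.append(material)
        return material  # the UI can now show a notification

    def revert(self):
        """Undo the most recent voice-driven change, if there is one."""
        if not self._history:
            return None
        material = self._history.pop()
        self.disliked.discard(material)
        return material

    def homepage(self, products):
        """Disliked materials are hidden from the main page."""
        return [p for p in products if p["material"] not in self.disliked]

    def listing(self, products):
        """In listings, such products stay visible but get an icon."""
        return [
            {**p, "icon": p["material"] if p["material"] in self.disliked else None}
            for p in products
        ]
```

After `hear("Poor cows! I don’t like real leather!")`, the homepage drops leather shoes and the listing marks them with a "leather" icon; `revert()` restores the previous state. A phrase the sketch cannot map to a known material changes nothing, which is the whole point of an optional channel.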
This concept will work for desktop applications and mobile UIs as well.
This is not rocket science. All the necessary technologies are already available. The idea can easily be implemented as a working prototype for field testing.
The next logical step is to use the webcam as another channel, applying eye tracking and emotion recognition to learn about the user’s hidden intentions and likely preferences. Mouse tracking will help with this as well.
Certainly, nobody feels good about sending such information to internet servers for processing. The machine learning component should be closer to the user: it should be part of the browser or the OS. A cloud service may still be leveraged for resource-intensive analysis.
What do you think?
© Rauf Aliev, June 2017