Amazon Alexa: 'Pre-wakeword' patent application suggests plans to process more of your speech
Speaker identification and more listening in: welcome to the future of voice assistants
Amazon has submitted a patent application which suggests the firm plans to capture your speech and send it to the cloud for processing before as well as after hearing a "wakeword" trigger.
The usual deal with these devices is that they sit quiescent until triggered by the wakeword, in this case typically "Alexa", and then parse your subsequent speech by sending it to the cloud. Of course Alexa has to listen all the time for the wakeword, but other speech is normally ignored. "When you want to communicate with your Alexa devices, just say the wakeword," says Amazon's help.
As the January 2019 patent application noted, "privacy concerns may make it undesirable for a local device to send all captured audio to a remote device". Still, it seems unhelpful to restrict processing to cases where the user says "Alexa play music" rather than "Play music Alexa", for example.
The application therefore proposes that the device should buffer speech recording in case you might be about to say the wakeword:
"To perform speech processing with a variable wakeword location, a system may be configured to buffer input speech. As speech is buffered, the system may insert markers or indications in the buffered speech that are likely to correspond to subject changes, which may in turn correspond to spoken commands. Further, as speech is buffered the system may perform wakeword detection. When a wakeword is detected, the system may look backwards to the most recent marker that preceded the wakeword. That marker represents a location where the command likely began."
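The look-back step in that passage can be illustrated with a minimal sketch. This is not Amazon's implementation: it assumes the buffer has already been turned into words with a measured pause before each one, and it treats any pause above a fixed threshold as a marker. The names `Token`, `extract_command` and `pause_threshold` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    pause_before: float  # seconds of silence before this word (assumed input)


def extract_command(tokens, wakeword="alexa", pause_threshold=0.7):
    """Return the words from the most recent marker up to the wakeword.

    A marker is placed wherever the pause before a word reaches
    pause_threshold, a stand-in for the application's "subject change"
    indications. Returns None if the wakeword never appears.
    """
    marker = 0  # index of the most recent marker; buffer start by default
    for i, tok in enumerate(tokens):
        if tok.pause_before >= pause_threshold:
            marker = i
        if tok.word.lower() == wakeword:
            # Look backwards: the command likely began at the last marker.
            return [t.word for t in tokens[marker:i + 1]]
    return None


buffered = [
    Token("nice", 0.0), Token("weather", 0.1),
    Token("play", 1.0), Token("music", 0.1), Token("Alexa", 0.2),
]
command = extract_command(buffered)
```

Here the chatter before the long pause is dropped and only "play music Alexa" would go forward for processing. Speech that follows the wakeword would be handled in the usual way and is outside this sketch.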
There is more: Alexa may have eyes as well as ears.
The application reads:
"The device … may be a device that detects video and audio (such as a video camera, device with a camera and microphone, etc.) The device may also send that video data (which corresponds to the utterance that includes the wakeword) to the system for processing such as facial recognition, improved speaker identification, or other image processes."
The system may allow for the different speech characteristics of specific individuals, by performing speaker identification:
"The length of a pause sufficient to qualify the pause as the beginning of a new utterance (as opposed to the continuation of an utterance in progress) may depend on the identity of the speaker. If the system is configured to perform speaker identification (techniques for which are known in the art), the system may identify the speaker and adjust the pause length."
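As a rough illustration of a speaker-dependent pause length, the sketch below looks up a threshold per identified speaker and falls back to a default. The speaker labels and all the timings are invented for the example; the application gives no figures.

```python
# Hypothetical thresholds in seconds; none of these values come from the application.
DEFAULT_PAUSE = 0.7
SPEAKER_PAUSE = {"fast_talker": 0.4, "slow_talker": 1.2}


def starts_new_utterance(pause_seconds, speaker_id=None):
    """True if the pause is long enough to mark a new utterance for this speaker."""
    threshold = SPEAKER_PAUSE.get(speaker_id, DEFAULT_PAUSE)
    return pause_seconds >= threshold
```

With these assumed values, a 0.8 second pause would split the speech of a brisk or unidentified speaker but would count as a mid-sentence hesitation for a slower one.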
The patent application also goes into some detail about "beamforming techniques", which use an array of microphones to capture spatial as well as audio information (the techniques apply to sound as well as radio waves). This can improve recognition in noisy environments and help Alexa to track an utterance when the speaker is holding forth while walking around a room.
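For readers unfamiliar with the idea, the simplest textbook form of beamforming is delay-and-sum: shift each microphone's signal to cancel the extra travel time from the chosen direction, then average. The sketch below is that generic technique, not the method in the application, and it assumes whole-sample delays that are already known. The function name and sample values are made up.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its delay (in samples) and average them.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   how many samples later the sound reached each microphone.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


signal = [0, 1, 0, -1, 0, 1, 0, -1]
mic_near = signal            # sound arrives here first
mic_far = [0, 0] + signal    # same sound, two samples later

steered = delay_and_sum([mic_near, mic_far], [0, 2])    # aimed at the speaker
unsteered = delay_and_sum([mic_near, mic_far], [0, 0])  # no alignment
```

When the delays match the speaker's direction the two copies add up and the original signal is recovered. With the wrong delays they partly cancel, which is how sound from other directions gets suppressed.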
All of this makes perfect sense in the context of making voice assistants and voice control work better. The examples in the application are drawn from the reassuringly inconsequential topic of playing music by voice control.
Whether such common-sense techniques are or should be patentable is another issue.
The wider problem, though, is that cloud-connected devices in homes, offices and hotels that listen to, and perhaps even watch, everything you do, complete with identification of who is at home, put great power into the hands of tech giants which already have too much. The potential for targeted advertising is huge.
The Wall Street Journal recently reported on how Zego, an Amazon partner, was installing its Alexa-based system in US rented apartments. "We can predict if residents are happy based on their digital interactions with the service, which gives us more information about whether they will renew their leases," said Zego CEO Adam Blake, quoted in the report.
Alexa has also been installed as a guest in hotel rooms via Alexa for Hospitality, with Marriott an early partner, though there is no suggestion that the hotel will be able to listen in.
Voice assistants are getting better and have obvious potential to improve our lives, but it will be down to the twin forces of regulation and responsibility to ensure that the intrusive aspect of the technology is not abused. History tells us that those two forces are often insufficient. ®