The Challenges and Threats of Automated Lip Reading
An anonymous reader writes: Speech recognition has gotten pretty good over the past several years. It's reliable enough to be ubiquitous in our mobile devices. But now we have an interesting, related dilemma: should we develop algorithms that can lip read? It's a more challenging problem, to be sure. Sounds can be translated directly into words, but deriving meaning out of the movement of a person's face is much more complex. "During speech, the mouth forms between 10 and 14 different shapes, known as visemes. By contrast, speech contains around 50 individual sounds known as phonemes. So a single viseme can represent several different phonemes. And therein lies the problem. A sequence of visemes cannot usually be associated with a unique word or sequence of words. Instead, a sequence of visemes can have several different solutions." Beyond the computational aspect, we also need to decide, as a society, if this is a technology that should exist. The privacy implications extend beyond that of simple voice recognition.
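To make the viseme ambiguity in the summary concrete, here is a minimal sketch in Python. The phoneme-to-viseme table and the mini-lexicon are hypothetical and heavily simplified (they are not taken from the article); the point is only that several distinct words collapse onto the same viseme sequence.

```python
from collections import defaultdict

# Hypothetical, heavily simplified phoneme-to-viseme table (illustration only).
PHONEME_TO_VISEME = {
    "p": "BILABIAL", "b": "BILABIAL", "m": "BILABIAL",
    "f": "LABIODENTAL", "v": "LABIODENTAL",
    "t": "ALVEOLAR", "d": "ALVEOLAR", "n": "ALVEOLAR",
    "ae": "OPEN", "eh": "OPEN",
}

# Hypothetical mini-lexicon: word -> phoneme sequence.
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "man": ["m", "ae", "n"],
    "pet": ["p", "eh", "t"],
    "fat": ["f", "ae", "t"],
    "van": ["v", "ae", "n"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to the viseme sequence a camera would see."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def group_by_visemes(lexicon):
    """Group words that are visually indistinguishable under the table above."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[to_visemes(phonemes)].append(word)
    return dict(groups)


groups = group_by_visemes(LEXICON)
for visemes, words in groups.items():
    print(visemes, "->", sorted(words))
```

Under these assumed tables, eight words collapse into just two viseme sequences, which is the "several different solutions" problem the quoted passage describes.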
HAL 9000 (Score:5, Funny)
Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.
Re: (Score:2, Funny)
To everyone else: If something along these lines was NOT your first thought, please turn in your geek card.
Re: (Score:2)
Sorry, I was still stuck on the claims of reliability in the first line of the article. Now my trousers are damp and I must change them.
Re: (Score:2)
We already have laser microphones, which can detect sound vibrations at a distance, and we have sophisticated sound processing methods to extract weak signals from noise, etc. We don't need lip reading, other than maybe as a fun science project for graduates.
Re: (Score:2)
Re: (Score:2)
That's ridiculous. If you can lip read the reflection in a terrorist vid, then you can see the person's face, and you don't need to know what he's talking about, you can arrest him for being an accessory. If you can't see the person's face, try using Photos
Re: (Score:2)
Re: (Score:2)
Whatcha need... (Score:2)
...is a little monitor that hangs over your lips, showing a silent movie of your lips saying (in a loop) "I suspect I'm under surveillance" while underneath, you can be saying anything you like. :)
Re: (Score:1)
Re: (Score:1)
Why didn't Slashdot editors use the HAL eye icon (eyecon?) instead of the lock? I'm disappointed and will increase my trolling 35% in protest.
Comment removed (Score:3)
Re: Thanks Jerry Mahoney! (Score:1)
"I still can't allow you and Cmdr. Poole to jeopardize this mission, Dave."
Re: (Score:1)
Who said that?
NSA probably already has this technology (Score:1)
Re:NSA probably already has this technology (Score:5, Interesting)
I'd be very surprised if the false positive rate were as low as 1%. Lip reading is NOT an exact science. It depends on context, clear line-of-sight, and how well the speaker enunciates. You'd be amazed how many phonemes sound different to our ears but look identical on the lips.
But hey, I'll let these guys explain it much better. Bad Lip Reading [youtube.com]
Hilarious stuff, but the point is relevant: Without *any editing at all* of the actors' lips, they are able to perfectly match ridiculous words to those mouth movements. Why would automated software pick the "real" words over the BLR version?
Re: (Score:2)
Not at all useless. Simply decode all possible sequences and rank them, ranking the most self-consistent interpretation highest. You may also have other sources of data to help correlate the interpretation (there was an article earlier this year about measuring sound using the video footage of a mylar potato chip bag's vibrations.) Even if the room is crowded, it might be possible to identify a few isolated words from the audio recording of the conversation.
The next thing you do is throw away those conve
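The "decode all possible sequences and rank them" idea can be sketched roughly as follows. This is only an illustration: the candidate words per position and the bigram counts are made up, standing in for a language model trained on real text, and the scoring is a simple smoothed-count score rather than a proper probability.

```python
from itertools import product
import math

# Hypothetical candidates: each position is one visually ambiguous word slot.
CANDIDATES = [
    ["the"],
    ["bat", "mat", "pat"],
    ["sat", "sad"],
]

# Hypothetical bigram counts, standing in for statistics from real language samples.
BIGRAM_COUNTS = {
    ("<s>", "the"): 50,
    ("the", "bat"): 4,
    ("the", "mat"): 6,
    ("the", "pat"): 1,
    ("bat", "sat"): 1,
    ("mat", "sat"): 5,
    ("pat", "sat"): 1,
    ("mat", "sad"): 1,
}


def log_score(words, counts, smoothing=0.5):
    """Sum of log smoothed bigram counts; higher means more self-consistent."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(counts.get((prev, word), 0) + smoothing)
        prev = word
    return total


def rank(candidates, counts):
    """Enumerate every combination of candidates and sort best-first."""
    sequences = product(*candidates)
    return sorted(sequences, key=lambda s: log_score(s, counts), reverse=True)


ranked = rank(CANDIDATES, BIGRAM_COUNTS)
for sequence in ranked:
    print(" ".join(sequence), round(log_score(sequence, BIGRAM_COUNTS), 2))
```

Exhaustive enumeration grows exponentially with sentence length, so a practical decoder would prune the search (for example with beam search or Viterbi decoding) instead of listing every sequence.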
Re: (Score:2)
But neither should be able to get a warrant because of the inaccuracy, yet it will be presented as having no inaccuracies at all.
Re: (Score:2)
They don't need a warrant if they're not trying to gather admissible evidence. See "parallel construction" for an example of what they do with this data.
Re: (Score:2)
Why would automated software pick the "real" words over the BLR version?
Those BLR guys are going out of their way to produce something ridiculous.
You can train recognition software using real language samples and some grammar rules.
Why would you assume that we can't strap these two technologies together?
Re: (Score:2)
They did it as a humorous example, but there are many words which have negligible or synonymous lip movements. And as someone pointed out, ventriloquism is easily learned and *will* be as soon as authorities start using lip reading software.
Re: (Score:2)
Because we as humans weren't even able to do it. Cast your mind back to the 2006 World Cup finals when Zidane head-butted an opposing player.
Lip readers concluded a wide range of possible answers to why he lashed out including calling him a terrorist, insulting his mother, and saying his sister was a prostitute. This may be something specific to the Italian language that the words may sound the same but it highlights the problem. All conclude that he said the Italian for "go fuck yourself" at the end.
The pr
Re: (Score:2)
"Dude, you punched a f-ii-sh."
Frelling awesome!
The real point is, though, that although some of those redubbed conversations are like Jabberwocky, some exchanges are reasonable (and some are spot-on visual homonyms, like the fish interpretation above), demonstrating that lip reading is wildly underconstrained.
Re: (Score:3)
You can't for example tell the difference between "nine" and "ten" by lip reading, and often either could be equally likely in the context.
Re: (Score:2)
You perhaps can't distinguish nine from neun (German) or ten from zehn (German), but 9 from 10 in most languages is easily distinguished ... perhaps you just need practice?
Re: (Score:2)
I spelt it out in words because I was talking about English. Obviously French is completely different.
When saying either of those words, first the tongue moves down from the top of the mouth, then you say a vowel, and the difference between them is at the back of the mouth where you can't see. Then you have the "n" which is the same in both words.
Re: (Score:2)
You can't definitively tell the difference between nine and ten. However, nine is generally a little longer and less abrupt. I'd guess that I could get over 90% accuracy lip reading people who are just saying nine and ten. General speech is a different matter.
Re: (Score:2)
According to the article that would be groups of 5 phonemes (on average) that look identical.
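The "groups of 5" figure follows from the numbers quoted in the summary (around 50 phonemes, 10 to 14 visemes). A quick check, with the word length of 4 visemes chosen purely as an illustrative assumption:

```python
# Figures quoted in the story summary.
PHONEMES = 50
VISEMES_LOW, VISEMES_HIGH = 10, 14


def phonemes_per_viseme(phonemes, visemes):
    """Average number of phonemes that share one mouth shape."""
    return phonemes / visemes


def candidate_sequences(avg_per_viseme, length):
    """Rough count of phoneme sequences consistent with `length` visemes."""
    return avg_per_viseme ** length


high_ambiguity = phonemes_per_viseme(PHONEMES, VISEMES_LOW)   # 10 visemes
low_ambiguity = phonemes_per_viseme(PHONEMES, VISEMES_HIGH)   # 14 visemes
print(high_ambiguity, round(low_ambiguity, 2))

# Assumed example: a word spanning 4 visemes.
print(candidate_sequences(high_ambiguity, 4))
```

So the average is somewhere between about 3.6 and 5 phonemes per viseme, and the number of phoneme strings matching a viseme sequence grows exponentially with its length (most of which are not real words, which is why a dictionary and context help so much).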
Re: (Score:2)
I can lip read a little (my hearing was awful as a child). I still always look at people's mouths when I'm talking with people to get extra information - my hearing's currently worse than average, but not too bad - I have trouble with background noise.
There have been some times watching quiz shows when I've read the contestant's lips (when they're conferring) to get the answer they're going to say before they've said it, and repeated it to the room. That being said, I agree it's far from an exact science.
Re: (Score:2)
You are making a fundamentally flawed assumption that the government cares about false positives. I think our no-fly lists, jails, and police militarization are a pretty good indicator that a low false positive rate does not figure into calculations as far as the NSA, TSA, DHS or other TLAs are concerned. A cynical man (or woman) may also wonder about whether true positive rate figures into their calculations at all as well, or whether a power grab is the sole purpose of these agencies.
Re: (Score:2)
Judging by the false-positives rate, a case might be made that they are in fact aiming for zero negatives.
Re: (Score:1)
The NSA couldn't care less about false positives. They just mean the budget has to be upped that much more next year.
Jesus H Christ! (Score:5, Insightful)
We're all going to have to start wearing Burkas if we want any privacy at all.
Jesus H Christ! (Score:2)
We're trying to catch the terrorists, not dress like them.
Re: (Score:3)
More like CV Dazzle [cvdazzle.com]
A burka will get you "profiled". Weird hair and makeup is a fashion statement.
Re: (Score:1)
We're all going to have to start wearing Burkas if we want any privacy at all.
No because a microphone will be on every corner. They'll have all the cases covered.
Re: (Score:2)
Cobra Commander was SO ahead of his time!
ps. Go Cobra!
Too bad (Score:5, Insightful)
Too bad it never stopped anyone before.
Re: (Score:2)
In the end, I suspect we'll decide that the advantages outweigh the disadvantages, and pass laws to protect people from the disadvantages. I'm not saying this will be ideal, but it will be the best we can do.
We have faced, or are facing the same issue with other technologies such as face recognition, profiling, genome sequencing, etc.
The legal system (Score:2)
If lip reading software reaches the courts, suddenly all video recording becomes wiretapping. The courts might resolve that by allowing audio recording wherever they allow video recording. Or by forbidding video recording wherever they forbid audio recording. Or maybe they will finally do something about that ancient "wiretapping" deal they've been twisting into the modern world.
Crap (Score:2)
It's a load of garbage anyway. There's nothing this technology does to invade privacy that we can't already do.
You're in the open, then use a parabolic mic to pick up the conversation you're clearly already taping.
You're behind some glass, then use a laser microphone to pick up the conversation, which, while it sounds James Bondish, actually already exists.
As a society we're already too little too late on the privacy side.
How Naive (Score:5, Insightful)
Beyond the computational aspect, we also need to decide, as a society,
Re: (Score:3)
If we don't get it, the terrorists will get it first.
Re: (Score:2)
Re: (Score:2)
But not before the marketing scum. There are already screens that advertise different things depending on your gender, determined by a little camera above it.
Re: This technology *will* exist... (Score:1)
There's lots of cameras deployed without microphones. Also pretty sure sound doesn't make it to geosynchronous orbit strata of the atmosphere...
Re: (Score:3)
There's lots of cameras deployed without microphones. Also pretty sure sound doesn't make it to geosynchronous orbit strata of the atmosphere...
You're implying we could read lips from GEO. Good luck with that. Even if the Hubble Space Telescope (which is at low earth orbit, not geosynchronous) were pointed at the earth, the best resolution you could manage would be about 30 cm.
http://www.spacetelescope.org/... [spacetelescope.org]
https://what-if.xkcd.com/32/ [xkcd.com]
In theory it might be possible to read lips at GEO, but you'd need a HUGE telescope, or smaller binocular-configured telescopes with a wide-enough baseline, to get the job done.
And nitpick: there's really no "st
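The "HUGE telescope" point can be checked with the Rayleigh diffraction criterion, theta = 1.22 * wavelength / aperture. The sketch below is a back-of-the-envelope estimate only: the 550 nm wavelength, the roughly 540 km Hubble altitude, and the 5 mm detail assumed necessary to see lip shapes are all assumptions, and atmospheric blurring is ignored entirely.

```python
# Assumed constants (not from the thread): green light, approximate altitudes.
WAVELENGTH_M = 550e-9
HUBBLE_APERTURE_M = 2.4
HUBBLE_ALTITUDE_M = 540_000
GEO_ALTITUDE_M = 35_786_000
LIP_DETAIL_M = 0.005  # assumed detail needed to read lips


def ground_resolution(aperture_m, distance_m, wavelength_m=WAVELENGTH_M):
    """Smallest resolvable feature at `distance_m` under the Rayleigh criterion."""
    return 1.22 * wavelength_m / aperture_m * distance_m


def required_aperture(resolution_m, distance_m, wavelength_m=WAVELENGTH_M):
    """Aperture needed to resolve `resolution_m` from `distance_m`."""
    return 1.22 * wavelength_m * distance_m / resolution_m


hubble_from_leo = ground_resolution(HUBBLE_APERTURE_M, HUBBLE_ALTITUDE_M)
hubble_from_geo = ground_resolution(HUBBLE_APERTURE_M, GEO_ALTITUDE_M)
geo_aperture = required_aperture(LIP_DETAIL_M, GEO_ALTITUDE_M)

print(f"Hubble-size mirror from LEO: {hubble_from_leo:.2f} m")
print(f"Hubble-size mirror from GEO: {hubble_from_geo:.1f} m")
print(f"Aperture for 5 mm detail from GEO: {geo_aperture / 1000:.1f} km")
```

Under these assumptions the ideal limit for a Hubble-size mirror in low earth orbit comes out at a few tens of centimetres at best, the same order of magnitude as the figure in the comment, while reading lips from GEO would need an aperture (or interferometer baseline) measured in kilometres.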
Re: (Score:2)
Re: (Score:2)
Or Mig Jagger(sp?)
Why should it NOT exist? (Score:1)
Turning the question around, why should it NOT exist or be looked into? At the very least it's an academic curiosity. If privacy is a concern, there's a very easy way to break the algorithm - talk whilst covering your mouth, which people have been doing whilst whispering to others for a long time. Ventriloquists would probably defeat it easily as well.
Capture: Lunatic
Comment removed (Score:5, Insightful)
Re: (Score:1, Flamebait)
we are morally obligated to develop this technology before the bad guys get it and use it against us.
Re: Why should it NOT exist? (Score:4, Insightful)
Governments and corporations are fictional persons. They have no "moral consciousness" of any kind, outside of rhetorical and ideological fantasy.
So, this will not be a question of moral or immoral use. It will be amoral, in the hands of those who have advanced themselves through manipulation of the aforementioned ideological rhetoric.
You continue to believe that there is hope for this modern, post-industrial society. But there is none. We as people have increased the sophistication of our tools and our reach - just as relentlessly as we have avoided the refinement of our own beings.
In the end you don't get Star Trek. You don't even get Starship Troopers. You get A Scanner Darkly. And hope there is VALIS.
Re: (Score:2)
related dilemma: should we develop algorithms that can lip read? Of course we should, we should develop any tech. The real question is, will it be used for moral or immoral purposes?
Certain technology can be declared illegal. Like guns in certain countries. Radar detectors in some US states. Blue lights on non-police cars in most US states. Mechanisms for counterfeiting printed money. Cloning of human embryos. Et cetera. It's perfectly plausible for a society to declare some particular technology illegal.
Heck, even certain knowledge is illegal for the general public to own, let alone internalize, like plans to make nuclear bombs.
Re: (Score:2)
Heck, even certain knowledge is illegal for the general public to own, let alone internalize, like plans to make nuclear bombs.
Designs for nuclear weapons are not too hard to find online. The hard part (thank God) is obtaining the materials to make one, such as enriched uranium, plutonium, deuterium and tritium.
That said, I agree it would be illegal for a member of the general public to possess classified documents of any kind, without authorization.
Re: (Score:2)
Think of the advantages for the deaf and hard of hearing (combined with a HUD). That alone tells me we should develop it. NSA are gonna NSA. Terrorists are going to terrorize. This type of technology has the potential to change countless lives, and for that reason alone we should.
Re: (Score:3)
Grow a big moustache.
Pfft (Score:4, Insightful)
Like moral issues have ever stopped anyone. :(
Combined (Score:2)
The most obvious approach is to combine the 2 methods - much like humans do, especially in noisy environments. It might improve the accuracy of current speech recognition which is, to be honest, still sub-standard.
Speech recognition as is now is way too limited. Sure, Siri and the likes may work. And some computerized phone systems use it to nag us instead of using reliable button clicking. But it is still far from transcribing an accurate memo. Let alone automated subtitling or other fancy applications.
So
Re: (Score:2)
Err...
Yes - and if you actually read the article you linked, it's saying that if you edit the sound to be different than video, then you get effects that differ from the sound when listened to.
In real life - when the sound and video are not intentionally disturbed - it helps.
Re:Combined (Score:4, Insightful)
The most obvious approach is to combine the 2 methods - much like humans do, especially in noisy environments.
Right. Especially since, when you're looking at your smartphone, it's looking back at you.
This would be valuable for vehicle driver speech input, which has to reject a lot of noise.
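One common way to combine the two methods is late fusion: run an audio recognizer and a visual recognizer separately, then merge their per-word probabilities. The sketch below is a hypothetical illustration, not any particular product's method; the probabilities and the `audio_weight` parameter are invented, and the weight would normally be lowered as background noise rises.

```python
import math


def fuse(audio_probs, visual_probs, audio_weight):
    """Log-linear late fusion of two per-word probability tables.

    All probabilities must be > 0. `audio_weight` is in [0, 1]; the visual
    stream gets the remaining weight.
    """
    words = audio_probs.keys() & visual_probs.keys()
    scores = {
        w: audio_weight * math.log(audio_probs[w])
        + (1.0 - audio_weight) * math.log(visual_probs[w])
        for w in words
    }
    # Normalize back to probabilities (subtract the max for numerical stability).
    top = max(scores.values())
    unnormalized = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(unnormalized.values())
    return {w: v / total for w, v in unnormalized.items()}


# Hypothetical recognizer outputs for one spoken digit in a noisy car.
audio = {"nine": 0.45, "five": 0.50, "ten": 0.05}   # noise blurs nine vs. five
visual = {"nine": 0.48, "five": 0.04, "ten": 0.48}  # lips blur nine vs. ten

fused = fuse(audio, visual, audio_weight=0.5)
best = max(fused, key=fused.get)
print(best, {w: round(p, 3) for w, p in fused.items()})
```

In this made-up case neither stream alone is confident about the right word, but each one rules out the other's main confusion, so the fused result picks "nine".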
There's already a textbook (Score:2)
The most obvious approach is to combine the 2 methods - much like humans do, especially in noisy environments.
Obvious, indeed. There's already a textbook [sciencedirect.com] for the subject, Multimodal Signal Processing [elsevier.com]...available for free online, no less.
This is exactly the sort of system you'd want on a flight deck, to supplement the accuracy of speech-recognition in the presence of noise, especially intermittent noise such as turbulence. It can also help with speaker identification.
As for the hopelessly naive idea that "society" should be able to choose whether this sort of thing should exist...the textbook came out in 2009.
HAL did it. (Score:1)
It will happen, it's just a matter of getting the tech correct.
It's already been decided.... (Score:3)
Beyond the computational aspect, we also need to decide, as a society, if this is a technology that should exist. The privacy implications extend beyond that of simple voice recognition.
How much do they extend beyond that of so called "simple" voice recognition? I suppose one could rarely listen in when they couldn't have with current amplifying audio equipment. As a society, we've already decided that it should exist: "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness."
Can this be used as a weapon? Yes, so can a hammer. Ban hitting people with hammers, not the hammer.
Re: (Score:2)
The problem in the United States is that corporations are legally people. The EU will clamp down on this hard, not allowing corporations to monitor any conversation in range for advertising purposes. Individuals will benefit (I'd love to be able to whisper silently to my phone instead of having to say "OK Google" out loud) but business use will be heavily regulated. New rules already allow for fines of up to 50% of global revenue for privacy violations.
In the US it will be a conditional issue and corporate l
Re: (Score:2)
In the US it will be a conditional issue and corporate lawyers/lobbyists will win. People won't speak in public for fear for the adverts they might trigger.
I doubt they will "win" like you suppose. They are too smart for that. Perhaps they should, and people may start to push back...
Hasn't someone already done this? (Score:2)
I seem to recall that this was done previously but the conditions had to be good (e.g. sitting facing the camera with good lighting.)
Easier than you think. (Score:3)
Lip reading is a lot easier than the original poster thinks. There is a lot more data available, especially within context.
Re: (Score:3)
Already being done... (Score:2)
Challenge (Score:2)
It's certainly a worthy area of computational linguistic research. But the reason for that is that it's a very hard problem. Automated language processing, with very smart people and very motivated spy agencies working very hard at it, has taken 60 years to get to a point not quite at the level of high school language speakers.
The privacy concerns are irrelevant. The deaf will demand this, and as long as there are weak-willed politicians and judges more interested in making political statements than disp
Moral conundrum? I don't think so.. (Score:1)
You sure that the NSA hasn't got it already? (Score:2)
It's going to be done anyways (Score:1)
You can bet your $THINGOFVALUE here that the CIA and similar organizations are already researching this if they don't have it already.
Like handwriting recognition this will be full of examples of "bad output" in the early days and there will always be cases where lack of context and/or deliberate obfuscation by the speaker makes this unreliable.
Let's just assume that this will be as reliable 5 or 10 years from now as automated face recognition is today and within 20 years both will be very reliable. What d
Challenging sounds to discern visually (Score:2)
"society" doesn't get to decide (Score:2)
Sorry to break it to you, but society not only doesn't "need" to make this decision, it has no right to make this decision. You don't get to decide what other people invent, and for the most part not even what it is used for.
George Carlin (Score:2)
Here’s a good example of practical humor, but you have to be in the right place. When a local television reporter is doing one of those on-the-street reports at the scene of a news story, usually you’ll see some onlookers in the background of the shot, waving and trying to be seen on television. Go over and stand with them but don’t wave. Just stand perfectly still and, without attracting attention, move your lips, forming the words, “I hope all you stupid fu
Yes, develop for the disabled (Score:2)
I can see how this would be great for deaf people, using something like Google Glass to get subtitles of convos around them. How about making something for people who can't speak, but can form the words with their mouth? It might need something like a mic, but with video/lasers for reading the facial movements, that outputs to a speaker.
Sure it will get used for bad, but that is going to happen regardless anyways. So how about we do some good with it and help out the disabled people with some nice t
Augmented sensing (Score:2)
Could augment by adding other sensors such as microwave, laser or terahertz imaging, to detect signals being generated by tongue and vocal cords, or even to directly image the organs themselves.
Also it seems possible that since the whole head vibrates, reflections or motions of the eyes, nose, lips and forehead might provide vibratory cues.
There is no "Should we?" involved. We will. (Score:1)