Google Works Out a Fascinating, Slightly Scary Way For AI To Isolate Voices In a Crowd (arstechnica.com) 45
An anonymous reader quotes a report from Ars Technica: Google researchers have developed a deep-learning system designed to help computers better identify and isolate individual voices within a noisy environment. As noted in a post on the company's Google Research Blog this week, a team within the tech giant attempted to replicate the cocktail party effect, or the human brain's ability to focus on one source of audio while filtering out others -- just as you would while talking to a friend at a party. Google's method uses an audio-visual model, so it is primarily focused on isolating voices in videos. The company posted a number of YouTube videos showing the tech in action.
The company says this tech works on videos with a single audio track and can isolate voices in a video algorithmically, depending on who's talking, or by having a user manually select the face of the person whose voice they want to hear. Google says the visual component here is key, as the tech watches for when a person's mouth is moving to better identify which voices to focus on at a given point and to create more accurate individual speech tracks for the length of a video. According to the blog post, the researchers developed this model by gathering 100,000 videos of "lectures and talks" on YouTube, extracting nearly 2,000 hours worth of segments from those videos featuring unobstructed speech, then mixing that audio to create a "synthetic cocktail party" with artificial background noise added. Google then trained the tech to split that mixed audio by reading the "face thumbnails" of people speaking in each video frame and a spectrogram of that video's soundtrack. The system is able to sort out which audio source belongs to which face at a given time and create separate speech tracks for each speaker. Whew.
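The "synthetic cocktail party" step described above can be sketched in a few lines. This is only a toy illustration under loud assumptions, not Google's pipeline: pure sine tones stand in for clean speech clips, uniform random values stand in for background noise, and the names `tone`, `mix`, and `spectrogram` are hypothetical helpers invented for this example. It shows the general idea of summing clean sources into one track and turning that track into a spectrogram, which is the kind of audio input the summary says the model reads alongside the face thumbnails.

```python
import cmath
import math
import random

SAMPLE_RATE = 8000  # Hz; an assumed value for this toy example


def tone(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Hypothetical stand-in for a clean speech clip: a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def mix(sources, noise_level=0.05, seed=0):
    """Sum clean sources sample by sample and add uniform background noise."""
    rng = random.Random(seed)
    n = len(sources[0])
    return [sum(s[t] for s in sources) + rng.uniform(-noise_level, noise_level)
            for t in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: Hann-windowed frames, naive DFT per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        # Keep only the non-negative frequency bins (0 .. frame_len / 2).
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                      for i in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Two "speakers" at different pitches, mixed into a single noisy track.
speaker_a = tone(500, 512)
speaker_b = tone(1500, 512)
mixture = mix([speaker_a, speaker_b])
spec = spectrogram(mixture)

# Each frequency bin is SAMPLE_RATE / frame_len = 125 Hz wide, so the two
# tones land in bins 4 and 12 of every frame.
strongest = sorted(range(len(spec[0])), key=lambda k: spec[0][k], reverse=True)[:2]
print(sorted(strongest))
```

Because the clean sources are known before mixing, a training setup of this kind always has the correct separated tracks available to compare against, which is presumably the point of building the mixtures synthetically.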
Dueling pundits? (Score:3)
Might be useful for sorting out what political pundits are saying when they try to overspeak each other.
Re: Dueling pundits? (Score:3)
I think they should actually duel.
Re: (Score:2)
sorting out what political pundits
That's a terrific idea! I can just mute the voices of people I disagree with! One more brick on my echo chamber.
Voice prints for ads (Score:2)
Get a mic and webcam placed in a person's home as a must-have, trendy, free network service.
Track who exactly was talking about a dog or cat in a conversation at a friend's home.
All their friends are now on file as separate people to track and create ads for.
The pet food brands now have the resulting digital product lists sold to them.
The resulting ads start and the reaction of the users is tracked.
Re: (Score:2)
And the award for the most insightful, succinct post of the day goes to AC. (what a waste)
Re: (Score:2)
He has a very good point. How long until we live in the United States of Google? Or the European Google, or Siberian Google? They are trying to take over and brainwash the world. It's like the Futurama Brain Slug episode.
Re: Wow (Score:2)
It would also be incredibly useful for transcribing videos for the visually impaired, or for indexing the text.
Re: Wow (Score:2)
These days that's not a bug, that's a feature. We're done with the digital age. We're into the surveillance age. At least for Winston, he knew the telescreen would likely not be watching everyone all the time. Nowadays, all the screens are always watching... and listening... to everyone... all the time.
Simplest way which still involves AI (Score:3)
1) Employ a robot.
2) Instruct the robot to kill the people in the room, one by one, until the target voice is no longer heard.
Re: (Score:2)
2) Instruct the robot to kill the people in the room, one by one, until the target voice is no longer heard.
Dunno -- they might be mute with fear. And they might have been originally using an accent to fool you. Better kill 'em all to make sure you got the right one.
Re: (Score:2)
Good call - I hadn’t thought of that.
So now all those "security" cams (Score:2)
in public spaces will be able to lip-read conversations.
What harm can come from that?
Found the speaker (Score:2)
It's Charlie McCarthy.
Dinosaurs will appreciate it (Score:4, Interesting)
Keep in mind that it's still in the alpha stage (Score:3)
So far it's only able to isolate Fran Drescher's voice in a crowd of Amish people. But they're improving it every day.
Re: (Score:2)
Once it reaches "beta," it will stay there for the next 10 years!
Re: Keep in mind that it's still in the alpha stag (Score:2)
The future has arrived - and it's totalitarian. Hurrah!
Independent Component Analysis (Score:2)
Old news with at least two audio tracks and no video clues.
http://cnl.salk.edu/~tewon/Bli... [salk.edu]
Single-channel separation of multiple sources
https://youtu.be/LuBer-0WmpQ [youtu.be]
Re: (Score:2)
Not just for Muslims anymore (Score:1)
I can see niqabs becoming a useful garment not just for Muslim women, but also for anyone of either gender who doesn't want Google or our spying corporate overlords to see.
Stenography (not steganography) (Score:4, Interesting)
Boo Boooo! (Score:2)
Re: (Score:2)
Why is this scary? Would a machine that could add one thousand numbers in one second be scary to someone in 1965?
When I first read the headline, my money was on "Google works out a fascinating, slightly scary way for AI to isolate voices in a crowd BY KILLING ALL THE OTHERS".
Disappointingly, that wasn't the explanation, and I feel comedic potential has once again been wasted!
The elusive Cocktail Party Effect (Score:1)
I remember clearly the first time I read about the "Cocktail Party Effect", thinking "oh, that must be the headache I get when I go to cocktail parties and try to talk to people, the intense feeling of frustration that makes me hate going to parties."
Imagine my shock when I found out that other people claim to be able to track a conversation partner's voice and understand what they're saying, even in an environment filled with the voices of other people! I refused to believe it. But then, talking to friends
Re: (Score:2)
Exact same thing here.
Re: (Score:2)
beyond panoptic (Score:2)
Big Brother Google is always watching.
Big Brother Google is always listening.
Useful for hearing-aid users (Score:2)
Can't help thinking about .. (Score:2)
Didn't end well for them.
Won't end well for us.