RockBox + Refurbished MP3 Players = Crowdsourced Audio Capture

An anonymous reader writes "Looking for an inexpensive means to capture audio from a dynamically moving crowd, I sampled many MP3 players' recording capabilities. Ultimately the best bang for the buck was refurbished SanDisk Sansa Clip+ devices ($26/ea) loaded with the (open source) RockBox firmware. The most massively multi-track event was a thorium conference in Chicago where many attendees wore a Clip+. Volunteers worked the room with cameras, and audio capture was decoupled from video capture. It looked like this. Despite having a (higher quality) ZOOM H1n and wireless mics, I've continued to use the RockBox-ified Clip+ devices ... even if the H1n is running, the Clip+ serves as backup. There's no worry about interference or staying within wireless mic range. The devices have 4 GB capacity, and RockBox allows WAV capture. They'll run at least 5 hours before the battery is depleted (with lots of storage left over). I would suggest sticking with 44.1 kHz (mono) capture, as 48 kHz is unreliable. To get an idea of their sound quality, here is a 10-person dinner conversation (about thorium molten salt nuclear reactors) in a very busy restaurant. I don't know how else I could have isolated everyone's dialog for so little money. (And I would NOT recommend the Clip+ with factory firmware... it only supports 22 kHz, and the levels run too hot when the player is clipped to someone's collar.)" This video incorporating much of that captured audio is worth watching for its content as well as for the interesting repurposing.
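The "lots of storage left over" claim checks out with quick arithmetic, assuming RockBox writes 16-bit PCM mono WAV at 44.1 kHz and treating 4 GB as a decimal 4×10⁹ bytes with no allowance for firmware or filesystem overhead:

```python
# Rough capacity check: hours of 16-bit mono 44.1 kHz WAV on a 4 GB Clip+
# (assumes decimal gigabytes; ignores firmware/FAT overhead)
fs = 44_100           # samples per second, mono
bytes_per_sample = 2  # 16-bit PCM
capacity = 4e9        # ~4 GB in bytes

hours = capacity / (fs * bytes_per_sample) / 3600
print(round(hours, 1))  # ~12.6 hours of audio -- the ~5-hour battery dies first
```

So the battery, not the 4 GB card, is the limiting factor, consistent with the submitter's experience.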
This discussion has been archived. No new comments can be posted.

  • tl;dr: Recipe for recording the audio of multiple individuals in a large crowd.


    Sandisk Sansa Clip+ MP3 Player - []
    Rockbox - []


    Install Rockbox (open source firmware for MP3 players) on the Sansa Clip+. Configure it to record from the built-in microphone in .wav format. Give a Sansa Clip+ to every person whose audio you want to record. Have everyone start recording at roughly the same time, then leave the devices running for up to 5 hours.

    Gather all the Sansa Clip+s at the end of the session and extract the .wav files. 10 participants = a 10-track-equivalent audio recording of the session.

    Mix and fade between the tracks to isolate the audio of single conversations between participants.
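    The mix-and-fade step above can be sketched with plain NumPy (a hypothetical helper, not anything from the article; in practice you'd use a DAW, and the tracks must already be time-aligned and normalized to floats in [-1, 1]):

    ```python
    import numpy as np

    def mix_tracks(tracks, gains):
        """Mix per-speaker tracks using time-varying gain envelopes.
        tracks, gains: equal-length 1-D float arrays per speaker; a gain
        of 1.0 keeps a speaker, 0.0 mutes them, and a ramp cross-fades."""
        out = np.zeros_like(tracks[0])
        for track, gain in zip(tracks, gains):
            out += track * gain
        return np.clip(out, -1.0, 1.0)  # guard against clipping in the sum

    # Cross-fade from speaker A to speaker B over one second at 44.1 kHz
    n = 44_100
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = np.linspace(0.0, 1.0, n)
    ```

    Because the gain envelopes sum to 1.0 throughout the cross-fade, the overall level stays steady while the "focus" moves from one collar mic to another.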

    He has basically created a relatively inexpensive and reliable way to get this audio, much as using multiple GoPro cameras to record action at sports events beats out professional equipment (and in some ways has become professional equipment). He's arguing that the Sansa Clip+, together with the Rockbox open source firmware, is a better solution than using professional radio mics and having recording equipment receive those signals and store them on disk for later editing.

    I've no idea how "crowdsourced" fits into this, though, nor how this is anything more than an advert, even if the solution is a little interesting. It's useful and potentially cheap enough that you might imagine giving one of these to everyone at a TED conference, as the conversations caught off the record might be even more valuable than the sessions.

  • Re:Lots of work? (Score:3, Informative)

    by Anonymous Coward on Monday October 01, 2012 @04:31AM (#41511267)
    I hate to post links to commercial products in a technical discussion, but 3D capture of sound (as in "you can focus in real time on any point of a room and listen to whatever happens there") already exists: []

    See also "microphone arrays" on google. Plenty of research in the past decades and for the coming ones. []
  • Re:Lots of work? (Score:5, Informative)

    by bertok ( 226922 ) on Monday October 01, 2012 @05:32AM (#41511465)

    I've seen this MIT project [] before, but just like that product you linked, they all seem to be about "regular" arrays or arrangements.

    I'm thinking more along the lines of ad-hoc arrangements of microphones, which is more like what Photosynth does -- it arranges arbitrary photos together to make a 3D scene, instead of taking specific, precisely aligned photos.

    One interesting bit about the MIT project is that they have 1,020 microphones -- a world record -- generating 50MB/sec of data. A quick back-of-the-envelope calculation confirms that this roughly corresponds to 44.1 kHz at 8 bits per sample. If you think about it, this amount of data is peanuts to a modern PC. Just one high-end GPU might have 200GB/sec of memory bandwidth and over 2 teraflops of processing power! That translates to roughly 44,000 operations per sound sample, in real time, at 32-bit precision. That should be enough to track moving sound sources, figure out what's an echo and what isn't, correlate sounds across multiple microphones, perform Doppler-shift analysis, etc...
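    Redoing that back-of-the-envelope in code (the 2 TFLOP/s GPU figure is the comment's own assumption) lands in the same ballpark as the numbers quoted above:

    ```python
    # Sanity-check the MIT microphone-array figures quoted above
    n_mics = 1020
    fs = 44_100            # Hz
    bytes_per_sample = 1   # 8 bits per sample
    gpu_flops = 2e12       # ~2 TFLOP/s, per the comment's assumption

    data_rate = n_mics * fs * bytes_per_sample  # bytes per second
    ops_per_sample = gpu_flops / (n_mics * fs)  # FLOPs per incoming sample

    print(data_rate / 1e6)        # ~45 MB/s, close to the quoted 50 MB/sec
    print(round(ops_per_sample))  # ~44,000 operations per sample
    ```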

    Going to higher numbers of microphones ought to be easy, and could allow some fantastic applications, as well as some scary ones. There would be enough redundancy in the data to build a 3D scene with tracking of both moving sound sources and moving microphones. It may even be possible to determine room geometry, and the movement of large objects could be tracked based on their interaction with the sound field.

    One application I can think of would be capturing sound during movie filming. Often, studios have to discard the recorded sound and re-dub everything because of background noises, but this kind of technology would allow the director to perform arbitrary filtering after the fact, comparable to the light-field cameras that allow "refocusing" after an image has been captured. An actor's voice could be picked out and made louder, everything with a source "behind the camera" could be edited out, and surround-sound effects could be generated from any scene setup.

  • Re:Lots of work? (Score:3, Informative)

    by gordm ( 562752 ) <> on Monday October 01, 2012 @10:47AM (#41513263) Homepage
    I've used PluralEyes, but I find it's not much harder to sync manually. Make three loud clapping sounds once all the recorders are running, then manually sync to that in the timeline. Most of the audio is useless for syncing because each device captures a very different perspective (over 5 hours), in contrast with the 3 seconds where identical claps can be heard on every track. Ideally the devices are all activated and running (then you clap 3x) before the event starts, and deployed as needed, as opposed to starting them as they're clipped to collars.
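    When manual alignment gets tedious, the triple clap can also be found automatically by cross-correlating each track against one reference track over a window containing the claps (a sketch with a hypothetical helper, assuming tracks are already loaded as 1-D float arrays at a common sample rate):

    ```python
    import numpy as np

    def clap_offset(ref, other, fs):
        """Seconds by which `other` starts later than `ref`, estimated
        from the cross-correlation peak (positive = other lags ref)."""
        corr = np.correlate(other - other.mean(), ref - ref.mean(), mode="full")
        lag = int(corr.argmax()) - (len(ref) - 1)
        return lag / fs
    ```

    Slice each file down to a minute or so around the claps before correlating; running this over full 5-hour tracks would be slow, and the sharp, loud claps are exactly what makes the correlation peak unambiguous.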