Storing CERN's Search for God (Particles) 154

Posted by Zonk
from the she-is-in-the-details-or-so-i'm-told dept.
Chris Lindquist writes "Think your storage headaches are big? When it goes live in 2008, CERN's ALICE experiment will use 500 optical fiber links to feed particle collision data to hundreds of PCs at a rate of 1GB/second, every second, for a month. 'During this one month, we need a huge disk buffer,' says Pierre Vande Vyvre, CERN's project leader for data acquisition. One might call that an understatement. The story has more details about the project and the SAN tasked with catching the flood of data."
This discussion has been archived. No new comments can be posted.

  • News for Nerds! (Score:5, Insightful)

    by KlomDark (6370) on Saturday July 21, 2007 @12:23AM (#19935573) Homepage Journal
    Wow! Actually geeky science news, not enough of that here lately!
    • by xxxJonBoyxxx (565205) on Saturday July 21, 2007 @12:34AM (#19935629) Actually, it's a product placement PR piece about Quantum's StorNext. (Read page 2...)
      • by Anonymous Coward on Saturday July 21, 2007 @01:55AM (#19935905) Actually, it's a product placement PR piece about Quantum's StorNext. (Read page 2...)
        We knew there were some serious nerds on Slashdot, but to be potential customers for the same RAID system as CERN, whoa! :)
      • by pipingguy (566974) *
        According to a guy that I met yesterday on the street (he was talking to himself or somebody) the only way I could meet God (and hopefully His particles) was through his son. WTF? Can't even *God* get a good secretary these days?
      • by Midnight Warrior (32619) on Saturday July 21, 2007 @02:16PM (#19939463) Homepage

        You may think of it as product placement, but I use it. I even provide the occasional blog entry on it on Advanced Topics. I sat through a RedHat performance tuning class that was quite excellent. But when they came to the part about ext3 and tuning it, well, let's face it - ext3 just isn't going to scale. I started with Veritas' Filesystem which is pretty nice. If you're a small-time admin, then you never get beyond a local, 4U disk array. Once your group spends more than US$2million on servers though, it's obvious what the problem is: Storage - The Final Frontier. SAN and clustered filesystems allow a level of scalability completely unheard of before.

        They also completely left out anything but a tagline of their multi-tiered solution. I wish they'd talked more about how CERN supports 500Gbit per second aggregate throughput to their disks (at least they implied that). 50GB/sec (or so) is probably the toughest I/O problem you've ever dealt with, or will deal with for a long time. Whose RAID controllers did they use? Did they focus on speed (ASIC and ISL minimization), availability (redundant fabrics), or both? Did each node get dual 4Gb links or just one?

        If this had been an advertisement, they would have discussed some 3.0 features like LAN clients.

        So, in short, it's easy to say it sounds like an advertisement. Quite possibly, Quantum (formerly ADIC) coerced them into getting the piece written. But if this had been an advertisement, there is so much more that is going on under the hood that would have been said. Large, fast, distributed filesystems are non-trivial and take an extreme amount of engineering and testing. StorNext really is good at what they claim to do.

        If you want to read about some of the drawbacks though, I yak about them on my blog. Sorry for the plug.

        • by jesboat (64736)
          50 GB/s = 50*8 Gbit/s = 400 Gbit/s.

          400 Gbit/s is nowhere near an order of magnitude away from 500 Gbit/s, which is what your choice of units implies.
    • Re:News for Nerds! (Score:4, Interesting)

      by zeugma-amp (139862) on Saturday July 21, 2007 @12:36AM (#19935641) Homepage

      Interesting article.

      Many years ago, when the SSC (Superconducting Super Collider) was still being built in Texas, I went to an HP users group meeting, as I was working primarily with HP-3000 systems at the time. The fellow addressing the meeting was the head of the physics department at the SSC. It was a really neat presentation in which he described a similar, though orders of magnitude smaller, data storage requirement; he was talking terabytes of data per month, IIRC. At the time, they were planning on using two arrays of 40 workstation computers to handle the load. This would have been a fairly early loosely coupled setup, similar to a Beowulf cluster.

      After the presentation I went up to him and told him that all I wanted to do is sell him mag-tapes.

      These types of experiments evidently produce tons of data. I wonder if the processing could be parcelled out like Stanford's Folding@Home or SETI to speed up data correlations.

      • Re: (Score:2, Interesting)

        by Anonymous Coward
        It is! Sorta, at least... On my experiment (CMS), data gets a first pass handling on site at CERN, then gets parceled out to about 7 other sites (of which Fermilab is one) where their section of data gets another look. Each Tier 1 station, as it's called, also services requests from affiliated research institutions, both to get reconstructed data, and also to run and store their simulated data.

        It's a really neat system that makes the geek in me happy =)
      • Re: (Score:2, Interesting)

        by 32Na (894547)
        The folks at CERN maintain a set of libraries for analyzing nuclear and high-energy physics data sets, known as 'root'. These also include the Parallel ROOT Processing Facility, or PROOF. I'm guessing that PROOF will play an important role in the analysis of this experiment once it comes online.
    • Re: (Score:3, Insightful)

      by Anonymous Coward
      ive often wondered if i could sneak into cern and just look around. i think the only two things you would need to do it would be a white lab coat and a really grizzled look on your face.

      i remember when i was under 18 i used to go to alot of places i wasnt allowed in just to check things out. i wasnt a malicious kid that would run around breaking things for fun, i just loved seeing various things that most people never see or think about, especially feats of engineering.

      when i turned 18 i looked back and was
    • Re: (Score:2, Interesting)

      by FractalZone (950570)
      I'm not so sure about the "huge disk buffer". Smaller disks can be spun faster and tend to have lower latency. I'd like to see the drum drive make a comeback for disk cache...expensive, but fast!
    • Re: (Score:2, Funny)

      by uolamer (957159) * /~~--allah-was-here--~~/0day/
  • by deopmix (965178)
    I don't exactly think that CERN is going to be purchasing thousands of Dell PCs to analyze the data that they collect. Maybe they are talking about a distributed computing project?
    • Re: (Score:1, Redundant)

      by SnoopJeDi (859765)
      From TFA:

      The ALICE experiment grabs its data from 500 optical fiber links and feeds data about the collisions to 200 PCs, which start to piece the many snippets of data together into a more coherent picture. Next, the data travels to another 50 PCs that do more work putting the picture together, then record the data to disk near the experiment site, which is about 10 miles away from the data center.
    • Re: (Score:2, Funny)

      by Anonymous Coward
      Actually their plan is to store all that data on Commodore 64 cassette tapes.
    • Re: (Score:2, Informative)

      by Falstius (963333)
      Actually, there really is a gigantic room at CERN full of commodity PCs that form the first level of computing for the different experiments. The data is then shipped off to sites around the world for further processing. There is a combination of 'locally' distributed computing and world-wide grid being used.
  • If Only... (Score:4, Funny)

    by i_ate_god (899684) on Saturday July 21, 2007 @12:32AM (#19935613) Homepage
    If only I could get porn that fast

    there I said it, let's move on now.
  • I think I just creamed myself. The hardware needed to push that much data must be insane!
    • Re: (Score:3, Interesting)

      by dosguru (218210)
      A standard dual CPU dual core HP server with Windows can keep a 4Gb FC pretty full if set up correctly. I work for a large bank, and we have many a Solaris box that can keep 4 or even 8 2Gb FC cards full into our FC and SATA disk arrays. Not to trivialize the extreme coolness of what they are doing at all, but a PB of data with a few PB of I/O in a day isn't what it used to be. I'm just glad to see they don't use Polyserve, it is worthless for clustering and has caused more downtime at work than it has
  • 2,629,743 seconds in a month, so... 2,629,743 GB or 328,717 GB?

      It's too late to do math.
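The two candidate totals above differ only in whether "GB" is read as gigabytes or gigabits. A minimal Python sketch of that arithmetic follows; the function name `month_volume_gb` is hypothetical, and the month length is simply the figure quoted in the comment.

```python
# Average month length in seconds, taken from the comment above.
SECONDS_PER_MONTH = 2_629_743


def month_volume_gb(rate: float, rate_is_gigabits: bool) -> float:
    """Gigabytes accumulated over one month at `rate` per second.

    `rate` is read as gigabytes/s, or as gigabits/s when
    `rate_is_gigabits` is True (8 bits per byte).
    """
    total = rate * SECONDS_PER_MONTH
    return total / 8 if rate_is_gigabits else total


print(month_volume_gb(1, rate_is_gigabits=False))  # "GB" read as gigabytes
print(month_volume_gb(1, rate_is_gigabits=True))   # "GB" read as gigabits
```

Reading the rate as gigabytes gives the larger total; reading it as gigabits divides that total by eight.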
    • Re: (Score:2, Informative)

      by snowraver1 (1052510)
      2.6 Petabytes. The article says that they will be collecting petabytes of data. Also, the article clearly said GB. GB = gigabyte; Gb = gigabit. The thing that I thought was "Wow that's A LOT of blinking lights!" Sweet!
      • Yeah that's why I was confused they had a big B but if you're talking network speed it's usually described in Gigabits, small b.

        "In total, the four experiments will generate petabytes of data."

          Divide at least 1 PB by four and you get 256 TB, I was close with 328 TB, so it must be Gigabits.
    • by this great guy (922511) on Saturday July 21, 2007 @02:46AM (#19936089)

      Assuming a non-RAID 3x-replication tech solution (what Google does in its datacenters), using 500-GB disks (best $/GB ratio), they would need about 16 thousand disks:

      .001 (TB/sec) * 3600*24*30 (sec/month) * 3 (copies) * 2 (disk/TB) = 15552 disks

      Which would cost about $1.7M (disks alone):

      15552 (disk) * 110 ($/disk) = $1710720

      Packed in high-density chassis (48 disks in 4U, or 12 disks per rack unit), they could store this amount of data in about 30 racks:

      15552 (disk) / 12 (disk/rack unit) / 42 (rack unit/rack) = 30.9 racks

      Now for various reasons (vendors influence, inexperienced consultants, my experience in the IT world in general, etc), I have a feeling they are going to end up with a solution unnecessarily complex, much more expensive, and hard to maintain and expand... Damn, I would love to be this project leader !
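The disk, cost, and rack estimate in the comment above can be reproduced with a short Python sketch. The function name `replicated_disk_estimate` and its parameter names are hypothetical; the default values are the ones the commenter states (1 GB/s, 30 days, 3 copies, 500 GB disks at $110, 12 disks per rack unit, 42 units per rack), with 1 TB taken as 1000 GB.

```python
def replicated_disk_estimate(rate_gb_per_s=1, days=30, copies=3,
                             disk_size_gb=500, price_per_disk=110,
                             disks_per_rack_unit=12, rack_units_per_rack=42):
    """Return (disks, cost in dollars, racks) for one month of data."""
    seconds = 3600 * 24 * days
    # Every byte is stored `copies` times on disks of `disk_size_gb`.
    disks = rate_gb_per_s * seconds * copies / disk_size_gb
    cost = disks * price_per_disk
    racks = disks / disks_per_rack_unit / rack_units_per_rack
    return round(disks), round(cost), round(racks, 1)


print(replicated_disk_estimate())
```

Changing `copies` or `disk_size_gb` shows how sensitive the rack count is to the replication factor and the drive size chosen.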

      • by bockelboy (824282)
        So, 30 racks per month ... for a 15 year project. Say you only buy the first 5 years worth of disks - a simple 1800 racks.

        The LHC went with a tape-based, distributed storage system. Seven T1 sites around the world keep the data on tape (one copy at CERN, another copy at a T1 site). They do reconstruction of the raw data, and write the reconstructed data on disk. They then distribute the reco data to a T2 site, which has a large amount of disk-only space (like you suggest). The individual physicist does
    • Re: (Score:2, Funny)

      by Anonymous Coward
      "2,629,743 seconds in a month, so... 2,629,743 GB or 328,717 GB?"

      If they were smart, they'd choose February. They could save ~172800 seconds and therefore some disk space!
  • based on 1GB/sec * ((3600 * 24) * 31) means over 2.5 Petabytes.
    Something like 3000 of the current 1TB drives.
    How long until Exabyte level storage is required for some project or another?
    • Re: (Score:2, Interesting)

      Estimates are that the four LHC experiments will produce about 15 PetaByte/year. The LHC will be online for about 15 years (maybe more). All data is kept permanently. This means that there is a fail-safe copy stored at CERN on tape, which is a big task to perform consistently. But that data is not worked on there, it is spread through the huge tubes of the academic fibers to big data centers around the world. All that online copy is replicated and is stored at two geographical locations. At each location most
  • FTL (Score:3, Funny)

    by unchiujar (1030510) on Saturday July 21, 2007 @01:09AM (#19935771)
    "Due for operation in May 2008, the LHC is a 27-kilometer-long device designed to accelerate subatomic particles to ridiculous speeds, smash them into each other and then record the results."
    Next up ludicrous speed!!! Better fasten your seat belts...
    • Due for operation in May 2008, the LHC is a 27-kilometre-long device designed to accelerate subatomic particles to ridiculous speeds

      Actually it would be better to say "ridiculous energies" because the speed of the protons in the LHC will barely be any faster than those in the Tevatron...but the energy is seven times larger thanks to relativity.
    • Are there any practical applications of this research in technology? And what will this research tell us about the universe?
  • by Anonymous Coward on Saturday July 21, 2007 @01:11AM (#19935779)
    Hmm, let's see. ~2700 TB of data over one month. Let's store it on 500 GB drives. That's 5400 disk drives just to store the data. Add in the extra drives for parity, and a few hundred hot spares, this thing could easily use OVER NINE THOUSAND drives.
    • by noggin143 (870087) on Saturday July 21, 2007 @02:42AM (#19936065)
      We are expecting to record around 15PB / year during the LHC running. This data is stored onto magnetic tape with petabytes of disk cache to give reasonable performance. A grid of machines distributed worldwide analyses the data. More details are available on the CERN web site

      • You ought to have Google store that data for you. Seriously.

        Google has collaborated on other scientific projects before, and one in particular has many of the same needs as the LHC, the LSST. Of course, it doesn't hurt that one of the primary backers of the LSST is an ex-Google exec.

        I'm confident that Google is capable of dealing with large data stores, even those on a multi-PB scale, with reliability and redundancy.
    • Re: (Score:3, Insightful)

      How much is a 500GB drive worth nowadays? 150$? So your OVER NINE THOUSAND drives are worth about, hum....1.35M$. CERN has a budget of about 5B$. It's the speed at which data is coming that's a problem. Not the total amount of data.
      • by OzRoy (602691)
        $150 for a SATA disk maybe. These would be Fiber Channel whose price would easily be about $600 per disk
        • Just because the data goes through fiber doesn't mean the disks have to be FC disks. Actually that's usually a bad idea, as the physical operations of the HDD (moving heads around and such) limit the rate at which the disk can actually accept data to much less than FC speeds.

          I have a couple local RAID boxes here and I feed data into them through 4Gb fiber channel, but the box just consists of 16 run-of-the-mill SATA drives in a RAID5 config (yielding 14 data disks (plus 1 parity plus 1 hot spare) times 7

  • 1GB/s * 1 month = 1GB/s * 30 day/month * 24 hour/day * 3600s/hour = 2,592,000 GB.

    A big disk (Seagate ST3750640AS) is 750GB.

    2,592,000 GB / 750GB/disk = 3,456 disks.

    At AUD467 per disk this will cost AUD1,613,952 (plus computers+net). Even cheaper if you allow for the fact these are retail
    prices for wholesale quantities. Let's take the startup current of 2A@12V as the worst case power
    consumption and we end up with a maximum power of 83kW. That's less than 35 domestic heaters (2.4kW ea).

    Okay, it's not trivial s
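The cost and worst-case power figures in the comment above follow from a few multiplications, sketched here in Python. The function name `disk_farm` and its parameter names are hypothetical; the defaults are the commenter's numbers (750 GB disks at AUD467, 2 A at 12 V per drive at startup, 2.4 kW per heater).

```python
import math


def disk_farm(total_gb=2_592_000, disk_gb=750, price_aud=467,
              startup_amps=2.0, volts=12.0, heater_kw=2.4):
    """Return (disks, cost in AUD, peak power in kW, equivalent heaters)."""
    disks = math.ceil(total_gb / disk_gb)
    cost_aud = disks * price_aud
    # Startup current is used as the worst-case draw for every drive at once.
    power_kw = disks * startup_amps * volts / 1000
    heaters = power_kw / heater_kw
    return disks, cost_aud, power_kw, heaters


print(disk_farm())
```

The sketch covers drives only, as the comment does; servers, networking, and cooling would add to both the cost and the power budget.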
  • 'During this one month, we need a huge disk buffer,' says Pierre Vande Vyvre, CERN's project leader for data acquisition. One might call that an understatement.
    I expect he referred to the problem of finding the God Particle as "distinctly non-trivial".
  • Fun problem (Score:2, Insightful)

    by bob8766 (1075053)
    The network is one thing, but just processing that amount of data is incredible.

    200 computers break the 1GB/s stream into more manageable 5MB/s chunks of data, but then they still need to handle the metadata that figures out how to put it all back together. On top of this they'll need to have some redundancy in case of data loss, and a plan for how the load is redistributed if a machine croaks.

    These are good problems, it would be a fun system to work on.
  • Not So Huge (Score:5, Informative)

    by PenGun (794213) on Saturday July 21, 2007 @02:45AM (#19936083) Homepage
    It's only 5x HD SDI single channel ~ 200MB/s. Any major studio could handle this with ease.

    SDI is how the movie guys move their digital stuff around. A higher end digital camera will capture at 2x HD SDI for a 2K res, 4:4:4 colour space. A few of 'em and you got your 1GB/s easy. Spools onto godlike RAID arrays.

      Get 'em to call up Warner Bros if they have problems.
  • 1GB/sec is 3.6TB/hour, or 86.4TB/day, or 2.5PB in a month. That's really not all that huge for enterprise or scientific storage. I see that all the time in hosted environments.
  • by Nom du Keyboard (633989) on Saturday July 21, 2007 @03:52AM (#19936285)
    Just e-mail it all to Google. By then gMail should be able to handle that much per user.
  • by torako (532270) on Saturday July 21, 2007 @04:01AM (#19936309) Homepage
    It's important to distinguish between the amount of data generated during an event right in the detector and the filtered data that in the end will be kept and saved on permanent storage. The ATLAS detector, for example, has a data rate in the order of terabits per sec during an event. There's a pretty sophisticated multi-level triggering system whose purpose it is to throw out most of that data (~98%) and only look for interesting events.

    Right now, the average event size for ATLAS is 1.6 MByte and the system is designed to keep around 200 events per second, or roughly 300 MByte/s. This isn't much of course, but you have to consider that the bunch crossing rate (i.e. the rate at which bunches of protons will collide and generate events) is 40 MHz.

    So you have to design a system that boils this rate from 40 MHz down to 200 Hz and only keeps the interesting parts, while also buffering all the data in the meantime. For this reason, the first trigger level is entirely implemented in hardware right in the detector and reduces the rate down to 75 kHz with a latency of 2.5 µs. The rest of the trigger works on clusters using Linux computers and has a latency of O(1 s).
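The rate reduction described above (40 MHz bunch crossings, 75 kHz after the hardware trigger, 200 Hz kept) can be put into numbers with a small Python sketch. The constant and function names are hypothetical; the values are the ones given in the comment.

```python
BUNCH_CROSSING_HZ = 40e6   # collisions offered to the trigger
LEVEL1_OUTPUT_HZ = 75e3    # rate after the hardware trigger
STORED_HZ = 200            # events kept per second
EVENT_SIZE_MBYTE = 1.6     # average event size


def rejection_factor(rate_in_hz: float, rate_out_hz: float) -> float:
    """How many input events there are for each event passed on."""
    return rate_in_hz / rate_out_hz


def stored_rate_mbyte_per_s(events_per_s: float = STORED_HZ,
                            event_size_mbyte: float = EVENT_SIZE_MBYTE) -> float:
    """Data rate written to permanent storage."""
    return events_per_s * event_size_mbyte


print(rejection_factor(BUNCH_CROSSING_HZ, LEVEL1_OUTPUT_HZ))  # hardware stage
print(rejection_factor(LEVEL1_OUTPUT_HZ, STORED_HZ))          # software stages
print(rejection_factor(BUNCH_CROSSING_HZ, STORED_HZ))         # whole chain
print(stored_rate_mbyte_per_s())
```

Splitting the chain this way shows how much of the reduction falls on the hardware stage compared with the Linux clusters behind it.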

  • Better yet... (Score:2, Interesting)

    by curryhano (739574)
    ...all this data will be distributed to a handful of TIER1 sites (CERN is TIER0) all over the world (about 10). At the TIER1 sites the data will be preprocessed. The TIER1 sites distribute their preprocessed data to TIER2 sites which are the places where the international scientists work. I work at a TIER1 site and we face a lot of technical challenges with this project. At a TIER1 site as I mentioned, the data is preprocessed too, so we will need a compute cluster and the necessary bandwidth internally to mo
  • TFS makes a point about storing 1 GB (presumably GigaBYTE) of data per second, but THAT feat is already in widespread use, specifically for the digital manipulation of 4k film. The company that produces the systems that process this film data is called Baselight.

    Basically, 4k film, at a resolution of 4096x3112, requires approximately 50MB per frame @ 24 fps. That comes out to about 1.22GBps, and manipulating the data doubles it to 2.44GBps. The systems [PDF] that Baselight sells run 8 nodes and 16 process
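The per-frame and per-second figures for 4k film above can be checked with a few lines of Python. The comment does not state the pixel format, so the 4 bytes per pixel used here is an assumption chosen because it lands near the quoted 50MB per frame; the function names are hypothetical.

```python
def frame_bytes(width=4096, height=3112, bytes_per_pixel=4):
    """Size of one uncompressed frame. bytes_per_pixel=4 is an assumption."""
    return width * height * bytes_per_pixel


def stream_gb_per_s(fps=24, **frame_kwargs):
    """Sustained rate in GB/s (1 GB = 10^9 bytes) for playback at `fps`."""
    return frame_bytes(**frame_kwargs) * fps / 1e9


print(frame_bytes() / 1e6)        # MB per frame
print(stream_gb_per_s())          # GB/s for one stream
print(2 * stream_gb_per_s())      # read plus write while manipulating
```

A different pixel packing changes the result proportionally, so the numbers should be read as an order-of-magnitude check only.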
  • I am really surprised they did not use the Lustre filesystem for their data storage since it is vendor neutral, open, and designed for exactly this sort of thing. The Lustre guys report being able to obtain tremendous bandwidth and scalability. I have not yet been able to play with Lustre but I look forward to doing so.
  • by Roger W Moore (538166) on Saturday July 21, 2007 @04:48AM (#19936423) Journal
    The ALICE experiment is actually concentrating on heavy ion collisions, which is why they mainly worry about only one month per year. The rest of the time the machine is running protons for the other experiments, ATLAS and CMS, which will look for the Higgs. ALICE will hopefully study the quark gluon plasma but, as far as I know, has no plans to look for the Higgs.
  • OK, we got a half way overview of CERN's decision, with some bold statements of questionable validity. I am submitting the criticism purely on the grounds of being really interested in large data storage, I don't work for any large storage vendor, but I am an architect of storage systems.

    First of all, with the statement "and it's (StorNext) completely vendor independent": Lots of other solutions provide flexibility about choosing the hardware vendor from a theoretical perspective. The theory says that if

  • ...just because a SAN is connected at 1Gbit to a machine does not mean there is 1 Gbit of data passing over there all the time.

    If I were to write up my house network I could say 'network switches feed data to several computers at 1Gbit per second' - this would be true if I only use it for web browsing - doesn't mean I'm saturating my bandwidth.
  • by Mostly a lurker (634878) on Saturday July 21, 2007 @08:26AM (#19937207)
    I assume they will want to have more than one copy of this for backup purposes. Here is my analysis on their choices. The total data to be backed up (for the month) is taken as a lazy 1 * 60 * 60 * 24 * 30 = 2,592,000 gigabytes
    • Printed hardcopy. Many authorities recommend this as you do not need to worry about changes in data formats over time. For exact calculation, we would need to know the font they were planning to use and the character encoding. However, let's take a working assumption that they can cram 10KB of data onto an A4 sheet. That implies 259,200,000,000 pages. They will probably not want to use an inkjet printer if they use this solution and may, indeed, choose to acquire multiple printers and split the load. A single printer at 10 ppm would take approximately 50,000 years to complete the backup. On 70gm paper, it would weigh a little over two million tons. At any rate, this would certainly produce reams of output.
    • Diskettes. This was good enough for nearly everyone 15 years ago. It is curious that such a tried and trusted technique is no longer in fashion. I assume regular 3.5" 1.44MB diskettes, generally recognised as easier to handle than 5.25". We shall need around 1,800,000,000 diskettes. One drawback is the person changing the diskettes as each one filled up might become a little bored after a while. On the positive side, the backup will be quite a lot faster than the printed solution. Assuming about one diskette per minute, inclusive of changing disks, the backup could be complete in less than 3,500 years.
    • Now considered somewhat old fashioned, punch cards were once a mainstay of every programmer's personal backups. Like printed hardcopy, anyone familiar with the character encoding used, could read the data without needing any access to a computer. If we assume 80 column cards, we would need 32,400,000,000,000 cards. I would be somewhat concerned about the problem of getting this stack of cards back in the correct order if I dropped it. With a weight of about 30 million tons and stretching perhaps 6 million miles end to end, handling certainly would be challenging and an accident very possible.
    • Paper (punched) tape was the only alternative on the first computer I used, a basic early model Elliott 803 without the optional magnetic tape. If I recall correctly, you could manage about 10 characters per inch, so you would need a paper tape over 4,000,000,000 miles long. Hmmm, that would be silly. The other solutions are clearly better.
    I am sure other options will be considered, but I just wanted to bring these up in case CERN had failed to consider them
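For anyone who wants to redo the media counts in the list above, here is a minimal Python sketch. The helper names are hypothetical, and the unit choices are assumptions: 1 GB = 10^9 bytes, 10 KB = 10,000 bytes per page, 1.44 MB = 1,440,000 bytes per diskette, one byte per punch-card column, and one byte per paper-tape character.

```python
TOTAL_BYTES = 2_592_000 * 10**9  # one month at 1 GB/s
MINUTES_PER_YEAR = 60 * 24 * 365
INCHES_PER_MILE = 63_360


def units_needed(bytes_per_unit: int) -> int:
    """Number of pages, diskettes, or cards needed (rounded up)."""
    return -(-TOTAL_BYTES // bytes_per_unit)


def years_at(units: int, units_per_minute: float) -> float:
    """Time to write `units` items at a steady rate, in years."""
    return units / units_per_minute / MINUTES_PER_YEAR


def tape_miles(chars_per_inch: int = 10) -> float:
    """Length of punched paper tape holding all the data, in miles."""
    return TOTAL_BYTES / chars_per_inch / INCHES_PER_MILE


pages = units_needed(10_000)
diskettes = units_needed(1_440_000)
cards = units_needed(80)
print(pages, years_at(pages, 10))         # A4 sheets, one printer at 10 ppm
print(diskettes, years_at(diskettes, 1))  # one diskette per minute
print(cards)                              # 80-column punch cards
print(tape_miles())                       # 10 characters per inch
```

Weights and stack lengths are left out because they depend on paper and card dimensions that the comment does not specify.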
    • by ookabooka (731013)
      Why not just have volunteers remember the data? If you made a linked list of individuals so that each individual would remember the name/face of the individual after him and also either a 0 or a 1 representing the data he stores. By doing this you would need just under 21 quadrillion people (20,736,000,000,000,000 people to be exact). A doubly linked list would only require that each individual remembers 2 people (before as well as after) which is quite manageable. The number of people required obviously goe
    • Re: (Score:2, Funny)

      by fatphil (181876)
      Nice figures. If they did use 3.5" diskettes, then they'd have to write 1000/1.44 per second or roughly 700/s. Assuming they could be written to instantly, they'd need to move through a single drive at 700*3.5"/s = 224km/h. Assuming you need to get them stationary to write to them, then they'd need a maximum speed of 448km/h to keep up the mean speed. Don't stand in their way...

      Of course, the tower of floppies for each day would be 151km high...

      No, I don't know what that is in football fields.
