Elevation Plays a Role In Memory Error Rates

alphadogg writes "With memory, as with real estate, location matters. A group of researchers from AMD and the Department of Energy's Los Alamos National Laboratory have found that the altitude at which SRAM resides can influence how many random errors the memory produces. In a field study of two high-performance computers, the researchers found that L2 and L3 caches had more transient errors on the supercomputer located at a higher altitude, compared with the one closer to sea level. They attributed the disparity largely to lower air pressure and higher cosmic ray-induced neutron strikes. Strangely, higher elevation even led to more errors within a rack of servers, the researchers found. Their tests showed that memory modules on the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack. However, it's not clear what causes this smaller-scale effect."

  • Heat related? (Score:5, Insightful)

    by Anonymous Coward on Friday November 22, 2013 @11:00AM (#45491577)

    Top of the rack tends to get toasty, but is this too simple?

    • Top of the rack tends to get toasty, but is this too simple?

      I logged in to say that.

      It seems obvious -- heat rises, I would expect top of rack components to fail more often unless the cooling design is well done.

      Completely fabricated statistic: Only 10% of datacenters have proper cooling design.

      • Re:Heat related? (Score:4, Interesting)

        by AmiMoJo ( 196126 ) * on Friday November 22, 2013 @11:24AM (#45491889) Homepage Journal

        Vibration as well. The top of the stack moves quite a bit more than the bottom of the stack, even though the overall magnitude of the movement is small.

        • Although true, I don't imagine vibration has any effect on SRAM error rates. Hard drive failure rates, I could imagine (though that's a big stretch).

          I wonder if it has to do with the upper servers shielding the lower ones on the rack from the cosmic rays. Time for a tinfoil hat for my servers!

    • Re:Heat related? (Score:4, Informative)

      by spike hay ( 534165 ) <blu_ice&violate,me,uk> on Friday November 22, 2013 @11:09AM (#45491705) Homepage

      If it's cosmic rays causing a lot of the problem, the extra material of the racks above would make a difference.

      • That assumes that the rays tend to come down vertically. I don't know what the distribution would be, but I'd be very surprised if it was mostly vertical at any particular point on earth. So then it would depend on what the rays had to travel through to get to the memory chips. I'd further assume the computers were not exposed to the sky, so I remain skeptical of the cosmic ray explanation.

        It would be easy to test though. Have a rack of servers with only the bottom one turned on. Then move that server

        • Re:Heat related? (Score:5, Interesting)

          by DeathToBill ( 601486 ) on Friday November 22, 2013 @11:53AM (#45492233) Journal

          I was looking into RAM error rates a week or so ago. There's not a lot of research around, but I recall seeing suggestions that error rates were significantly smaller if the chips were mounted vertically rather than horizontally - because vertically mounted chips present a lower vertical cross-section and most error-inducing cosmic rays come at near-vertical inclination.

          • Right, cosmic rays have a hard time penetrating through too much matter, even air, so it makes sense. I've been reading articles about high energy neutrino detection and maybe confused the two just a little. I stand corrected.

            • Re:Heat related? (Score:4, Informative)

              by fnj ( 64210 ) on Friday November 22, 2013 @04:28PM (#45495335)

              Cosmic rays (they are actually particles, not electromagnetic radiation) cover a whole range of stuff, with individual particles varying extremely widely in energy content. Primary cosmic rays originate outside Earth's atmosphere. When they collide with the atmosphere, secondary cosmic rays are generated. Primary cosmic rays are mostly (99%) nuclei of various atoms. The remaining 1% are mostly free electrons (beta particles). In turn, 90% of the nuclei are free protons (hydrogen nuclei), just because most of the matter in space is hydrogen. 9% are alpha particles (helium nuclei), and 1% are the nuclei of other (heavier) elements. There is also a very small fraction of more exotic stuff, like antimatter.

              While the mean energy content of a cosmic ray particle is in the range of only about 10^-11 to 10^-10 J, extremely rare single particles with energy content up to 50 J exist. This energy is truly astounding, as it means a single submicroscopic particle has the same kinetic energy as a slowly pitched or fairly briskly thrown baseball!

              Cosmic [wikipedia.org] rays [caltech.edu] are some of the most penetrating radiative phenomena known. Just compare their mean atmospheric penetrative power [bham.ac.uk] to that of other radiative phenomena. The following represent rough mean values of what are actually widely distributed ranges; in other words, some fraction of cosmic rays penetrate hugely in excess of the figure quoted below, just as some fraction falls far short.

              cosmic "rays" - 10,000 m (about the same for both primary and secondary)
              gamma rays - 1000 m
              x-rays - 100 m
              alpha particles - 0.1 m

              It should also be noted that significant sources of radiative phenomena are generally point sources, or at least localized sources. They are attenuated in concentration, not total amount, by distance, even in a perfect vacuum. This arises due to spreading out according to the inverse square law. For example, if you want to escape the radiation from a nuclear explosion, even in outer space, you can just move away from it. Cosmic rays are completely different in that they are diffuse. They are not "radiating" from a single point at all. They are distributed in concentration and direction everywhere. There is no attenuation due purely to distance. The attenuation of cosmic rays by the atmosphere is a result of collisions of cosmic ray particles with the atoms in the atmosphere.

              Cosmic rays, or better stated, cosmic ray products (neutrinos) have been detected in deep mineshafts after penetrating kilometers of rock. Clearly the beta particles are not penetrating very much at all, and even the nuclei have limited penetration, but some of the subnucleic particles ain't stoppin' for nobody.

          • by WWJohnBrowningDo ( 2792397 ) on Friday November 22, 2013 @08:12PM (#45497363)

            BRB, going to convince my boss to tip all our servers over.

        • It doesn't necessarily assume that at all, but what it can assume is that rays coming down vertically have less atmosphere to travel through than rays at any angle, and thus have more energy when they hit the server. Same reason the midday sun has more heat than the morning or setting sun.
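
The geometric point in the comment above can be sketched numerically. This is only a rough illustration, assuming a flat-slab atmosphere in which the amount of air traversed scales as 1/cos(zenith angle); that approximation is not from the thread and breaks down near the horizon. The function name and the sample angles are hypothetical.

```python
import math

def relative_slant_depth(zenith_deg):
    """Air traversed relative to a vertical path, flat-slab approximation.

    zenith_deg is the angle from straight overhead (0 = vertical).
    Assumption: the atmosphere is a flat slab, so depth scales as 1/cos(angle).
    """
    if not 0 <= zenith_deg < 90:
        raise ValueError("zenith angle must be in [0, 90) degrees")
    return 1.0 / math.cos(math.radians(zenith_deg))

# Hypothetical sample angles, chosen only for illustration.
depths = {angle: relative_slant_depth(angle) for angle in (0, 30, 60, 75)}

for angle, depth in depths.items():
    print(f"zenith {angle:2d} deg: {depth:.2f}x the vertical air column")
```

Under this approximation a particle arriving 60 degrees off vertical crosses about twice as much air as one arriving straight down, which is consistent with the argument that near-vertical arrivals are the least attenuated.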

      • Re:Heat related? (Score:4, Informative)

        by barlevg ( 2111272 ) on Friday November 22, 2013 @12:12PM (#45492415)

        Back-of-the-envelope calculation using XCOM. [nist.gov]

        Assume server rack and contents are made of aluminum (what is the predominant material in a server rack?). Let's say the server rack is 2m in height, but it's not fair to make the whole thing metal. Let's say 20% of it is metal (aluminum for this calculation), the rest is air (or, for the sake of calculation, vacuum). Aluminum has a density of 2g / cm^3 (so a 1m x 1m x 0.4 m slab of aluminum would weigh 800 kg, which appears to be in the middling range for what a server rack can accommodate--again, keep in mind, this is a really rough calculation).

        Okay, plugging in Aluminum into XCOM gives a total attenuation in the 100-1k MeV range of ~0.03 cm^2/g.

        e^[-(0.03 cm^2/g) * (2g / cm^3) * 40 cm] = 0.09

        In other words, that's roughly 90% attenuation. Keep in mind that this was a ridiculously sloppy calculation, with my material assumptions (and possibly energetic ranges) being way off (also, neutron cross-sections could easily be different than photon cross-sections). The point is, it's certainly possible (nay, likely) that the material of the servers themselves is providing shielding for the servers on the bottom of the rack.
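
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is a sketch of simple exponential (Beer-Lambert) attenuation using the numbers quoted in the comment. The second case swaps in the handbook density of aluminium (about 2.7 g/cm^3), which is an assumption added here for comparison and not part of the original calculation. The function and variable names are hypothetical.

```python
import math

def transmitted_fraction(mass_atten_cm2_per_g, density_g_per_cm3, thickness_cm):
    """Fraction of incident particles that pass through a slab without interacting."""
    return math.exp(-mass_atten_cm2_per_g * density_g_per_cm3 * thickness_cm)

# Values quoted in the comment.
MU_OVER_RHO = 0.03    # cm^2/g, total attenuation read off XCOM by the commenter
THICKNESS_CM = 40.0   # 20% of a 2 m rack treated as solid metal

# Density used in the comment.
as_posted = transmitted_fraction(MU_OVER_RHO, 2.0, THICKNESS_CM)

# Assumption for comparison only: handbook density of aluminium, ~2.7 g/cm^3.
with_handbook_density = transmitted_fraction(MU_OVER_RHO, 2.7, THICKNESS_CM)

print(f"transmitted, density 2.0 g/cm^3: {as_posted:.3f}")
print(f"transmitted, density 2.7 g/cm^3: {with_handbook_density:.3f}")
```

With either density the transmitted fraction stays below ten percent, so the qualitative conclusion does not hinge on the density figure. As the comment itself notes, photon cross-sections are only a stand-in for neutron behaviour here.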

      • by gl4ss ( 559668 )

        just take them out of the rack to test..

    • by edibobb ( 113989 )
      They took that into account.
    • by dszd0g ( 127522 )

      Single event upsets (SEUs) are caused by cosmic particles, which create alpha particles, so it makes sense that equipment higher in the rack would absorb more of the alpha particles and block them from systems lower in the rack, but I am not a physicist. Alpha particles are relatively easy to block with shielding.

      http://www.statemaster.com/encyclopedia/Single_event-upset [statemaster.com]

      As the link said, this was first theorized in 1978 and supercomputer companies have been designing systems with this in mind for decades.

    • Re: (Score:2, Informative)

      by Anonymous Coward

      Top of the rack tends to get toasty, but is this too simple?

      It is too simple.
      In a data center with downflow CRACs that push air through perforated tiles, sufficient underfloor plenum pressure is supposed to be maintained so that the upward air velocity carries cold air all the way up the front of the cabinet, affording sufficient cooling to everything. Not that it always works that way.

      But one thing to consider is dirt.
      Even with MERV 8 or better filtration, dust will still circulate in a data center cooled this way. With the filtration on the CRAC return, the ligh

      • by tibit ( 1762298 )

        Stupid question: why do blowers used in a data center need belts? These days, they should all be direct-driven by brushless motors. At most you need a coupling, although blower-duty brushless motors should have bearings sufficient to support the blower, thus you need no couplings. That way only one bearing is anywhere near being exposed to air that is blown around. I've been to a clean room facility that had all ventilation systems completely direct-driven, and the facilities people loved it.

      • Just curious. I've seen direct driven blowers in a number of various applications. Is there some special need to use belt driven blowers for the air in data centers?
    • by gweihir ( 88907 )

      No. Or at least not with competently designed racks, as they are cooler on the top.

      The reason is likely a lot simpler: particles that cause this come from above. Traveling through a number of steel plates (2 per server) stops some of them and reduces energy for others. Hence fewer reach the bottom of the rack. In addition, those that do not come straight from above have to travel through more air, hence they are fewer or have less energy. See? Simple.

    • Top of the rack tends to get toasty, but is this too simple?

      This is an explanation they suggest in TFA

      • by gweihir ( 88907 )

        Then they do not know much about rack construction. Standard racks suck in cold air from the front (cold aisle) and blow it out the back (hot aisle). There is no difference whether the computer sits on the bottom or the top of the rack as the hot air from any of them never gets to another computer directly.

  • Fusion IO? (Score:4, Interesting)

    by shadowknot ( 853491 ) * on Friday November 22, 2013 @11:03AM (#45491601) Homepage Journal
    Someone tell Fusion.io. They're based at 5000+ feet here in the Salt Lake valley! It would be interesting if their QC procedures are what have made them more reliable as the failure rate is higher where the testing is performed.
  • basements (Score:5, Funny)

    by Anonymous Coward on Friday November 22, 2013 @11:06AM (#45491655)

    Another reason for nerds to stay in the basement

  • This isn't news (Score:5, Informative)

    by dszd0g ( 127522 ) on Friday November 22, 2013 @11:09AM (#45491693) Homepage

    This isn't news. Companies that make supercomputers have known this for decades. The one I worked for 15 years ago used a high elevation test environment in Colorado to verify error correcting capabilities. Even the article says that the results were not a surprise.

    • Re:This isn't news (Score:5, Informative)

      by edibobb ( 113989 ) on Friday November 22, 2013 @11:17AM (#45491815) Homepage
      From the article: "It is well known that the altitude at which a data center resides has consequences with regards to machine fault rates. The two primary causes of increased fault rates at higher altitude are reduced cooling due to lower air pressure and increased cosmic ray-induced neutron strikes."
      • by Hatta ( 162192 )

        So we should build data centers in abandoned mines. Plenty of shielding from cosmic rays, a steady 55F ambient temperature, and all the heat exchange capacity you could want.

      • by sootman ( 158191 )

        If the submitters actually took the time to read the articles, the quantity of stories posted here would drop substantially. NEXT PLEASE!

    • It is interesting though, and not having found out about it when it was or would have been news makes it a good Slashdot topic (if for nothing else than making more people aware).

  • A couple of years back at one of the Supercomputing conferences (I think in Phoenix), Fermilab had a cloud chamber in their booth, and you simply *would* *not* believe the amount of ambient radiation passing you at all times. I can easily believe that altitude would have an effect.

    Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center? What happens during an aurora?

    • by Antipater ( 2053064 ) on Friday November 22, 2013 @11:17AM (#45491809)

      Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center?

      They tried to do that test a few years back, but both research teams mysteriously disappeared. The leading hypothesis is that the Arctic team was eaten by polar bears, but nobody has any idea what happened to the Maui team. The only clue left at the scene was a nearly-empty glass of pina colada.

      • but nobody has any idea what happened to the Maui team. The only clue left at the scene was a nearly-empty glass of pina colada

        Japanese tourists (have you seen how they get when a little alcohol's added to the mix??).

      • by PPH ( 736903 )

        but nobody has any idea what happened to the Maui team.

        Didn't you watch Lost? They were eaten by polar bears.

      • They were taken out by shoggoths. What's that knocking at the door? Oh, apparently it's a nice, nice man from the laundry; says I have to come with them :-)
    • by cusco ( 717999 )

      My wife grew up in Puno, Peru, at 3840 meters (12,600 feet) altitude. You will get sunburned so fast you won't believe it, even when you're dark complected like me. Black African tourists get sunburned. IIRC, most of the air molecules belonging to Earth are well below that altitude. If the effect on ultraviolet light shielding is that dramatic I can't help but think that other cosmic radiation is going to be stronger at that altitude as well.

  • by Nkwe ( 604125 ) on Friday November 22, 2013 @11:13AM (#45491761)
    If you get high you can lose your memory?
  • According to the article the low elevation system was a Jaguar supercomputer whereas the high elevation one was a Cielo supercomputer. Based on available specs for each, the two are entirely different systems. How can they reach conclusions about altitude-relative bit error rates when they're not even comparing the same system? The article goes on to state:

    "The group had found that, when all other possible confounding issues were factored out, Cielo's SRAM had a "significantly higher rate of SRAM faults,"
  • by mdsolar ( 1045926 ) on Friday November 22, 2013 @11:23AM (#45491877) Homepage Journal
    It seems to me that an unexploited structure for a low radiation environment is the bottom side of a water tower. Steel has most radionuclides slagged off when it is produced while drinking water standards ensure the water in the tower will have low radioactivity. A meter or two of water forms a nice shield for cosmic rays from above while the air below the tower shields against lower energy ground radiation. And, you get a nice heat sink in the water for cooling electronics.
  • As this is not (mainly) about the system RAM, it's about the CPU caches, I wonder if any attempt is being made to correct the errors, and if it's worthwhile. One would just need to reset the node on any sign of an error, so the capacity penalty would be small compared to ECC. On the other hand, the errors could just as well happen in the actual logical units, and at some point it's impossible or very expensive to protect everything. Because the SRAM takes up a large fraction of the CPU area, it may be usefu

  • Hmmm .... (Score:4, Funny)

    by gstoddart ( 321705 ) on Friday November 22, 2013 @11:34AM (#45491993) Homepage

    Is this why when I'm in an airplane I can never remember if I turned all the lights out? ;-)

    • Actually, my company makes flash memory and we ship most of it overseas for final packaging. We have to allow for a certain amount of die loss from cosmic rays striking the wafers while they are in flight.

  • . . . .recall that the new NSA "Supercenter" in Utah is at ~4300 feet. So they'll be making a lot MORE errors when monitoring us all. . .
  • The 1U lead block. Place at top of rack to protect the servers below.

    Does it work? Who cares. If people will pay £150 for a wooden volume knob on their audio system, someone is going to pay whatever you ask for a lump-o-lead that may or may not improve the reliability of equipment below.

  • Why did anyone need to do this field survey? It simply confirms what we already know - cosmic rays create SRAM errors. Hot components fail more than cold components. Big whoop.

    • Why did anyone need to do this field survey?

      Well, there's two possible responses to this.

      1) We're slashdot and we think we know everything, they should have just asked us, how dare they

      2) Maybe we might trust that a "A group of researchers from AMD and the Department of Energy's Los Alamos National Laboratory" aren't idiots and wanted specific empirical evidence on the topic?

    • by Tailhook ( 98486 )

      what we already know

      Conditions change. Every 18-24 months a new node appears — 22nm is the scale of contemporary shipping devices. As features shrink their behavior changes and new data is needed. There are applications that need to know error rates to compute how much mitigation is required.

      We're not all just making web pages out here.

  • cosmic ray flux (Score:3, Informative)

    by volvox_voxel ( 2752469 ) on Friday November 22, 2013 @12:08PM (#45492377)
    Here is a plot of the cosmic ray flux (coincidence counting rate per second) vs altitude. It's also not hard to build a detector that can detect these. You can use something called coincidence detection, where two scintillator plates are placed right on top of one another and each plate is connected to a photomultiplier tube. If both photomultiplier tubes trigger, it's a cosmic ray event. If only the top one triggers it could still be a muon though.

    http://hyperphysics.phy-astr.gsu.edu/hbase/astro/cosmic.html [gsu.edu]
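
The coincidence idea described above can be sketched in software. This is a minimal illustration, assuming each photomultiplier tube yields a sorted list of trigger timestamps in seconds and that two triggers within a short window count as one event. The 100 ns window, the function name, and the sample timestamps are all hypothetical and not taken from the comment or the linked page.

```python
def count_coincidences(top_hits, bottom_hits, window_s):
    """Count top-plate triggers that have a bottom-plate trigger within window_s.

    Both inputs must be sorted lists of timestamps in seconds.
    Each bottom-plate trigger is matched at most once.
    """
    count = 0
    j = 0
    for t in top_hits:
        # Discard bottom triggers that are too early to match this top trigger.
        while j < len(bottom_hits) and bottom_hits[j] < t - window_s:
            j += 1
        # The next remaining bottom trigger is the only candidate for a match.
        if j < len(bottom_hits) and abs(bottom_hits[j] - t) <= window_s:
            count += 1
            j += 1
    return count

# Hypothetical trigger times for illustration only.
top = [0.10, 0.50, 1.20, 2.00]
bottom = [0.10000002, 0.90, 1.20000001, 3.00]

n = count_coincidences(top, bottom, window_s=100e-9)
print(f"coincident events: {n}")
```

Triggers seen on only one plate are simply not counted here, which mirrors the comment's caveat that a single-plate trigger is ambiguous.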

  • Earth acts as a shield that protects memory from radiation coming from the other side of the planet. In addition, the collision probability of a particle is proportional to the distance it travels through the atmosphere, so on the ground there is a higher probability of being hit by particles coming from the vertical. On a desktop computer the RAM is usually oriented vertically, exposing its shorter side to the top: the exposed area is very small for radiation coming from above. Not that because of the motherb

  • by lyapunov ( 241045 ) on Friday November 22, 2013 @12:18PM (#45492469)

    There are statistics that cover the expected frequency of events caused by radiation in the first couple of pages.

    http://docs.oracle.com/cd/E19095-01/sf3800.srvr/816-5053-10/816-5053-10.pdf [oracle.com]

  • I'm also more prone to errors when I'm high

  • From deep within the PDF (second link):

    The two primary causes of increased fault rates at higher altitude are reduced cooling due to lower air pressure and increased cosmic ray-induced neutron strikes.

    (Living in Colorado, I thought perhaps chips suffered from the same spurting newly opened toothpaste tube problem when manufactured at low altitude and installed into operation at high altitude, but it turned out the hypothesis was different, and, of course, left out of the Slashdot summary.)

  • What kind of materials (if any) are effective in blocking cosmic rays? Would it be possible to integrate cosmic radiation shielding into an average-sized PC case? If that's impractical, are there building materials that can be used in roofs and/or walls to block this stuff without breaking the bank?

    • by cusco ( 717999 )

      Depends on the type of cosmic rays you want to shield. For muons and the like, good luck. Huge amounts of mass are necessary, but then muons probably wouldn't interact much with your computer RAM anyway. The practical worry is alpha particles, which will flip a bit once in a while.

  • "However, it's not clear what causes this smaller-scale effect."
    Servers are made out of metal and have EM fields! This isn't hard.
  • In aggregate, the entire atmosphere down to sea level works out to something like the equivalent of 30ft of water of shielding. A 20% reduction thru an entire rack of servers sounds to be in about the right ballpark.

    People have been running the same experiments on international flights on laptops for years.
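
The water-equivalent figure above can be checked with a quick calculation. This sketch compares mass per unit area only: it divides standard sea-level pressure by gravity to get the mass of the air column, then asks how deep a layer of water would have the same mass. The constants are standard reference values, not numbers from the thread, and equal column mass is only a rough proxy for equal shielding.

```python
P_SEA_LEVEL_PA = 101_325.0   # standard atmosphere, Pa
G_M_PER_S2 = 9.80665         # standard gravity, m/s^2
RHO_WATER_KG_PER_M3 = 1000.0
M_PER_FT = 0.3048

# Mass of air above one square metre of ground at sea level.
column_mass_kg_per_m2 = P_SEA_LEVEL_PA / G_M_PER_S2

# Depth of water with the same mass per unit area.
water_depth_m = column_mass_kg_per_m2 / RHO_WATER_KG_PER_M3
water_depth_ft = water_depth_m / M_PER_FT

print(f"air column: {column_mass_kg_per_m2:.0f} kg/m^2")
print(f"water equivalent: {water_depth_m:.1f} m ({water_depth_ft:.1f} ft)")
```

The result lands a little above the 30 ft quoted in the comment, so the ballpark holds.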

  • How long before the cloud computing and storage services start charging a slight premium to have your stuff run/store on lower spots in their server racks?
  • While I can guess why more ionizing events stemming from neutron impacts affect electronics, I don't get the blaming of "pressure". Perhaps they mean the reduced air cooling of electrical components from a less dense atmosphere? Someone else noted that components on the top of a rack might tend to be warmer. This might be more of the same sort of effect.
    • by khallow ( 566160 )
      Never mind. I wasn't thinking it through. Pressure also is an indication of how many potentially heat absorbing particles are impacting a heat sink surface.
  • by Salamander ( 33735 ) <`jeff' `at' `pl.atyp.us'> on Friday November 22, 2013 @04:10PM (#45495067) Homepage Journal

    About five years ago, I was involved in the installation of a thousand-node cluster in Boulder. We knew *before we went in* that we needed to change our EDAC (memory error correction) code to account for the higher rate of bit-flips due to the altitude. Some of the people we were working with had been there when those same problems nearly caused a months-long delay in a larger installation at NCAR nearby. We ended up running into a more subtle problem involving lower air density, heat and voltage, but *this* problem was incredibly old news even then.

  • Just like humans, computers have trouble remembering things when high.
  • by aegl ( 1041528 ) on Friday November 22, 2013 @06:18PM (#45496525)

    Perhaps the researchers are too young to have read this 1979 paper http://www.ncbi.nlm.nih.gov/pubmed/17820742 [nih.gov]
