
Many DDR3 Modules Vulnerable To Bit Rot By a Simple Program

New submitter Pelam writes: Researchers from Carnegie Mellon and Intel report that a large percentage of tested regular DDR3 modules flip bits in adjacent rows (PDF) when a voltage on a certain control line is forced to fluctuate. The program that triggers this is dead simple: just two memory reads with a special relative offset and some cache control instructions in a tight loop. The researchers don't delve deeply into applications of this, but hint at possible security exploits. For example, a rather theoretical attack on the JVM sandbox using random bit flips (PDF) has been demonstrated before.
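The loop that does this is short enough to sketch. Below is a minimal version in C (a sketch under assumptions, not the paper's exact code: it assumes x86-64 with SSE2 intrinsics, and x and y are hypothetical addresses that must map to different rows of the same DRAM bank, something that depends on the memory controller's address mapping):

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>

/* Hypothetical hammer loop: two reads per iteration, with cache
 * flushes so each read really goes to DRAM and re-activates its
 * row. Without clflush, the cache would absorb the reads. */
void hammer(volatile uint8_t *x, volatile uint8_t *y, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*x;                      /* activate row containing x */
        (void)*y;                      /* activate row containing y */
        _mm_clflush((const void *)x);  /* evict x from the cache    */
        _mm_clflush((const void *)y);  /* evict y from the cache    */
        _mm_mfence();                  /* order the flushes/reads   */
    }
}
```

None of this requires privileges: clflush is an ordinary user-mode instruction, which is what makes the access pattern interesting from a security standpoint.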
This discussion has been archived. No new comments can be posted.

  • I don't know if there are hundreds, thousands, or hundreds of thousands of low-level 'bugs' like this related to simple subsystems abused in specific ways, but there are plenty.

  • Many DDR3 modules? (Score:4, Insightful)

    by ArcadeMan ( 2766669 ) on Wednesday December 24, 2014 @09:12AM (#48666425)

    This is all very interesting but totally pointless! Which modules? Tell us the brands, model names, and manufacturer part numbers.

    • Comment removed (Score:5, Insightful)

      by account_deleted ( 4530225 ) on Wednesday December 24, 2014 @09:21AM (#48666463)
      Comment removed based on user account deletion
      • It also means that 19 out of 129 DRAM modules are not affected by this problem, hence my question.

        • Comment removed (Score:5, Interesting)

          by account_deleted ( 4530225 ) on Wednesday December 24, 2014 @09:37AM (#48666555)
          Comment removed based on user account deletion
          • by Luckyo ( 1726890 )

            The overwhelming majority of "PC gaming benchmark queens" wouldn't give a toss, because memory speed hasn't been a bottleneck in gaming for many years.

            The people who would care are ordinary users and OEMs, who would have to absorb the extra cost. For OEMs especially, the costs are far from trivial.

            • Comment removed based on user account deletion
              • I'm not sure whether I'm more bothered by "benchmark queens" or people who flame over their subjective opinions. The latter are a lot like "audiophiles", unwilling to believe in blind testing.

                • 'I'm not sure whether I'm more bothered by "benchmark queens" or people who flame'.

                  FTFY. Does anyone ever flame about anything except subjective opinions?

                • I'm also bothered by people who put the word audiophiles in scare quotes for no good reason. P.S. Not all audiophiles are opposed to blind testing; some people like expensive audio toys that are objectively better too.

                  • They aren't scare quotes - they are there to differentiate people who think they can hear things that they really can't from people who truly chase better sound. If I hear anything about oxygen in your speaker wire, you'll get the quotes.

                    • My understanding was that oxygen-free copper is supposed to be more fatigue tolerant, so that it gives better plug-unplug endurance, not better sound.

                    • I've seen nonsense about inductance and capacitance. And then it'll be stranded. Oy.

                      Most people are using it to make a permanent connection in their homes with stranded wire... so endurance, fatigue, corrosion are all non-issues. I would wager a very high sum of money that double-blind testing would result in no perceptible difference [consumerist.com].

                    • Oops, I had meant to say that in my comment - that very few people will need the "endurance" - I completely agree. I have to admit that I got suckered into buying zero-oxygen-copper cables (it sounds good, doesn't it?), until I decided to check what it actually meant - zilch!

                    • ALL of that audiophile stuff sounds good (pun intended).

                    • Inductance and capacitance impact total impedance, and it is possible to find bad combinations where that turns into an easily measurable problem with the cable. See high cap wire [roger-russell.com] section of "Speaker Wire: A History" for how that comes out on a scope. It's very easy to find cases where the wire doesn't matter too. One of the funny things about objective audio testing is that people usually find what they set out to, because it's so easy to set up tests to give the results you want. That doesn't disprove

                    • Oxygen-free copper [wikipedia.org] is very much a real thing, and it does matter for some applications. The only part that's hard to support is whether those differences are audible in home audio. All other things being equal between two cables, it shouldn't matter. (All other things are usually not equal.)

                    • Perhaps you can measure things on a scope, but that doesn't mean the difference is perceptible. It's not my money, so I don't really care what audiophiles do with it - but they also seem to expect me to be impressed, which I am not. I politely nod but honestly think they are just burning their money. I can't take someone seriously who thinks that oxygen makes a perceptible difference in audio, and then think nothing of using stranded wire vs. solid. Even with an oscilloscope, the stranded vs. solid will be

                    • Some ludicrously overpriced cable aimed at the mass market is stranded, with Monster being the biggest offender by volume. But most of the really expensive speaker cable is solid core instead of stranded, with the core size limited only by how flexible the cable needs to be. The stuff I like uses a number of 14 AWG wires that total to match 12 AWG. I've tried using twisted pairs of 12 AWG copper instead, just basic power cable from Home Depot, but I can barely route the stuff. I like the cables (and amp

                • Audio queen here, you probably mean double blind testing.

            • Memory speed can technically still be the bottleneck on large memory footprint games like BF4; see the bit-tech review [bit-tech.net] for some numbers on that. The people chasing after PC gaming benchmarks reflexively use the fastest memory around though, and if you do that it's less likely for memory to dictate the speed limits.

              • by Luckyo ( 1726890 )

                This used to be the problem back in the day before DDR3, true. After DDR3 got to around 1333-1600MHz, the problem was effectively eliminated, in favour of latency being the only reasonable bottleneck. And that actually gets worse rather than better when you increase the frequency.

                The tests you link show exactly that - no noticeable difference. They're looking at a 1-2% difference between 1333 modules and 2400 modules, because that is not the bottleneck. The system is bottlenecked elsewhere, most likely on the GPU. If th

                • That's not how it works. The way you spot a bottleneck in performance work is that if you change anything else, there is zero impact on the resulting system speed. Conversely, if you alter something and the system really does get faster, you must have just hit one of the bottlenecks.

                  Given that, the way high detail performance goes from 83 to 86 FPS as RAM speed increases means that RAM speed must have been a bottleneck. If speed had been strictly limited by the video card instead, speeding up the memory

                  • by Luckyo ( 1726890 )

                    Actually, that is how it works. The concept of a bottleneck refers to a pipe-and-pool system where the thickness of the pipe is the limiting factor, and widening the pipe offers a comparable increase in flow throughput.

                    When you double the pipe's thickness and get 1-2% more flow, it means that your system's bottleneck is elsewhere.

                • by Agripa ( 139780 )

                  This used to be the problem back in the day before DDR3, true. After DDR3 got to around 1333-1600MHz, the problem was effectively eliminated, in favour of latency being the only reasonable bottleneck. And that actually gets worse rather than better when you increase the frequency.

                  The latency at higher clock frequency does not increase in the way you suggest. It only appears that way because latency is measured in clock cycles so when the clock cycle is halved, twice as many are needed for a given duration.

                  • by Luckyo ( 1726890 )

                    Where did I post anything to suggest what you're suggesting?

                    It's well known that increasing RAM frequency impacts latency in a net negative way. Your suggestion implies that the impact is neutral, when it's rarely so unless you buy much more expensive RAM specifically picked and binned for those frequencies and latencies. Typical RAM sold incurs a significant net negative impact on latency as frequency increases. The alternative is lower reliability.

                    Anyone who did any overclocking and worked with RAM memory doing it sh

                    • by Agripa ( 139780 )

                      Where did I post anything to suggest what you're suggesting?

                      And that [latency] actually gets worse rather than better when you increase the frequency

                      Increasing the RAM frequency has little or no effect on latency; it only changes the unit of measurement. Latency as measured in clock cycles goes up, but latency measured in nanoseconds stays roughly the same (actually, it generally gets better), and it is the latter which matters as far as the processor is concerned.

                      The first word access time shown in this table
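                      To put numbers on the cycles-versus-nanoseconds point, here is a quick conversion sketch; the module timings are typical retail examples chosen for illustration, not values taken from the table referenced above:

```c
#include <stdio.h>

/* CAS latency in nanoseconds = latency in cycles / I/O clock in GHz.
 * DDR transfers twice per clock, so e.g. DDR3-1600 runs an 800 MHz
 * I/O clock. Timings below are typical retail parts (illustrative). */
int main(void)
{
    struct { const char *name; int cl; double mhz; } m[] = {
        { "DDR3-1333 CL9",  9,  666.67 },
        { "DDR3-1600 CL10", 10, 800.00 },
        { "DDR3-2133 CL13", 13, 1066.67 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-14s: %.1f ns\n", m[i].name,
               m[i].cl / (m[i].mhz / 1000.0));
    return 0;
}
```

                      That prints roughly 13.5 ns, 12.5 ns, and 12.2 ns: the cycle count climbs with frequency, but the wall-clock latency stays flat or slightly improves.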

                    • by Luckyo ( 1726890 )

                      Took me a while to figure out what you're talking about. That's some exotic trolling. Well done. Shame no one cares about it this far down the chain.

                      Your case was specifically addressed long ago when I mentioned the costs. You've linked to a standards table which addresses what kinds of memory are made. It's correct to state that in those standards, CAS latency generally gets net better as frequency goes up. What you are trolling about is costs - a subject mentioned at the very beginning.

              • by Bengie ( 1121981 )

                Memory speed can technically still be the bottleneck

                And taking a piss before you head to work can save you gas money. Your link shows an 80% increase in memory speed giving a 1.7% increase in performance. Congrats, you just doubled your memory's power consumption.

          • the issue is now exasperated.

            Not being a pedant, just trying to be helpful: The word that you are looking for is exacerbated.

      • FTFP. "We induce errors in most DRAM modules (110 out of 129) from three major DRAM manufacturers."

        Short version: leakage current from adjacent gates can nudge others to bit-flip. I don't think this is a manufacturing problem so much as a fundamental EE design oversight. So yeah, defective by design (unintentionally)!!

        So, as DDR3 gets more dense and the space between cells decreases, we should be standardizing on ECC memory for all desktops and servers. The second thought I have is: what minimal CPU clock speed would enable this activity to occur with standard hardware? Is this problem likely to occur with off-the-shelf motherboards and CPUs?

    • by Rei ( 128717 ) on Wednesday December 24, 2014 @09:32AM (#48666531) Homepage

      If you're wanting to narrow it down, you won't like this line from the paper:

      In particular, all modules manufactured in the past two years (2012 and 2013) were vulnerable,

      It's pretty clever, and something I always wondered would be possible. They're exploiting the fact that DRAM rows need to be refreshed every so often because they leak charge, and would eventually fall below the noise threshold and become unreadable. Their exploit works by running code that - by heavily, cyclically reading rows - makes adjacent rows leak faster than expected, so they fall below the noise threshold before they get refreshed.
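      A quick way to see the effect (roughly what later memtest86 "hammer test" builds do) is to fill a buffer with a known pattern, hammer two addresses hard, then scan for flips. A minimal sketch, with loud assumptions: x86-64 with SSE2, a ROW_STRIDE guess standing in for the real address-to-row mapping, and no guarantee the two addresses actually land in the same bank:

```c
#include <emmintrin.h>  /* _mm_clflush (SSE2) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define BUF_SIZE   (64u * 1024 * 1024)
#define ROW_STRIDE (8 * 1024)   /* hypothetical row-to-row distance */

int main(void)
{
    uint8_t *buf = malloc(BUF_SIZE);
    if (!buf)
        return 1;
    memset(buf, 0xFF, BUF_SIZE);   /* known pattern: all bits set */

    volatile uint8_t *x = buf + BUF_SIZE / 2;
    volatile uint8_t *y = buf + BUF_SIZE / 2 + ROW_STRIDE;

    /* Hammer: read two rows over and over, flushing the cache so
     * every read re-activates the rows in DRAM. */
    for (long i = 0; i < 10000000L; i++) {
        (void)*x;
        (void)*y;
        _mm_clflush((const void *)x);
        _mm_clflush((const void *)y);
    }

    /* Any byte that is no longer 0xFF lost a bit to a flip. */
    for (size_t i = 0; i < BUF_SIZE; i++)
        if (buf[i] != 0xFF)
            printf("bit flip at offset %zu: 0x%02x\n", i, buf[i]);

    free(buf);
    return 0;
}
```

      On an unaffected module this should print nothing; whether it finds flips on a vulnerable one depends heavily on the stride guess actually hitting same-bank rows.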

      • That PDF has a lot of details, but TL;DR: you were able to condense it into a single paragraph that we can read in a few seconds.

        Thank you.

      • It sounds like you know a bit about modern DRAM architecture. Datasheets nowadays are not available to the public, so it's hard to figure out basic things, like how much power is burned in the DRAM in a simple loop. Do you have a simple rule of thumb for modern DRAM power loss? If I understand correctly, static power is minimal, but dynamic power can run to several watts.

        • by tlhIngan ( 30335 )

          Datasheets nowadays are not available to the public

          Datasheets ARE publicly available. However, they're for the actual DRAM ICs themselves, and not of the modules.

          There are only a few DRAM manufacturers out there - Samsung, Hynix, Elpida, Micron are among them.

          Samsung Computing DRAM [samsung.com] (they also have Graphics DRAM and others). Some of their newest chips don't have datasheets yet, but that'll be forthcoming. The older ones in production do, however.

          Hynix [skhynix.com]

          Micron (and Elpida) [micron.com].

          These are all generally available. Sin

      • So, other than fixing the DRAM design, the solution is to refresh more frequently. A software fix might be a high-priority background program that forces a full refresh at regular intervals (probably a big performance hit). If the CPU does its own DRAM control, there might be a register that affects the refresh rate, or perhaps a microcode fix.

        The problem is analog in nature, which suggests that optimized and very clean supply voltages, and very clean and precisely timed control signals might reduce or eliminat
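        For a sense of how much more frequent the refresh would need to be, a back-of-the-envelope sketch (both inputs are assumptions: the ~139,000-reads figure cited elsewhere in this discussion, and a typical DDR3 row cycle time of roughly 49 ns):

```c
#include <stdio.h>

/* Assumed inputs: ~139,000 row activations within one 64 ms refresh
 * window are enough to flip bits (figure cited in this discussion),
 * and each activation takes about one row cycle time, tRC ~ 49 ns. */
int main(void)
{
    const double activations = 139000.0;
    const double t_rc_ns     = 49.0;
    const double hammer_ms   = activations * t_rc_ns / 1e6;

    printf("time to do %.0f activations: %.1f ms\n",
           activations, hammer_ms);
    printf("refresh would need to be ~%.0fx more frequent\n",
           64.0 / hammer_ms);
    return 0;
}
```

        That works out to roughly 6.8 ms of hammering per window, i.e. refreshes would have to come nearly an order of magnitude faster, which is why the performance hit mentioned above is a real concern.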

  • As for me, I'll wait for some real-world examples of this possible exploit before I switch to ECC memory, which would mean a new MB on top of the more expensive memory.
    • by Rei ( 128717 ) on Wednesday December 24, 2014 @09:35AM (#48666533) Homepage

      According to the paper, ECC only reduces but does not eliminate the problem (section 6.3). Multiple bits can be corrupted at once.

      • Comment removed based on user account deletion
        • by Anonymous Coward

          Welcome to the Digital Dark Age

        • by 0123456 ( 636235 )

          Ouch! Seriously bad. Worse than the Pentium FPU bug (and that's bad). What good is a computer if you can't rely on the data being committed back to disk because of corruption mid-flight in RAM?!

          It apparently only happens if you read the same bytes from RAM 139,000 times in 64 milliseconds. If your program is doing that, you probably have a lot more to worry about than disk corruption.

          If this was actually happening in the real world, computers would probably be crashing every few minutes.
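          For scale, the arithmetic on that figure is simple (the only input is the 139,000-reads-in-64-ms number quoted above):

```c
#include <stdio.h>

/* 139,000 reads in one 64 ms refresh window is one read every
 * ~460 ns. That rate is easy for a tight loop to hit, but only if
 * it defeats the cache (e.g. with clflush); a normal program
 * re-reading one address would be served from cache instead. */
int main(void)
{
    printf("ns per access: %.0f\n", 64e6 / 139000.0);
    return 0;
}
```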

          • If this was actually happening in the real world, computers would probably be crashing every few minutes.

            You mean attackers have been exploiting this ever since Windows 95?

        • Worse problem: VM server farms. If you can run arbitrary code, you might be able to flip bits in the hypervisor or another VM.
      • by sshir ( 623215 ) on Wednesday December 24, 2014 @10:19AM (#48666757)
        At least with ECC you'll get _some_ feedback (it's random so it will pop from time to time) indicating that something fishy is going on. With regular ram all corruptions are silent so you'll get random crashes that will drive you crazy...
      • Difference being that the system is immediately halted if an uncorrectable error is discovered.

  • by rossdee ( 243626 ) on Wednesday December 24, 2014 @09:47AM (#48666595)

    Of course, if you can get the target computer to run certain code, you can completely wipe all the RAM, but where's the fun in that, huh?

    • This gives you a way to affect RAM outside of a sandbox.

      • by 0123456 ( 636235 )

        This gives you a way to affect RAM outside of a sandbox.

        Only if the sandbox lets you repeatedly access memory and flush the cache between accesses, and you happen to know where your data is in physical RAM.

      • It depends a bit on the physical structure of the RAM, but for the most part, the errors fall on logically adjacent rows (i.e. nearby memory addresses) in the RAM. So most of the time, you'll only affect other RAM inside your sandbox, and if you affect something outside the sandbox, it won't be far outside.

        I remember encountering a similar failure when designing a system; the particular memory controller and the particular DRAM module we were using both met all applicable specs, but when used together in a

  • Do the cache control instructions require root access on Windows or Linux?

    • by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Wednesday December 24, 2014 @10:31AM (#48666831) Homepage
      No. These are standard instructions that many apps require to function correctly when using multiple threads. Even if you aren't using them directly, at least some of the APIs you use most certainly are.
      • by 0123456 ( 636235 )

        No. These are standard instructions that many apps require to function correctly when using multiple threads.

        Can you explain when you'd need to flush the cache when using multiple threads? You'd have to flush the cache back to RAM (isn't that a privileged instruction?), invalidate it, then read the data back from RAM. That's surely insanely slow compared to just using the CPU's internal cache coherency mechanisms?

  • "just two memory reads with special relative offset and some cache control instructions in a tight loop" Yuh hurt yer what?
  • by Anonymous Coward

    The authors did a good job of covering the issue.
    Also, the paper is a good primer on DRAM stuff in general.
    Unfortunately, this Christmas present violates the Engineer's first rule:
    try to stay out of the news, because when you are in the news, it's usually not a good thing.

    The failure mechanism:

    There is a bug in most DDR3 chips, especially those built after 2010.
    If you do too many read cycles in too short a time to the same row, some bits in an adjacent row may automagically flip.

  • Way back when RAM was stupid expensive, one way to reduce cost was to use so-called composite RAM. On high-end Macs back in the early-mid 1990s, that could cause the machine to not boot but instead play the first four notes of the Twilight Zone theme song.

  • Unless you are making a Speak-and-Spell, it's foolish not to use non-ECC RAM. I would rather pay an additional ninth as much and have some peace of mind that the RAM will at least keep from flipping a bit from cosmic rays, which happens about once a week.

    I take that back; put it in the Speak-and-Spell, too.

    • I assume you meant "it's foolish to use non-ECC RAM".

      • Why was my comment moderated "Troll" when I merely pointed out that the parent had unintentionally inserted an extra negative in his statement? The drift of his comment was surely that ECC RAM is better. Yet he wrote "it's foolish not to use non-ECC RAM".

        It's sad that moderators don't take the trouble to read what is in front of them. Or, worse still, that at least one moderator routinely mods my comments "Troll" without reading them.

    • This is true. However, getting a laptop with ECC RAM straight from the manufacturer is never an option, and impossible when RAM is soldered onto the motherboard. I think if Apple started using ECC RAM, and advertised it, others might follow suit (like with the "retina" displays).
    • How foolish, and for what specific workload? I have a gaming rig where I sometimes edit photos, do 3D design, and some light coding. In the past 10 years I've never seen any visible data corruption and never had an inexplicable crash.

      So tell me again why I should spend the money? Your once-a-week problem sounds more theoretical than practical.

  • by Anonymous Coward

    Sort of an already-known 'weakness'; recent memtest86 releases include a 'hammer test' for the purpose of testing this case, see http://www.passmark.com/forum/showthread.php?4836-MemTest86-v6-0-Beta

  • This paper was published at ISCA in June and covered on Soylent News earlier today (or possibly yesterday). Why is it suddenly being circulated six months after publication? Someone trying to promote ECC memory?
  • Known issue (Score:5, Informative)

    by Anonymous Coward on Wednesday December 24, 2014 @12:10PM (#48667487)

    This has been known for some time. It's been referred to as "Row Hammer" and has been discussed at length by Intel and DRAM manufacturers.

    https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#safe=off&q=intel%20row%20hammer

    I've seen it cause multi-bit errors in ECC systems.

  • Liquid nitrogen for your RAM then...?

  • This is the reason I recommend that everyone invest in write-only memory for their computers. It is far more secure and hack proof than the alternatives.

  • A story I heard about a mid-20th-century IBM mainframe (I think it was the 360 series):

    Core memory was tight and had cooling issues. The designers examined the instruction set and determined that, given caching and the like, no infinite loop could hammer a particular location more than one cycle in four (a 25% duty cycle), for which cooling was adequate. So they shipped.

    Turns out, though, you could do a VERY LONG FINITE loop that hit a location every other cycle, for 50% duty cycle (not to mention the possib

  • Read disturb was already known for flash memory. Read disturb is when a flash cell flips a bit because other cells adjacent to the disturbed cell are repeatedly read.

  • by drolli ( 522659 ) on Wednesday December 24, 2014 @10:32PM (#48671063) Journal

    That's an evil bug. This could even be triggered accidentally by bad programming.

    But more important, this allows you to break your VM's memory boundaries without any restriction. If you happen to make an educated guess about the memory layout of the physical machine and the host and guest kernel images loaded, you can try to:

    a) manipulate the host kernel directly (that would be nearly undetectable)

    b) manipulate private keys in other VMs or the host

    c) manipulate other VMs memory

    d) communicate between VMs

    And all of this is independent of any software bug. The only thing that can be done about it would be to disable the feature on the simulated guest processor that allows arbitrary cache manipulation (and implicitly limit running guest programs to 1 core!). Alternatively, increase the refresh rate (I remember that the refresh rate could actually be set manually in the '90s).

    That being said, I just wonder if it is possible to trigger this bug from a high-level language (e.g. Matlab) or the JVM, where the operation causing the problem could be used implicitly by some vectorized code or other operations. E.g., can this bug be triggered by the volatile keyword in Java and accessing the memory in the same way?

    • It's not possible to do any of those.

      1. The mechanism that this uses doesn't provide for deterministic results. At worst, rewriting the same row numerous times may result in some of the bits in spatially related rows being corrupted.

      2. Address spaces are highly randomized and virtual to physical translation makes it incredibly difficult to obtain even an educated guess as to the layout.

      This exploit just allows an attacker to possibly corrupt nearby data. It's a troll tool, nothing else.

      • by drolli ( 522659 )

        Maybe. Maybe not. Not sure what the effect of second-order page translation would be if you manage to trigger the loading of a module (or the first use of memory in a module) in another VM after your VM has been loaded. If you manage to trigger access to the module's data memory, which normally may be unused after you allocate ("pad") enough memory, I could imagine that you can actually kill "nearby" data (which under second-order translation would appear physically close to your memory).

        I am not saying that
