Many DDR3 Modules Vulnerable To Bit Rot By a Simple Program
New submitter Pelam writes: Researchers from Carnegie Mellon and Intel report that a large percentage of tested regular DDR3 modules flip bits in adjacent rows (PDF) when a voltage in a certain control line is forced to fluctuate. The program that triggers this is dead simple: just two memory reads with a special relative offset and some cache control instructions in a tight loop. The researchers don't delve deeply into applications of this, but hint at possible security exploits. For example, a rather theoretical attack on the JVM sandbox using random bit flips (PDF) has been demonstrated before.
Applications include... crashing computers. (Score:1)
I don't know if there are hundreds or thousands or hundreds of thousands of low level 'bugs' like this related to simple subsystems abused in specific ways.. but there are plenty.
Re: (Score:1)
Halt and Catch Fire [wikipedia.org].
Re: (Score:2)
I would guess that it's theoretical because it involves things like knowing exactly where the JVM is positioned in physical memory, and how its pages are laid out. That, and that the demonstration involved knowing all of these things before you started.
Not theoretical. It's hogwash. (Score:5, Funny)
This is ridiculous. Realistically, when have you ever run into a situation where stib teg ylirartibra deppilf?
Re: (Score:1)
Many DDR3 modules? (Score:4, Insightful)
This is all very interesting but totally pointless! Which modules? Tell us the brands, model names, manufacturer numbers?
Comment removed (Score:5, Insightful)
Re: (Score:2)
It also means that 19 out of 129 DRAM modules are not affected by this problem, hence my question.
Comment removed (Score:5, Interesting)
Re: (Score:2)
Overwhelming majority of "PC gaming benchmark queens" wouldn't give a toss because memory speed hasn't been a bottleneck in gaming in many years.
People who would care are ordinary users and OEMs who would have to absorb the extra cost. Especially to OEMs costs are far from trivial.
Re: (Score:3)
Re: (Score:2)
I'm not sure whether I'm more bothered by "benchmark queens" or people who flame over their subjective opinions. The latter are a lot like "audiophiles", unwilling to believe in blind testing.
Re: (Score:2)
'I'm not sure whether I'm more bothered by "benchmark queens" or people who flame'.
FTFY. Does anyone ever flame about anything except subjective opinions?
Re:Many DDR3 modules? (Score:4, Funny)
Climate change... [ducks].
Re: (Score:2)
That crazy theory again!?!
Sir, I assure you that ducks absolutely do not cause climate change!
But note that pirates can slow it.
Re: (Score:3)
I'm also bothered by people who put the word audiophiles in scare quotes for no good reason. P.S. Not all audiophiles are opposed to blind testing; some people like expensive audio toys that are objectively better too.
Re: (Score:2)
They aren't scare quotes - they are there to differentiate people who think they can hear things that they really can't from people who truly chase better sound. If I hear anything about oxygen in your speaker wire, you'll get the quotes.
Re: (Score:2)
My understanding was that oxygen free copper is supposed to be more fatigue tolerant so that it gives better plug-unplug endurance, not better sound.
Re: (Score:2)
I've seen nonsense about inductance and capacitance. And then it'll be stranded. Oy.
Most people are using it to make a permanent connection in their homes with stranded wire... so endurance, fatigue, corrosion are all non-issues. I would wager a very high sum of money that double-blind testing would result in no perceptible difference [consumerist.com].
Re: (Score:2)
Oops, had meant to say that in my comment - that very few people will need the "endurance" - I completely agree. I have to admit that I got suckered into buying zero-oxygen-copper cables (it sounds good, doesn't it), until I decided to check what it actually meant - zilch!
Re: (Score:3)
ALL of that audiophile stuff sounds good (pun intended).
Re: (Score:2)
Inductance and capacitance impact total impedance, and it is possible to find bad combinations where that turns into an easily measurable problem with the cable. See high cap wire [roger-russell.com] section of "Speaker Wire: A History" for how that comes out on a scope. It's very easy to find cases where the wire doesn't matter too. One of the funny things about objective audio testing is that people usually find what they set out to, because it's so easy to set up tests to give the results you want. That doesn't disprove
Re: (Score:2)
Oxygen-free copper [wikipedia.org] is very much a real thing, and it does matter for some applications. The only part that's hard to support is whether those differences are audible in home audio. All other things being equal between two cables, it shouldn't matter. (All other things are usually not equal)
Re: (Score:2)
Perhaps you can measure things on a scope, but that doesn't mean the difference is perceptible. It's not my money, so I don't really care what audiophiles do with it - but they also seem to expect me to be impressed, which I am not. I politely nod but honestly think they are just burning their money. I can't take someone seriously who thinks that oxygen makes a perceptible difference in audio, and then think nothing of using stranded wire vs. solid. Even with an oscilloscope, the stranded vs. solid will be
Re: (Score:2)
Some ludicrously overpriced cable aimed at the mass market is stranded, with Monster being the biggest offender by volume. But most of the really expensive speaker cable is solid core instead of stranded, with the core size limited only by how flexible the cable needs to be. The stuff I like uses a number of 14 AWG wires that total to match 12 AWG. I've tried using twisted pairs of 12 AWG copper instead, just basic power cable from Home Depot, but I can barely route the stuff. I like the cables (and amp
Re: (Score:2)
Audio queen here, you probably mean double blind testing.
Re: (Score:2)
Even blind testing would be an improvement.
Re: (Score:2)
Memory speed can technically still be the bottleneck on large memory footprint games like BF4; see the bit-tech review [bit-tech.net] for some numbers on that. The people chasing after PC gaming benchmarks reflexively use the fastest memory around though, and if you do that it's less likely for memory to dictate the speed limits.
Re: (Score:2)
This used to be the problem back in the day before DDR3, true. After DDR3 got to around 1333-1600MHz, the problem was effectively eliminated in favour of latency being the only reasonable bottleneck. And that actually gets worse rather than better when you increase the frequency
The tests you link show exactly that - no noticeable difference. They're looking at 1-2% difference between 1333 modules and 2400 modules. Because that is not the bottleneck. System is bottlenecked elsewhere, most likely on GPU. If th
Re: (Score:2)
That's not how it works. The way you spot a bottleneck in performance work is that if you change anything else, there is zero impact on the resulting system speed. Conversely, if you alter something and the system really does get faster, you must have just hit one of the bottlenecks.
Given that, the way high detail performance goes from 83 to 86 FPS as RAM speed increases means that RAM speed must have been a bottleneck. If speed had been strictly limited by the video card instead, speeding up the memory
Re: (Score:2)
Actually that is how it works. Concept of a bottleneck refers to aspect of a pipe+pool system where thickness of the pipe is the limiting factor and increasing width of the pipe offers a comparable increase in flow throughput.
When you double pipe's thickness and get 1-2% more flow, it means that your system's bottleneck is elsewhere.
Re: (Score:2)
The latency at higher clock frequency does not increase in the way you suggest. It only appears that way because latency is measured in clock cycles so when the clock cycle is halved, twice as many are needed for a given duration.
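To put rough numbers on that: absolute CAS latency in nanoseconds is just the cycle count times the clock period. Here is a minimal Python sketch, assuming some typical-looking DDR3 speed/CL pairings (the pairings are my own illustrative picks, not figures from this thread) and remembering that DDR does two transfers per I/O clock.

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """Absolute CAS latency in nanoseconds.

    DDR moves two transfers per I/O clock, so the clock in MHz is
    data_rate_mts / 2 and one clock period is 2000 / data_rate_mts ns.
    """
    clock_period_ns = 2000.0 / data_rate_mts
    return cas_cycles * clock_period_ns


# (data rate in MT/s, CAS latency in cycles): illustrative pairings only
modules = [(1066, 7), (1333, 9), (1600, 11), (2133, 14)]

for rate, cl in modules:
    print(f"DDR3-{rate} CL{cl}: {cas_latency_ns(rate, cl):.2f} ns")
```

With pairings like these the cycle count doubles from CL7 to CL14 while the time in nanoseconds barely moves. Whether a particular cheap module really hits the CL assumed here is a separate question.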
Re: (Score:2)
Where did I post anything to suggest what you're suggesting?
It's well known that increasing RAM frequency impacts latency in net negative way. Your suggestion implies that impact is neutral, when it's rarely so unless you buy much more expensive RAM specifically picked and binned for those frequencies and latencies. Typical RAM sold incurs significant net negative impact on latency as frequency increases. Alternative is lower reliability.
Anyone who did any overclocking and worked with RAM memory doing it sh
Re: (Score:2)
And that [latency] actually gets worse rather than better when you increase the frequency
Increasing the RAM frequency has little or no effect on latency; it only changes the unit of measurement. Latency as measured in clock cycles goes up but latency measured in nanoseconds stays roughly the same (actually it generally gets better) and it is the latter which matters as far as the processor is concerned.
The first word access time shown in this table
Re: (Score:2)
Took me a while to figure out what you're talking about. That's some exotic trolling. Well done. Shame no one cares about it this far down the chain.
Your case was specifically addressed long ago when I mentioned the costs. You've linked to a standards table which addresses what kinds of memory are made. It's correct to state that in those standards, CAS latency generally gets net better as frequency goes up. What you are trolling on is costs - subject mentioned at the very beginning.
Re: (Score:2)
Memory speed can technically still be the bottleneck
And taking a piss before you head to work can save you gas money. Your link shows an 80% increase in memory speed giving a 1.7% increase in performance. Congrats, you just doubled your memory's power consumption.
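For what it's worth, those two numbers can be fed into Amdahl's law to guess how much of the frame time actually depends on memory speed. This is a rough Python sketch, assuming the simple Amdahl model applies at all and that "80% faster memory" means the memory-bound part runs 1.8x faster. Real games overlap memory and GPU work, so treat the result as a ballpark.

```python
def memory_bound_fraction(component_speedup, overall_speedup):
    """Invert Amdahl's law, overall = 1 / ((1 - p) + p / s), to solve for p."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / component_speedup)


def amdahl_speedup(fraction, component_speedup):
    """Forward Amdahl's law: overall speedup when `fraction` of the work speeds up."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# 80% faster memory, 1.7% faster overall (the figures quoted above)
p = memory_bound_fraction(1.80, 1.017)
print(f"memory-bound fraction of frame time: {p:.1%}")
```

Under those assumptions the memory-bound share comes out just under 4%, which is why even a large memory speed bump shows up as a percent or two.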
Re: (Score:2)
Reminds me of the first time I ever heard this particular discussion: at DEC in about 1983. A colleague who had gone to do quality engineering on VAX/VMS systems asked for statistics on crashes caused by memory errors. All VAX computers had built-in ECC (of course), but the advanced thinkers in engineering were wondering if it would be more cost-effective to do without. Money would be saved, both by the manufacturer and the customer, and systems would run significantly faster (maybe). Surely that would be w
Re: (Score:2)
To this day I always ask for ECC whenever I buy a new PC - but the only machines I have ever found that had it were Dell workstations.
Always ECC user here as well. With Intel, only Xeon systems come with ECC support in the chipset. You are actually looking for any workstation level computer with a Xeon chip, although Dell is the only outfit with an even semi reasonable price.
Re: (Score:2)
Plenty have it on the server side. Just use a server board in your desktop.
Re: (Score:2)
That is indeed the problem with many technologies. "If they were standard, their costs would be much cheaper".
At which point the question becomes that of "is this functionality actually needed as a standard in most use scenarios?"
For ECC memory, this question was asked ever since the early 80s and the answer is still "no".
Re: (Score:2)
the issue is now exasperated.
Not being a pedant, just trying to be helpful: The word that you are looking for is exacerbated.
Re: (Score:2)
FTFP. "We induce errors in most DRAM modules (110 out of 129) from three major DRAM manufacturers."
Short version, leakage current from adjacent gates can nudge others to bit-flip. I don't think this is a manufacturing problem so much as it is a fundamental EE design oversight. So yeah, defective by design (unintentionally)!!
So, as DDR3 gets more dense and the space between cells decreases, we should be standardizing on ECC memory for all desktops and servers. The second thought I have is "What minimal CPU clock speed would enable this activity to occur with standard hardware?" Is this problem likely to occur with off-the-shelf motherboards and CPUs?
Re:Many DDR3 modules? (Score:5, Informative)
If you're wanting to narrow it down, you won't like this line from the paper:
It's pretty clever, and something I always wondered whether would be possible. They're exploiting the fact that DRAM rows need to be read every so often to refresh them because they leak charge, and eventually would fall below the noise threshold and be unreadable. Their exploit works by running code that - by heavily, cyclicly reading rows - makes adjacent rows leak faster than expected, leading to them falling below the noise threshold before they get refreshed.
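To make the mechanism concrete, here is a deliberately crude toy model in Python. It is only a sketch: the linear leak, the threshold and every constant are invented for illustration and are not taken from the paper. Only the 64 ms refresh window and the 139,000-access figure are numbers quoted elsewhere in this thread.

```python
def survives_refresh_window(neighbor_activations,
                            refresh_interval_ms=64.0,
                            natural_leak_per_ms=0.005,
                            disturb_per_activation=2e-6,
                            threshold=0.5):
    """Return True if a cell's charge is still readable at the next refresh.

    Toy linear model: the cell starts fully charged (1.0), loses a little
    charge per millisecond on its own, and loses a little more each time an
    adjacent row is activated. All constants are hypothetical.
    """
    charge = 1.0
    charge -= natural_leak_per_ms * refresh_interval_ms
    charge -= disturb_per_activation * neighbor_activations
    return charge >= threshold


# Ordinary access pattern: the neighboring row is rarely activated.
print(survives_refresh_window(1_000))

# Hammering: the neighboring row is activated 139,000 times in one window.
print(survives_refresh_window(139_000))

# Refreshing twice as often halves both the window and the activations in it.
print(survives_refresh_window(139_000 // 2, refresh_interval_ms=32.0))
```

In this toy the cell has plenty of margin under normal use, drops below the threshold when its neighbor is hammered, and recovers its margin if the refresh interval is halved. Real cells vary widely and do not leak linearly, so the model shows only the direction of the effect, not its size.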
Re: (Score:2)
That PDF has a lot of details but TL;DR, you were able to condense it into a single paragraph that we can read in a few seconds.
Thank you.
Re: (Score:2)
It can, but the chances of it staying perfectly readable is very small. And realize that removing RAM from a machine puts it under a very different condition than intentionally accessing the RAM in a pattern which causes faster than normal leakage, so the results aren't mutually exclusive.
Re: (Score:2)
If the module is supercooled quickly after it's removed, it can be minutes before RAM bits start to wipe. Even if they do, RAM bits "erode" in a predictable manner allowing for information to be rebuilt if not degraded enough after power-down.
Re: (Score:2)
But I was assured that DRAM stays readable for minutes after they're removed from the machine?
http://it.slashdot.org/story/0... [slashdot.org]
Not if adjacent rows are being heavily, cyclicly read.
Re: (Score:2)
In those cases, there tend to be a LOT of errors. The risk is that enough will read correctly to leak valuable information like passwords. Also, in those cases the memory is not active.
Re: (Score:2)
It sounds like you know a bit about modern DRAM architecture. Data sheets nowadays are not available to the public, so it's hard to figure out basic things, like how much power is burned in the DRAM in a simple loop. Do you have a simple rule of thumb for modern DRAM power loss? If I understand correctly, static power is minimal, but dynamic power can generate several watts of power.
Re: (Score:3)
Datasheets ARE publicly available. However, they're for the actual DRAM ICs themselves, and not of the modules.
There are only a few DRAM manufacturers out there - Samsung, Hynix, Elpida, Micron are among them.
Samsung Computing DRAM [samsung.com] (they also have Graphics DRAM and others). Some of their newest chips don't have datasheets yet, but that'll be forthcoming. The older ones in production do, however.
Hynix [skhynix.com]
Micron (and Elpida) [micron.com].
These are all generally available. Sin
Re: (Score:3)
So, other than fixing the dram design, the solution is to refresh more frequently. A software fix might be a high priority background program that forces a full refresh at regular intervals (probably a big performance hit). If the CPU does its own dram control, there might be a register that affects refresh rate, or perhaps a microcode fix.
The problem is analog in nature, which suggests that optimized and very clean supply voltages, and very clean and precisely timed control signals might reduce or eliminat
Re: (Score:2)
Wear Leveling?
Leakage Leveling?
P.S. Question is whether a workaround is possible with the CPU microcode.
Re: (Score:2)
ECC is dismissed in the article, but the article ignores that ECC systems also have a scrubbing capability [wikipedia.org]
Unfortunately, ASUS is the only manufacturer that consistently includes ECC support in their AMD based motherboard line.
good news for ECC memory makers (Score:1)
Re:good news for ECC memory makers (Score:4, Informative)
According to the paper, ECC only reduces but does not eliminate the problem (section 6.3). Multiple bits can be corrupted at once.
Re: (Score:3)
Re: (Score:1)
Welcome to the Digital Dark Age
Re: (Score:2)
Ouch! Seriously bad. Worse than the Pentium FPU bug (and that's bad). What good is a computer if you can't rely on the data being committed back to disk because of corruption mid-flight in RAM?!
It apparently only happens if you read the same bytes from RAM 139,000 times in 64 milliseconds. If your program is doing that, you probably have a lot more to worry about than disk corruption.
If this was actually happening in the real world, computers would probably be crashing every few minutes.
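For scale, here is what that rate works out to. This is plain arithmetic in Python on the two figures quoted above, nothing more.

```python
ACCESSES = 139_000   # reads of the same row, as quoted above
WINDOW_S = 64e-3     # the 64 ms window quoted above


def access_interval_ns(accesses, window_s):
    """Average time between accesses, in nanoseconds."""
    return window_s / accesses * 1e9


def access_rate_per_s(accesses, window_s):
    """Average number of accesses per second."""
    return accesses / window_s


print(f"{access_interval_ns(ACCESSES, WINDOW_S):.0f} ns between accesses")
print(f"{access_rate_per_s(ACCESSES, WINDOW_S):,.0f} accesses per second")
```

That is roughly one access every 460 ns, or a bit over two million per second, sustained against the same row with the cache defeated each time.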
Re: (Score:3)
If this was actually happening in the real world, computers would probably be crashing every few minutes.
You mean attackers have been exploiting this ever since Windows 95?
Re: (Score:2)
Re:good news for ECC memory makers (Score:5, Insightful)
Re: (Score:3)
ECC does not mitigate it, but it will detect the problem where non-ECC memory will happily keep on operating with the corrupted data.
For the standard car analogy, consider tire pressure monitoring systems. They won't stop you from getting a flat, but they'll let you know you have a slow leak where you might otherwise keep driving until it's bad enough that you notice otherwise. By that time the damage is done and you probably need a new tire.
Re: (Score:2)
The test numbers in section 6.3 show that ECC mitigates most of the errors, as the bulk of them are single bit ones. And if you're on a system that's prone to this problem, the odds are you will see a warning about that ECC correction kicking in long before you'll hit one of the uncorrectable multi-bit errors.
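To illustrate the single-bit versus multi-bit distinction, here is a scaled-down SECDED (single error correct, double error detect) sketch in Python. It uses an extended Hamming (8,4) code on 4 data bits purely for illustration. As far as I know real ECC DIMMs protect 64 data bits with 8 check bits, and the function names here are my own.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_error = sum(code) % 2     # 1 means an odd number of bits flipped
    if syndrome == 0 and parity_error == 0:
        status = "ok"
    elif parity_error == 1:
        code[syndrome] ^= 1          # syndrome 0 points at the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None  # even number of flips: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))
```

A single flipped bit is repaired silently, two flipped bits in the same word are only detected, and three or more can be miscorrected. The last case is the scenario where ECC stops being a complete defense.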
Re: (Score:3)
Difference being that the system is immediately halted if an uncorrectable error is discovered.
Malicious code can cause computers to crash (Score:3)
Of course if you can get the target computer to run certain code, you can completely wipe all the RAM, but where's the fun in that, huh..
Re: (Score:3)
This gives you a way to affect RAM outside of a sandbox.
Re: (Score:2)
This gives you a way to affect RAM outside of a sandbox.
Only if the sandbox lets you repeatedly access memory and flush the cache between accesses, and you happen to know where your data is in physical RAM.
Re: (Score:2)
Ah, yes, well I should have said "possibly" :)
Re: (Score:2)
It depends a bit on the physical structure of the RAM, but for the most part, the errors fall on logically adjacent rows (i.e. nearby memory addresses) in the RAM. So most of the time, you'll only affect other RAM inside your sandbox, and if you affect something outside the sandbox, it won't be far outside.
I remember encountering a similar failure when designing a system; the particular memory controller and the particular DRAM module we were using both met all applicable specs, but when used together in a
Does the cache control commands require root acces (Score:3)
Do the cache control commands require root access on Windows or Linux?
Re:Does the cache control commands require root ac (Score:5, Informative)
Re: (Score:2)
No. These are standard instructions that many apps require to function correctly when using multiple threads.
Can you explain when you'd need to flush the cache when using multiple threads? You'd have to flush the cache back to RAM (isn't that a privileged instruction?), invalidate it, then read the data back from RAM. That's surely insanely slow compared to just using the CPU's internal cache coherency mechanisms?
Re: (Score:2)
That seems more likely, but, when I was writing DMA code years ago, we put the buffers in non-cached RAM (and they were only written to from a driver in the kernel). Maybe explicit cache flushes are faster these days.
Re: (Score:2)
AMD has higher overall throughput for many GPU type work loads, but Intel shines with work loads that require thread syncing.
Why we need coders. (Score:1)
Re: (Score:1)
XD
Wow, a Forgetful Christmas Bug (Score:1)
The authors did a good job of covering the issue
Also, the paper is a good primer on dram stuff in general.
Unfortunately, this Christmas present violates the Engineer's first rule.
Try to stay out of the news, because when you are in the news, it's usually not a good thing.
The failure mechanism:
There is a bug in most DDR3 chips, especially those built after 2010.
If you do too many read cycles in too short a time to the same row, some bits in an adjacent row may automagic
Re: (Score:2)
Thank you. Very helpful of you.
Re: (Score:2)
This makes me think of mid-90s Macs (Score:2)
Way back when RAM was stupid expensive, one way to reduce cost was to use so-called composite RAM. On high-end Macs back in the early-mid 1990s, that could cause the machine to not boot but instead play the first four notes of the Twilight Zone theme song.
Using Non-ECC Ram is Unacceptable (Score:2, Insightful)
Unless you are making a Speak-and-Spell, it's foolish not to use non-ECC RAM. I would rather pay an additional 9th as much and have some peace of mind that the RAM will at least keep from flipping a bit from cosmic rays, which happens about once a week.
I take that back; put it in the Speak-and-Spell, too.
Re: (Score:1)
I assume you meant "it's foolish to use non-ECC RAM".
Re: (Score:2)
Why was my comment moderated "Troll" when I merely pointed out that the parent had unintentionally inserted an extra negative in his statement? The drift of his comment was surely that ECC RAM is better. Yet he wrote "it's foolish not to use non-ECC RAM".
It's sad that moderators don't take the trouble to read what is in front of them. Or, worse still, that at least one moderator routinely mods my comments "Troll" without reading them.
Re: (Score:2)
Re: (Score:2)
How foolish and for what specific workload? I have a gaming rig where I sometimes edit photos and do 3d design and some light coding. In the past 10 years I've never seen any visible data corruption and not had an inexplicable crash.
So tell me again why I should spend the money? Your once a week problem sounds more theoretical than practical.
memtest86 includes a test for this (Score:1)
Sort of an already known 'weakness'; recent memtest86 versions include the 'hammer test' for the purpose of testing this case, see http://www.passmark.com/forum/showthread.php?4836-MemTest86-v6-0-Beta
Why now? (Score:1)
Known issue (Score:5, Informative)
This has been known for some time. It's been referred to as "Row Hammer" and has been discussed at length by Intel and DRAM manufacturers.
https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#safe=off&q=intel%20row%20hammer
I've seen it cause multi-bit errors in ECC systems
So.... (Score:1)
Liquid nitrogen for your RAM then...?
Write-Only Memory (Score:2)
This is the reason I recommend that everyone invest in write-only memory for their computers. It is far more secure and hack proof than the alternatives.
Not the first time hammering caused trouble. (Score:2)
Story I heard about mid-20th-century IBM mainframe. (I think it was the 360 series).
Core memory was tight and had cooling issues. The designers examined the instruction set and determined that, given caching and the like, no infinite loop could hammer a particular location more than one cycle in four (25% duty cycle), for which cooling was adequate. So they shipped.
Turns out, though, you could do a VERY LONG FINITE loop that hit a location every other cycle, for 50% duty cycle (not to mention the possib
Already known for flash memory (Score:1)
Read disturb was already known for flash memory. Read disturb is when a flash cell flips a bit when other cells adjacent to the disturbed cell are repeatedly read.
Re: (Score:2)
We used to call this "pattern sensitivity" when applied to RAM.
Wow. Superbad. (Score:3)
That's an evil bug. This could even be triggered accidentally by bad programming.
But more important, this allows you to break your VM's memory boundaries without any restriction. If you happen to make an educated guess about the memory layout of the physical machine and the host and guest kernel images loaded, you can try to
a) manipulate the host kernel directly (that would be nearly undetectable)
b) manipulate private keys in other VMs or the host
c) manipulate other VMs' memory
d) communicate between VMs
And all of this independent of any software bug. The only thing which can be done about it would be to disable the feature on the simulated guest processor which allows manipulating the cache arbitrarily (and implicitly limit running guest programs to 1 core!). Alternatively, increase the refresh rate (I remember that the refresh rate could actually be set manually in the 90s).
That being said, I just wonder if it is possible to trigger this bug from a high level language (e.g. matlab) or the JVM, where the operation causing the problem could be used implicitly for some vectorized code or other operations, e.g. can this bug be triggered by the volatile keyword in Java and accessing the memory in the same way?
Re: (Score:2)
It's not possible to do any of those.
1. The mechanism that this uses doesn't provide for deterministic results. At worst, rewriting the same row numerous times may result in some of the bits in spatially related rows being corrupted.
2. Address spaces are highly randomized and virtual to physical translation makes it incredibly difficult to obtain even an educated guess as to the layout.
This exploit just allows an attacker to possibly corrupt nearby data. It's a troll tool, nothing else.
Re: (Score:2)
Maybe. Maybe not. Not sure what the effect of second order page translation would be if you manage to trigger the loading of a module (or the first use of memory in a module) in another VM after your VM has been loaded. If you manage to trigger the access to the module's data memory, which normally may be unused after you allocate ("pad") enough memory, I could imagine that you can actually kill "nearby" data (which in second order translation would appear physically close to your memory).
I am not saying that