The Hidden Danger of Memory Errors in CAD Computing
31 Jan, 2018 By: Alex Herrera
Herrera on Hardware: They may arise infrequently, but the consequences of small memory errors can be huge. How can you reduce the chance of errors — and should you bother?
Computers can make mistakes. No, I’m not referring to mistakes that stem from bugs in an application’s code or flaws in source data, or to those caused by a user’s erroneous keystrokes or mouse clicks. I’m talking about errors the machine’s hardware itself creates, and specifically errors in memory, through no fault of the user, the application, or the operating system.
Because such occurrences are rare, many would quickly dismiss concerns about them. But there are two good reasons to minimize such errors or, at the very least, to mitigate any damage they cause. First, despite the rarity, it’s important to consider the possible consequences; even one error can cause hefty, irreversible damage. And second, the cost and complexity of remedies are now so low that it’s difficult to argue against the additional investment, regardless of how unlikely it is that errors will arise.
Memory Errors: What Kind, How Often, and How Bad?
Memory bit errors come in two basic categories, both of which are problematic: persistent (“hard”) errors, caused by a hardware failure in a dynamic random-access memory (DRAM) chip or dual in-line memory module (DIMM, a small, motherboard-slotted card populated with the memory chips), and transient (“soft”) errors stemming from stored bits that get flipped, either while stored in memory or during the transmission of those bits between processor and memory.
Errors might occur in processor code (instructions) read from memory, or in data that the code is attempting to process. Both types can be catastrophic, but errors in data are arguably more worrisome than errors in instructions. Why? Because a bad instruction will typically cause an exception and crash the system — something you can’t help but notice — while bad data could go overlooked and provide credible, but erroneous, results.
The mechanisms producing the two kinds of errors are different as well. Hard errors, along with transmission errors caused by deficient or flaky electrical integrity between processor and memory, can be kept to an absolute minimum with diligent chip and motherboard design. But soft errors in stored bits are theoretically unavoidable, even on the healthiest of underlying hardware. Most notably, storage bits within the DRAM chips themselves may flip from ones to zeros (or vice versa) due to stray radiation, such as cosmic rays. If that seems hard to believe, remember that a DRAM bit is a one or zero based on the tiniest of charges on a capacitor that is constantly discharging (hence the need for DRAM “refresh”), so even a small spike of extraneous radiation (due to a solar flare, for example) can cause a soft memory error.
Memory errors certainly don’t occur often, but how are we defining “often”? Let’s look at some statistical data, much of which comes from big datacenter operators like Google and Amazon — proprietors with a huge vested interest in ascertaining the probabilities as accurately as possible. Toward that end, and in conjunction with the University of Toronto, Google undertook one of the most exhaustive studies ever done on memory errors in real-world conditions. The company’s seminal “DRAM Errors in the Wild: A Large-Scale Field Study” yielded some valuable and surprising statistics. For example, up to 8% of DIMMs experienced some type of memory failure per year, and between 12 and 45% of Google’s own machines experienced at least one DRAM error per year.
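To get a feel for what those per-machine rates could mean for an office full of workstations, here is a rough back-of-the-envelope sketch in Python. It takes the 12% and 45% figures quoted above and assumes, purely for illustration, that each machine is affected independently and that rates measured on Google's servers carry over to desktop workstations. Neither assumption comes from the study, and the function name is hypothetical.

```python
def p_at_least_one(p_machine: float, n_machines: int) -> float:
    """Chance that at least one of n machines sees a DRAM error in a year.

    Assumption (not from the study): every machine is affected
    independently, each with the same yearly probability p_machine.
    """
    return 1.0 - (1.0 - p_machine) ** n_machines


# Low and high ends of the per-machine range quoted in the article.
LOW, HIGH = 0.12, 0.45

for n in (1, 5, 20):
    low = p_at_least_one(LOW, n)
    high = p_at_least_one(HIGH, n)
    print(f"{n:>2} machines: {low:.0%} to {high:.0%} chance of at least one error per year")
```

Under these assumptions the odds climb quickly with the number of seats, which is why a rate that looks negligible for one machine deserves a second look across a whole department.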
Because memory errors are rare, users and makers of virtually all consumer-oriented devices — and even the majority of corporate computers — don’t typically concern themselves with the risk. But workstation computing spaces are different, and the potential of even one catastrophic failure may be too much to bear for some customers in demanding and/or high-stakes applications. And that’s precisely why vendors of the systems built for such professionals — workstations — offer options to drive the risk of any catastrophic memory errors down to virtually zero.
Key Error-Mitigation Technologies in Workstations
Workstation OEMs support technologies to mitigate or eliminate the impact of both hard and soft memory errors. The first and most common line of defense in protecting memory data from corruption is error-correcting code (ECC) support, an option available in many server and workstation platforms. ECC can not only detect errors occurring in memory but also correct them. To do that, ECC schemes incorporate extra bits in a memory DIMM. Those bits are dedicated to storing a check code computed from the data, which allows quick detection of errors in the bits read and identifies which bit to correct.
While in theory a solution can be created to detect and correct any number of errors, the rarity of flipped bits combined with the extra cost and complexity limits the typical solution to single-error correction and double-error detection (SECDED). Such ECC DIMMs typically carry one extra DRAM chip per DIMM (nine chips instead of eight) for code storage, and must be supported by the processor’s memory controller to function. With all modern workstation and server central processing units (CPUs) now integrating memory controllers on-chip, that means support must be built into the CPU. That’s an important distinction to make, because Intel, for example, exposes ECC in its workstation/server-focused Xeon platforms, but not in its consumer/corporate-focused Core platforms.
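For readers curious how SECDED works in principle, the following Python sketch implements a textbook extended Hamming code on just four data bits. It is a toy illustration of the idea, not the circuit any particular memory controller uses; real ECC memory applies the same principle to much wider words in hardware. The function names and bit layout are choices made for this example.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Layout chosen for this sketch: index 0 holds an overall parity bit,
    indices 1, 2 and 4 hold Hamming check bits, indices 3, 5, 6 and 7
    hold the data bits.
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word) % 2  # makes the parity of all 8 bits even
    return word


def decode(word: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i  # XOR of the indices of all set bits
    parity_odd = sum(word) % 2 == 1
    fixed = list(word)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd parity: treat as a single flip. The syndrome is the index of
        # the flipped bit (0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        # The error is detected but cannot be repaired.
        return "uncorrectable", None
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data


word = encode(0b1011)
word[5] ^= 1         # simulate one flipped bit
print(decode(word))  # ('corrected', 11)
word[2] ^= 1         # simulate a second flipped bit
print(decode(word))  # ('uncorrectable', None)
```

A single flipped bit is silently repaired, while a double flip is flagged rather than passed along as good data. That second behavior is what guards against the credible-but-wrong results described earlier.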
DIMMs supporting single-bit ECC typically carry one extra DRAM chip per module. Image courtesy of Puget Systems.
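The nine-versus-eight chip count falls out of simple arithmetic, sketched below in Python. The sketch assumes a Hamming-based SECDED code, a 64-bit data path per DIMM, and DRAM chips that each contribute 8 bits. Those are common figures, but they are assumptions here rather than details given above, and the function name is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for a Hamming-based SECDED code over data_bits.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one additional overall parity bit adds double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


DATA_BITS = 64      # assumed data width of a DIMM
BITS_PER_CHIP = 8   # assumed width contributed by each DRAM chip

check = secded_check_bits(DATA_BITS)
chips = (DATA_BITS + check) // BITS_PER_CHIP
print(f"{check} check bits, {DATA_BITS + check} bits total, {chips} chips")
```

With these assumptions, 64 data bits need 8 check bits, for 72 bits in all: exactly one more 8-bit chip than a non-ECC module.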