The Hidden Danger of Memory Errors in CAD Computing
31 Jan, 2018 By: Alex Herrera
Herrera on Hardware: They may arise infrequently, but the consequences of small memory errors can be huge. How can you reduce the chance of errors — and should you bother?
Hard Memory Failures Require Different Approaches
While it is an effective tool capable of correcting the rare flip of a bit (occurring in storage or en route between processor and memory), ECC isn’t the best approach to address persistent, or hard, errors, especially multiple errors from the same physical area in DRAM. Other approaches, however, can mitigate such hard failures with minimal downtime by detecting and isolating the physical memory address ranges found to be the source of the failure. Post-package repair (PPR) and memory quarantining are two such technologies that can keep a workstation, and your workflow, running despite a hard memory failure.
PPR provides real-time DRAM repair … to a point. The internal structure of DRAM chips offers a clever way to remedy hard memory errors without a service call, or even powering down the system. Each DRAM chip’s memory cells are organized in an array of rows and columns. Current JEDEC memory standards define a feature called post-package repair (PPR), which relies on one or more spare, redundant rows in each memory bank (for these purposes, think of a bank as a sizable chunk within the overall array). If the system detects a persistent failure at a specific memory address (corresponding to a cell or row of cells), it can issue a command cycle to the memory that swaps out the bad row and replaces it with one of that bank’s spare rows, if one is still available.
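To make the row-swap idea concrete, here is a minimal Python sketch of a single bank with a pool of spare rows. It is a toy model only, not the JEDEC command sequence or any vendor's implementation; the `Bank` class, its method names, and the row counts are all hypothetical and chosen purely for illustration.

```python
class Bank:
    """Toy model of one DRAM bank with a small pool of spare rows (illustrative only)."""

    def __init__(self, rows, spare_rows=1):
        self.rows = rows
        # Spare rows are modeled as extra physical rows placed after the normal ones.
        self.free_spares = list(range(rows, rows + spare_rows))
        # Maps a failing logical row to the spare physical row that replaced it.
        self.remap = {}

    def physical_row(self, logical_row):
        """Return the physical row that currently backs a logical row."""
        return self.remap.get(logical_row, logical_row)

    def repair(self, logical_row):
        """Swap a failing row for a spare. Returns False when no spare is left."""
        if not self.free_spares:
            return False
        self.remap[logical_row] = self.free_spares.pop(0)
        return True


bank = Bank(rows=8, spare_rows=1)
bank.repair(3)          # first hard failure in this bank: repaired
bank.physical_row(3)    # now served by the spare, physical row 8
bank.repair(5)          # second failure: returns False, the only spare is used up
```

With a single spare per bank, the second failure in the same bank cannot be repaired, which is the "to a point" limitation in a nutshell.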
With PPR, Samsung estimates that roughly half of memory function failures can be repaired, at least on the first occurrence in that bank. Intel’s reliability, availability, and serviceability (RAS) feature set includes support for PPR, available in the company’s Xeon Scalable and Xeon W platforms — which are driving the majority of workstations bought and used to run today’s CAD workflows.
PPR support in DDR memory can swap a failing row with a functional “spare” row. Image courtesy of Samsung.
Memory isolation and quarantine. Normally, a hard device failure will make the entire memory subsystem unusable, but a tool such as PPR can allow the computer to maintain functionality despite isolated failures within the memory structures. Memory quarantining approaches, such as Dell’s Reliable Memory Technology (RMT), work toward the same goal, but they do so in a different way. Supported on all of the company’s Precision workstations, RMT salvages the healthy areas of memory — usually the vast majority — by essentially quarantining only the defective area. Should RMT detect a persistent memory error, it identifies the defective memory and maps it out of usable system memory. The system can then continue to run normally, though the user will have to tolerate a smaller physical memory footprint. And if the reduction is severe enough to impact performance, replacing a DIMM will still be warranted. Regardless, RMT will keep the system running longer than it otherwise would, reducing costly downtime.
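The general idea of mapping a defective region out of usable memory can be sketched in a few lines of Python. This is an assumption-laden illustration of quarantining in general, not a description of how Dell's RMT works internally; the `quarantine` and `total_bytes` functions and the address values are hypothetical.

```python
def quarantine(usable, bad_start, bad_end):
    """Remove the half-open range [bad_start, bad_end) from a list of usable (start, end) ranges."""
    result = []
    for start, end in usable:
        # Range does not touch the defective area: keep it unchanged.
        if bad_end <= start or bad_start >= end:
            result.append((start, end))
            continue
        # Keep whatever lies below and above the defective area.
        if start < bad_start:
            result.append((start, bad_start))
        if bad_end < end:
            result.append((bad_end, end))
    return result


def total_bytes(ranges):
    """Total usable memory across all ranges."""
    return sum(end - start for start, end in ranges)


usable = [(0, 1024)]                     # hypothetical 1,024-byte memory map
usable = quarantine(usable, 256, 320)    # map out a 64-byte defective region
# usable is now [(0, 256), (320, 1024)] and total_bytes(usable) is 960
```

The healthy ranges on either side of the defect stay in service, and the only cost is a somewhat smaller memory footprint.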
Dell’s Reliable Memory Technology (RMT) detects persistent, or hard, memory errors and keeps the workstation functional. Image courtesy of Dell.