The Hidden Danger of Memory Errors in CAD Computing

30 Jan, 2018 By: Alex Herrera

Herrera on Hardware: They may arise infrequently, but the consequences of small memory errors can be huge. How can you reduce the chance of errors — and should you bother?

Hard Memory Failures Require Different Approaches

ECC is an effective tool for correcting the rare flipped bit, whether the flip occurs in storage or en route between processor and memory, but it isn’t the best approach for persistent, or hard, errors, especially multiple errors from the same physical area in DRAM. Other approaches can mitigate such hard failures with minimal downtime by detecting and isolating the physical memory address ranges found to be the source of the failure. Post-package repair (PPR) and memory quarantining are two such technologies that can keep a workstation, and your workflow, running despite a hard memory failure.

PPR provides real-time DRAM repair … to a point. The internal structure of DRAM chips offers a clever way to remedy hard memory errors without a service call or even powering down the system. Each DRAM chip’s memory cells are organized in an array of rows and columns. Current JEDEC memory standards include a feature called post-package repair (PPR), which relies on one to several spare, redundant rows per memory bank (for these purposes, think of a bank as a sizable chunk within the overall array). If the system detects a persistent failure at a specific memory address (corresponding to a cell or row of cells), it can issue a command cycle to the memory to swap out the bad row and replace it with one of the spare rows, if one is available.

With PPR, Samsung estimates that roughly half of memory function failures can be repaired, at least on the first occurrence in that bank. Intel’s reliability, availability, and serviceability (RAS) feature set includes support for PPR, available in the company’s Xeon Scalable and Xeon W platforms — which are driving the majority of workstations bought and used to run today’s CAD workflows.
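To make the row-swap idea concrete, here is a minimal toy model in Python. It is only a sketch of the bookkeeping, not how a DRAM device or memory controller actually implements PPR; the `Bank` class, its method names, and the single spare row per bank are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bank:
    """Toy model of one DRAM bank with a small pool of spare rows."""
    spare_rows: int = 1                        # assumed: one spare row per bank
    remap: dict = field(default_factory=dict)  # failing row -> spare row index

    def repair(self, bad_row: int) -> bool:
        """Try to swap a failing row for a spare; return False if none remain."""
        if bad_row in self.remap:
            return True                        # this row was already repaired
        if len(self.remap) >= self.spare_rows:
            return False                       # spares exhausted, repair not possible
        self.remap[bad_row] = len(self.remap)  # claim the next unused spare
        return True

    def resolve(self, row: int):
        """Return ('spare', i) for a remapped row, otherwise ('main', row)."""
        if row in self.remap:
            return ("spare", self.remap[row])
        return ("main", row)


bank = Bank(spare_rows=1)
first = bank.repair(1042)   # first hard failure in this bank
second = bank.repair(77)    # second failure in the same bank
```

In this model the first repair succeeds and accesses to row 1042 are redirected to the spare, while the second repair is refused because the bank has no spare left. That is the "to a point" caveat: the repair budget per bank is small and finite.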

PPR support in DDR memory can swap a failing row with a functional “spare” row. Image courtesy of Samsung.

Memory isolation and quarantine. Normally, a hard device failure will make the entire memory subsystem unusable, but a tool such as PPR can allow the computer to maintain functionality despite isolated failures within the memory structures. Memory quarantining approaches, such as Dell’s Reliable Memory Technology (RMT), work toward the same goal, but they do so in a different way. Supported on all of the company’s Precision workstations, RMT salvages the healthy areas of memory — usually the vast majority — by essentially quarantining only the defective area. Should RMT detect a persistent memory error, it identifies the defective memory and maps it out of usable system memory. The system can then continue to run normally, though the user will have to tolerate a smaller physical memory footprint. And if the reduction is severe enough to impact performance, replacing a DIMM will still be warranted. Regardless, RMT will keep the system running longer than it otherwise would, reducing costly downtime.
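The quarantining approach can be sketched in the same spirit. The Python below is a hypothetical illustration of mapping a faulty address range out of usable memory; it does not describe Dell's actual RMT implementation, and the `MemoryMap` class, the 4 KB page granularity, and the example addresses are all assumptions.

```python
PAGE_SIZE = 4096  # bytes; assumed quarantine granularity for this sketch


class MemoryMap:
    """Toy model of system memory with defective pages mapped out."""

    def __init__(self, total_bytes: int):
        self.total_pages = total_bytes // PAGE_SIZE
        self.quarantined = set()  # page numbers excluded from use

    def quarantine(self, start_addr: int, length: int) -> None:
        """Exclude every page that overlaps the faulty address range."""
        if length <= 0:
            raise ValueError("length must be positive")
        first = start_addr // PAGE_SIZE
        last = min((start_addr + length - 1) // PAGE_SIZE, self.total_pages - 1)
        self.quarantined.update(range(first, last + 1))

    def usable_bytes(self) -> int:
        """Memory still available after quarantined pages are removed."""
        return (self.total_pages - len(self.quarantined)) * PAGE_SIZE

    def is_usable(self, addr: int) -> bool:
        """True if the address is in range and not in a quarantined page."""
        page = addr // PAGE_SIZE
        return 0 <= page < self.total_pages and page not in self.quarantined


mem = MemoryMap(16 * 2**30)           # a 16 GiB system, for illustration
mem.quarantine(0x1234_5000, 8192)     # hypothetical persistent-error range
lost = 16 * 2**30 - mem.usable_bytes()
```

Here only the two pages covering the faulty range are lost, and the rest of the 16 GiB stays in service. The trade-off matches the one described above: the system keeps running, but with a slightly smaller memory footprint.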

Dell’s Reliable Memory Technology (RMT) detects persistent, or hard, memory errors and keeps the workstation functional. Image courtesy of Dell.
