The Hidden Danger of Memory Errors in CAD Computing

31 Jan, 2018 By: Alex Herrera

Herrera on Hardware: They may arise infrequently, but the consequences of small memory errors can be huge. How can you reduce the chance of errors — and should you bother?

Hard Memory Failures Require Different Approaches

ECC is effective at correcting the rare flipped bit, whether the flip occurs in storage or en route between processor and memory. It is not the best way to handle persistent, or hard, errors, especially multiple errors from the same physical area of DRAM. Other approaches can mitigate such hard failures with minimal downtime by detecting and isolating the physical memory address ranges found to be the source of the failure. Post-package repair (PPR) and memory quarantining are two such technologies that can keep a workstation, and your workflow, running despite a hard memory failure.

PPR provides real-time DRAM repair … to a point. The internal structure of DRAM chips offers a clever way to remedy hard memory errors without a service call or even powering down the system. Each DRAM chip’s memory cells are organized in an array of rows and columns. Modern JEDEC memory standards define a feature called post-package repair (PPR), which provides one to several spare, redundant rows per memory bank (for these purposes, think of a bank as a sizable chunk within the overall array). If the system detects a persistent failure at a specific memory address (corresponding to a cell or row of cells), it can issue a command cycle to the memory that swaps out the bad row and replaces it with one of the bank's spare rows, if one is still available.
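The row-swap idea can be illustrated with a small toy model. This is only a sketch of the bookkeeping, not how a DRAM device or memory controller actually implements PPR: the `Bank` class, its `repair` and `resolve` methods, and the default of one spare row per bank are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bank:
    """Toy model of one DRAM bank: regular rows plus a few spare rows."""
    num_rows: int
    spare_rows: int = 1  # assumed spare count; real devices vary
    remap: dict = field(default_factory=dict)  # failing row -> spare index

    def repair(self, bad_row: int) -> bool:
        """Swap a failing row for a spare; return False if none is left."""
        if not 0 <= bad_row < self.num_rows:
            raise ValueError("row out of range")
        if bad_row in self.remap:
            return True  # this row was already repaired
        if len(self.remap) >= self.spare_rows:
            return False  # spares exhausted; repair is not possible
        self.remap[bad_row] = len(self.remap)
        return True

    def resolve(self, row: int) -> tuple:
        """Report which physical row serves an access to `row`."""
        if row in self.remap:
            return ("spare", self.remap[row])
        return ("main", row)


bank = Bank(num_rows=1024, spare_rows=1)
first = bank.repair(17)   # uses the only spare row in this bank
second = bank.repair(42)  # fails: no spare rows remain
```

In this model the first repair in a bank succeeds and the second does not, which mirrors the "to a point" caveat: once a bank's spare rows are used up, further hard failures there need a different remedy.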

With PPR, Samsung estimates that roughly half of memory function failures can be repaired, at least on the first occurrence in that bank. Intel’s reliability, availability, and serviceability (RAS) feature set includes support for PPR, available in the company’s Xeon Scalable and Xeon W platforms — which are driving the majority of workstations bought and used to run today’s CAD workflows.

PPR support in DDR memory can swap a failing row with a functional “spare” row. Image courtesy of Samsung.

Memory isolation and quarantine. Normally, a hard device failure will make the entire memory subsystem unusable, but a tool such as PPR can allow the computer to maintain functionality despite isolated failures within the memory structures. Memory quarantining approaches, such as Dell’s Reliable Memory Technology (RMT), work toward the same goal, but they do so in a different way. Supported on all of the company’s Precision workstations, RMT salvages the healthy areas of memory — usually the vast majority — by essentially quarantining only the defective area. Should RMT detect a persistent memory error, it identifies the defective memory and maps it out of usable system memory. The system can then continue to run normally, though the user will have to tolerate a smaller physical memory footprint. And if the reduction is severe enough to impact performance, replacing a DIMM will still be warranted. Regardless, RMT will keep the system running longer than it otherwise would, reducing costly downtime.
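The quarantining idea amounts to removing a defective address range from the set of ranges the system is allowed to use. The sketch below shows that bookkeeping in general terms. It is not Dell's RMT implementation; the function names, the half-open `(start, end)` range convention, and the 16GB and 2MB sizes are assumptions made for illustration.

```python
GIB = 1 << 30
MIB = 1 << 20


def quarantine(usable, bad_start, bad_end):
    """Remove [bad_start, bad_end) from a list of usable (start, end) ranges."""
    result = []
    for start, end in usable:
        if bad_end <= start or bad_start >= end:
            result.append((start, end))  # no overlap; keep the range whole
            continue
        if start < bad_start:
            result.append((start, bad_start))  # healthy part below the defect
        if bad_end < end:
            result.append((bad_end, end))  # healthy part above the defect
    return result


def usable_bytes(ranges):
    """Total number of bytes still available to the system."""
    return sum(end - start for start, end in ranges)


# Hypothetical system: 16GB of memory with a defective 2MB region at 4GB.
memory_map = [(0, 16 * GIB)]
memory_map = quarantine(memory_map, 4 * GIB, 4 * GIB + 2 * MIB)
remaining = usable_bytes(memory_map)
```

Here the map splits into two healthy ranges and the system gives up only the 2MB it quarantined, which is the trade the article describes: a slightly smaller memory footprint in exchange for continued operation.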

Dell’s Reliable Memory Technology (RMT) detects persistent, or hard, memory errors and keeps the workstation functional. Image courtesy of Dell.


