The Traditional Computer Memory Hierarchy — and Its Impact on CAD Performance — Are Evolving
19 Mar, 2020
By: Alex Herrera
Herrera on Hardware: The basic tenets of the tried-and-true memory hierarchy still apply, but it’s worth knowing how recent disruptors can improve performance — and perhaps shift the traditional balance.
Most of us know that the purpose of a computer’s memory (also commonly referred to as DRAM) is to store data and instructions, both for the operating system (e.g., Windows) and for our applications (e.g., SolidWorks). An application’s instructions are run by the processor, which reads and writes that data to and from memory as we load, design, and store our CAD models.
But fewer people realize that what we refer to as memory or DRAM is actually part of a larger, hierarchical data storage chain. DRAM is a critical component of that hierarchy, and one we heavily focus on when configuring and purchasing a new workstation, as insufficient memory capacity or bandwidth can cripple performance, throttling it well below what the system’s otherwise high-powered CPU or graphics processing unit (GPU) is capable of. But it’s not the only critical performance component, so understanding how it both supports and is affected by other subsystems, most notably storage drives, is helpful in determining the type and speed of memory best suited to your CAD workstation.
This month, I’d like to introduce the concepts and tradeoffs of the different layers in the hardware memory hierarchy, and explain how their relative sizes can help (or hinder) performance on your CAD workloads. For now, we’re looking at this qualitatively, but I’m also planning a future column with more real-world quantitative metrics to help you dial in the CPU, memory, and storage options of your next workstation configuration to best streamline your workflow.
Why the Memory Hierarchy Exists, and How Its Structure Affects CAD Performance
With the exception of the computation that goes into 3D graphics, which is mostly the burden of the workstation’s GPU, all code execution (for your CAD application as well as all the OS and user-processing overhead) falls to the CPU. And ultimately, how fast that execution completes is primarily a function of two limits: first, how fast the CPU’s internal datapaths and execution units can process instructions and perform the indicated mathematical and logical operations, and second, how well the system’s memory and storage subsystems can load and store data to and from the CPU to keep that instruction stream supplied. And while we often focus on the GHz rating and core count of the CPU model we spec in our workstations, how well your machine handles that second limit can often matter just as much in determining your ultimate computing throughput.
Now, when we talk about a computer’s memory subsystem, or perhaps more appropriately, when we consider the combination of memory and the storage drives supporting that memory, it’s worth understanding the multitiered makeup of that combination. That is, the computer components that might respond to a CPU’s call to read or write data are varied and structured in such a way as to complement each other and to provide the best balance of performance, cost, and practicality. All applications, but especially those with heftier dataset sizes processing concurrently across multiple cores — attributes more common to CAD applications than most others — benefit from quick-response, high-bandwidth service of data, because the less often busy CPU cores wait for data, the faster your workload gets processed. But as you might expect, the quicker the response and the higher the bandwidth of the device and overall subsystem, the more costly that subsystem becomes.
The long-standing and still-faithful representation of that hierarchy is a pyramid, where the highest-performance, highest-cost, and lowest-capacity (typically cache) element sits at the top, and conversely the lowest-performance, lowest-cost, and highest-capacity element sits at the bottom. Any time an instruction attempts to access data (for example, one of many parameters from your CAD models) that is not currently resident at one level, the data must be fetched from one or more levels below, which will be slower, both in terms of how long it takes to get the first piece of data (the latency) and the subsequent rate of data received after that first piece (the bandwidth).
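To make the distinction between latency and bandwidth concrete, here is a minimal Python sketch that estimates the time for a single fetch from each tier as latency plus size divided by bandwidth. The tier figures are rough, order-of-magnitude assumptions chosen purely for illustration, not measurements of any particular workstation, and the names Tier and fetch_time are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    latency_s: float      # time until the first byte arrives, in seconds
    bandwidth_bps: float  # sustained transfer rate after that, in bytes/second


def fetch_time(tier: Tier, size_bytes: int) -> float:
    """Simple model: total time = latency + size / bandwidth."""
    return tier.latency_s + size_bytes / tier.bandwidth_bps


# ASSUMED, order-of-magnitude values for illustration only,
# listed from the top of the pyramid to the bottom.
TIERS = [
    Tier("cache (SRAM)", 1e-9, 500e9),
    Tier("DRAM", 100e-9, 25e9),
    Tier("SSD", 100e-6, 3e9),
    Tier("hard disk", 10e-3, 150e6),
]


def main() -> None:
    # A cache-line-sized read, a 4 KiB page, and a 64 MiB chunk of model data.
    for size in (64, 4096, 64 * 1024 * 1024):
        print(f"fetch of {size} bytes:")
        for tier in TIERS:
            microseconds = fetch_time(tier, size) * 1e6
            print(f"  {tier.name:<14}{microseconds:>16.3f} us")


if __name__ == "__main__":
    main()
```

Under these assumed figures, the latency term dominates small requests and the bandwidth term dominates large transfers, and in both cases every step down the pyramid costs more time.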
But why a hierarchy and pyramid? Why not just populate all the memory you’ll need with the fastest possible option? The answer is pragmatic, from both a dollars and an engineering standpoint: Costs rise going up the pyramid, not linearly but exponentially, and so do the practical engineering obstacles to reaching maximum theoretical performance.
The traditional memory hierarchy pyramid.
And therein lies the domain of a huge amount of engineering and design history: how best to place data so that, as often as possible, it is immediately available and, if not, takes the least amount of time to retrieve. We’re not going to get into the esoterica of the many clever techniques used to accomplish that (e.g., cache types, organizations, prefetching). Instead, we’ll focus on how the characteristics of performance, cost, and capacity can help indicate a good balance of DRAM and storage types and sizes when building a workstation for CAD.
The pinnacle of that pyramid, cache, ends up being generally beyond the purview of CAD users looking to customize their machines. Cache is integrated as static RAM (SRAM) in the CPU silicon and is therefore fixed per CPU model, so we all rely on the vendors (Intel and AMD) to choose how best to balance the CPU’s throughput capabilities (i.e., the number and speed of cores, and the microarchitecture) with cache size and organization. But there are user-configurable choices in type and size for the lower tiers of that pyramid, with respect to both DRAM and storage drives.
The run-time workhorse of the pyramid is system memory, also known as DRAM or just memory. Dynamic random access memory (DRAM) is just one type of memory device, but today it happens to be the ubiquitous type making up system memory, which is why, for many, DRAM has become a synonym for memory. Attempt to access a page of data not currently stored in DRAM, and your computing throughput will likely experience a hiccup similar to a cache miss (the pipeline stalls while the data is fetched), but via a different mechanism with much longer latency. Where an access that misses in cache causes the data to be retrieved from the next level down, DRAM, a page fault initiates the retrieval of a page from system storage below that. Sizing DRAM appropriately to minimize page faults is one of the most important ways to tune a data-intensive workflow.
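As a rough illustration of why DRAM sizing matters, the Python sketch below counts page faults for a toy workload that sweeps repeatedly over a fixed working set of pages. It assumes a pure least-recently-used (LRU) replacement policy and a strictly cyclic access pattern; real operating systems use approximations of LRU, and real CAD workloads are far less regular, so treat it as a simplified model. The function names count_page_faults and cyclic_trace are hypothetical.

```python
from collections import OrderedDict


def count_page_faults(accesses, resident_pages):
    """Count faults for a memory that holds `resident_pages` pages under LRU."""
    memory = OrderedDict()  # keys are page numbers, ordered oldest to newest use
    faults = 0
    for page in accesses:
        if page in memory:
            # Hit: mark the page as most recently used.
            memory.move_to_end(page)
        else:
            # Miss: the page must come from storage (a page fault).
            faults += 1
            memory[page] = True
            if len(memory) > resident_pages:
                # Evict the least recently used page to make room.
                memory.popitem(last=False)
    return faults


def cyclic_trace(working_set_pages, passes):
    """ASSUMED toy workload: sweep pages 0..N-1 in order, `passes` times."""
    return [page for _ in range(passes) for page in range(working_set_pages)]


def main() -> None:
    trace = cyclic_trace(working_set_pages=100, passes=5)
    for capacity in (50, 99, 100, 200):
        faults = count_page_faults(trace, capacity)
        print(f"memory holds {capacity:>3} pages: {faults} faults "
              f"in {len(trace)} accesses")


if __name__ == "__main__":
    main()
```

In this toy model, a memory that holds all 100 pages of the working set faults only on the first touch of each page (100 faults in 500 accesses), while a memory just one page short faults on every access (500 faults), because LRU always evicts the page that the sweep needs next. Real workloads degrade less abruptly, but the sketch shows why falling slightly short of the working set can cost far more than the shortfall suggests.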