Workstation CPU Cores and Clocks: An Inconvenient Tradeoff
20 Aug, 2020 By: Alex Herrera
Herrera on Hardware: The inverse relationship between core count and frequency presents a tricky tradeoff to navigate when choosing a CAD workstation CPU.
Choosing the right CPU for a CAD workstation isn’t easy. For most of the products we shop for, we mainly concern ourselves with getting the most performance and features, while working within our budget. The proposition is both well understood and straightforward: the more you pay, the more you get of both. But it’s not that simple when it comes to narrowing down to the best processor to drive your CAD workflow.
The problem? First, there are two primary axes of performance to consider — core count and clock frequency — and second, they scale inversely with respect to each other, such that the more you have of one, the less you have of the other. That is, all else being roughly equal, the more cores you buy, the lower their operating frequency, and vice versa. More cores are better for some CAD workloads, but higher frequencies are better for others. Consequently, buyers looking to optimize productivity across their entire workflow face a dilemma.
We’ve looked at this dynamic before, for example in my 2015 piece titled, “More CPU Cores or Faster CPU Clocks?” And we’ve also explored how the CAD community has dealt with that tradeoff, often prioritizing frequency over core count to best handle the majority of their workflows. This time, we’ll take a slightly more quantitative look at where that tradeoff stands today. And we’ll also revisit that default (or at least common) choice to err in favor of higher GHz rather than more cores.
Multicore Was the Right Architectural Path to Take, But It’s No Cure-All
The multicore CPU era got its start in the mid-2000s, marking a sensible and essential turn from the exclusive focus on advancing superscalar techniques to drive computing throughput forward. Superscalar designs aim to squeeze as much parallelism as possible from a single thread of code, and the industry rode refinements to them for years before running into two major roadblocks: thermal limits and diminishing performance returns. With the low-hanging fruit of superscalar techniques long picked, architectures were getting incredibly complex yet yielding much more modest generation-to-generation returns. Worse, relying on ever-higher clock frequencies to achieve that goal was pushing thermal output to levels beyond the means to cool them.
By contrast, moving to multicore let engineers take each generation’s additional transistor budget to double core counts, thereby doubling the theoretical aggregate throughput at the same frequency while easing (albeit not nearly eliminating) thermal challenges. But as sensible an approach as multicore is, there’s a catch. While many multithread-friendly workloads can realize that theoretical level, or at least get close, for many other tasks the actual doesn’t track the theoretical.
This is due primarily to two reasons: Software places limits on parallelism, and ever-present thermal and electrical limits will always constrain clock frequency and voltage. On the software side, some code can be threaded effectively to make use of all that additional parallel horsepower, but some cannot. So the focus on replicating cores hasn’t been nearly as beneficial to a whole lot of legacy software, based on algorithms that remain fundamentally sequential in nature. Parametric modeling, the foundation of so many CAD workflows, as well as 3D graphics (though bear in mind those tend to be more limited by the GPU than the CPU) are classic examples. On the hardware side, the more chips grow in size, the more power consumption is concentrated on a single piece of monolithic silicon and the more difficult it is to dissipate the resulting heat; and the bigger the die, the more difficult it is to maintain adequate signal integrity. So as core counts rise, we see that inverse relationship, where the minimum guaranteed operating frequency (the base GHz) declines.
The Unfortunate Inverse Relationship: Core Count vs. Frequency
That’s made for a frustrating tradeoff for workstation professionals in two respects. One, performance for still-essential single-thread tasks gets penalized for having all those cores they don’t even use. And two, the more cores you buy to boost your heavily threaded workloads, the less each incremental core provides to boost your throughput — another example of diminishing returns.
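One common way to picture that second kind of diminishing return is Amdahl's law, which is not something the benchmarks in this column rely on but is a standard textbook model. The sketch below assumes a hypothetical workload in which 90% of the runtime parallelizes perfectly; both the function name and that 90% figure are illustrative assumptions, and the model ignores the clock-frequency penalty entirely.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over a single core under Amdahl's law.

    parallel_fraction is the share of runtime that can be spread across
    cores; the remainder is assumed to run on one core no matter what.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workload (an assumption, not measured data): 90% parallel.
PARALLEL_FRACTION = 0.90

for cores in (4, 8, 12, 18, 28, 56):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    # Per-core efficiency: how much of each purchased core is actually realized.
    print(f"{cores:>2} cores: {speedup:5.2f}x speedup, "
          f"{speedup / cores:6.1%} per-core efficiency")
```

Even in this idealized model, where every core runs at the same clock, each added core contributes less than the one before it. Falling base frequencies only add to that.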
I charted the base frequencies for popular workstation CPUs, both for mainstream single-socket machines — Intel’s Xeon W-2200 family — and for high-end dual-socket machines, looking at Intel’s second-generation premium Xeon Scalable Gold line. In both cases, I charted the SKUs that represent the highest base frequency at the given core count. The decline in base frequency is clear: It drops from 4.1 GHz at 4 cores (4C) down to 3.0 GHz at 18C, and further to 2.7 GHz at 28C.
Charting the core counts and base frequencies for Intel’s Xeon W-2200 and second-generation Xeon Scalable Gold 62xx, choosing the highest frequencies available in workstations at the given core count. (Data sources: Intel, Dell, HP, and Lenovo.)
Processor vendors have mitigated the frequency penalty to some degree with technologies like AMD Turbo Core and Intel Turbo Boost Technology (most recently advanced in Intel Turbo Boost Max Technology 3.0). While a technically elegant and absolutely sensible use of available, temporary thermal headroom, Turbo levels can’t last indefinitely, with the maximum duration a function of many situational factors. Short-term overdrives are of value in many workloads, but tend to be more useful when demand comes in bursts. In consistently heavy-duty CAD simulations or rendering, it’s the base frequency that users can count on, and the same can apply to extended high-demand single-thread processing.
Erring on the Side of Higher GHz
And there are the rock and the hard place that users find themselves stuck between. Relying on few-core processors is becoming a more and more problematic choice, since it can decimate performance for the increasing number of multithreaded to massively threaded applications that add value and drive up productivity for workflows dependent on design, modeling, rendering, analysis, and simulation. But users aren’t ready to watch the performance of their must-have single-thread workloads suffer as the price of significantly upgrading their core counts, either.
Though perhaps not privy to its extent, the CAD user base has long been aware of the core count versus frequency tradeoff. And with respect to both market sales metrics and anecdotal observations on what constitutes the prevailing shopping wisdom, it’s clear the mainstream of the market has erred toward frequency, today just slowly migrating from the ubiquitous quad-core models up to hex-core and 8-core SKUs. Pricing has, of course, helped keep the median core count down, because with the exception of the lunatic-fringe-frequency SKUs, costs rise along with core counts.
Now, sticking with fewer core CPUs can make sense, but it’s certainly worth doing so thoughtfully, with an understanding of both the severity of the tradeoffs and how much the downside to that tradeoff might hurt.
Benchmarking the 12C Xeon W versus 56C dual-socket Xeon Scalable
The core count versus frequency chart above hints at two conclusions: one, that single-thread performance will decline as core count increases, and two, that even multithread-capable workloads will see diminishing returns on throughput at higher core counts. But of course, validating those conclusions (and hopefully, measuring the impact to some degree) doesn’t come with qualitative arguments. It takes some experimentation with real systems. Fortunately, I had the opportunity to get my hands on two workstations to benchmark: one built around a 12-core Intel Xeon W-2265 and the other based on dual 28-core Intel Xeon Scalable Platinum 8280 for a total of 56 cores, representing the highest core count available today. Key system specs are shown below.
Key system specs for workstations tested.
I kicked off testing with what I think is the most accurate reflection of the type of workloads that professionals actually use, captured in SPEC’s SPECworkstation 3.0.4, specifically the CPU-specific workloads grouped in their Product Development test suite most relevant to CAD computing, in particular simulation (you’ll note the “CFD” suffix, referring to actual code in use for computational fluid dynamic analysis).
Normalizing results to the 12C Xeon W-2265 machine, we can see significant speedups for the dual Xeon Scalable Platinum 8280 machine, which averaged about 2.3 times faster. Also charted and worth noting, though, is the right-hand axis which charts that speedup relative to the theoretical speedup we’d see if performance scaled linearly with core count. The results show that the efficiency of the speedup is closer to 50%. That is, we’re seeing a return of about 50% on all those extra cores we purchased. Not bad — approaching 100% would be an unrealistic goal — but the diminishing returns are undeniable.
SPECworkstation 3.0.4 Product Development results, normalized to the Xeon W-2265 machine.
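The efficiency figure quoted for the SPECworkstation results is simple arithmetic: the measured speedup divided by the speedup we would get if performance scaled linearly with core count. Here is a minimal sketch of that calculation, using the 2.3X average and the 12-core and 56-core counts from the text; the function name is just illustrative.

```python
def scaling_efficiency(measured_speedup: float, base_cores: int, scaled_cores: int) -> float:
    """Fraction of the linear-by-core-count speedup that was actually realized."""
    linear_speedup = scaled_cores / base_cores
    return measured_speedup / linear_speedup


# Figures from the text: 12C Xeon W-2265 versus dual 28C (56C) Xeon Scalable Platinum 8280.
BASE_CORES = 12
SCALED_CORES = 56
MEASURED_SPEEDUP = 2.3  # average SPECworkstation Product Development speedup

linear = SCALED_CORES / BASE_CORES
efficiency = scaling_efficiency(MEASURED_SPEEDUP, BASE_CORES, SCALED_CORES)
print(f"Linear (theoretical) speedup: {linear:.2f}x")
print(f"Measured speedup:             {MEASURED_SPEEDUP:.2f}x")
print(f"Scaling efficiency:           {efficiency:.0%}")
```

With 56 cores against 12, the linear ceiling is roughly 4.7X, so a 2.3X result works out to just under half of it.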
Secondarily, I ran both Cinebench R20 and PassMark PerformanceTest 10.0’s CPU Mark, both of which are also designed to focus the majority of the stress on the processor (rather than other system components). Presenting the data in similar fashion, the dual-socket 56C Xeon Scalable workstation outperformed the 12C Xeon W-2265 by around 3.2X overall on both Cinebench and the CPU Mark, and a bit better on some of the more workstation relevant subtests shown (e.g., Physics, Floating Point Math, and SSE). But again, the diminishing returns are there, as the CPU Mark came in at about 68% of the theoretical linear-by-core-count gain, while Cinebench actual gains were closer to 50% of the theoretical. (It’s also worth noting that the dual-socket Xeon Scalable machine has the benefit of much greater memory size and bandwidth, and that will help its cause somewhat.)
The tests showed surprisingly tight correlation on speedups, with SPECworkstation 3.0.4 Product Development, PassMark CPU Mark, and Cinebench R20 showing overall speedups of 2.3 to 3.5X. And clearly, even with multithread-friendly workloads, adding cores does not scale at the theoretical, linear rate by core count. On the contrary, and particularly when moving up to the maximum of 56 cores in a dual-socket machine, the returns diminish. Certainly, a key factor in those diminishing returns is the decline in clock rates, with base clocks dropping from 3.5 GHz to 2.7 GHz.
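To get a rough sense of how much of the shortfall the clock drop alone could explain, we can scale the linear core-count ratio by the ratio of base clocks. This is a back-of-the-envelope sketch, not a performance model: it assumes throughput is proportional to cores times base frequency and ignores Turbo behavior, memory bandwidth, and architectural differences between the two CPUs. The function name is hypothetical.

```python
def clock_adjusted_ceiling(base_cores: int, base_ghz: float,
                           scaled_cores: int, scaled_ghz: float) -> float:
    """Speedup ceiling if throughput scaled with (core count x base clock).

    Simplifying assumption: every core delivers work in proportion to its
    base frequency, and nothing else limits scaling.
    """
    core_ratio = scaled_cores / base_cores
    clock_ratio = scaled_ghz / base_ghz
    return core_ratio * clock_ratio


# Core counts and base clocks quoted in the text for the two test systems.
ceiling = clock_adjusted_ceiling(base_cores=12, base_ghz=3.5,
                                 scaled_cores=56, scaled_ghz=2.7)
print(f"Core-count-only ceiling: {56 / 12:.2f}x")
print(f"Clock-adjusted ceiling:  {ceiling:.2f}x")
```

Under those assumptions, the ceiling falls from about 4.7X to about 3.6X, which sits just above the upper end of the measured 2.3 to 3.5X range.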
Furthermore, users will have their own differing perspectives on how valuable that 2.3 to 3.5X gain is, moving up from that modest 12C Xeon W-2265 system to the ultra-high-end dual Xeon Scalable Platinum 8280 machine. And bear in mind, realizing that gain is a costly proposition, as both the dual-socket models and Xeon Scalable CPUs come at a premium, and like the performance scaling, the price tags don’t rise linearly. For some, the price adder is far less of an issue, or even inconsequential, because the job demands the highest performance, period. For them, price-performance is a moot point if maximum performance isn’t there.
But — and here’s the rub — not only do you have to give up a lot more dollars to get average gains of 3X or so, but with that drop in clock rate, you also have to give up some single-thread performance. How much? To gain some insight, I also ran Cinebench R20 and PassMark PerformanceTest 10.0 CPU Mark on both machines, but with execution limited to a single thread (1T, which is not possible on SPECworkstation 3.0.4). Constrained to one thread, the tables were turned, as the expensive, massively core’d dual Xeon Scalable Platinum workstation trailed the more modest 12C Xeon W-2265 machine by 18% and 12%, respectively. So yes, that tradeoff is there and can be significant, depending on your workload priorities and your buying dollars.
Single-thread (1T) test results for similarly configured systems.
Multicore and Superscalar Architectures Won’t Be Enough Moving Forward … Neither Will Sticking with Few-Core CPUs
It’s worth emphasizing that while the Xeon W-2265 is fairly mainstream, the dual Xeon Scalable Platinum workstation is most certainly not. It’s not something that would be on the radar for 99% of buyers. And the analysis here wasn’t meant to single out that platform, but rather to use it as an upper-end limit on that tradeoff of multi-thread scale-up and single-thread scale-down.
As the data shows, that tradeoff is real, and it presents a non-trivial decision point for CAD workstation buyers. It makes it easy to understand why that de facto rule of thumb to “take the GHz over the core count” carries weight in mainstream CAD circles. If your day is mostly about modeling with interactive graphics, with scant few other computing tasks, then why spend more to possibly give up performance on your bread-and-butter workflow? But that prevailing wisdom certainly doesn’t apply to all (consider those tasked with extensive simulation, rendering, and analyses), and I’d argue it will apply less and less in the future. While it of course depends on the extent of your specific workflow, for many, those workflows will evolve moving forward to take on more and more compute-intensive multi-threaded workloads. So while four- and six-core CPUs might satisfy the bulk of the CAD community today, I expect that median to rise over time.
Fortunately, CAD users aren’t the only ones aware of the seeming paradox of CPU performance due to the competing benefits of core count and frequency. CPU vendors and OEMs are aware as well, and looking to soften the tradeoffs and get closer to offering best-of-both-worlds computing with a more equitable speedup of applications across the breadth of CAD workflows.