With New Turing, NVIDIA Doubles Down on the Future of Real-Time Ray-Tracing
18 Oct, 2018 By: Alex Herrera
Herrera on Hardware: CAD professionals are expected to reap the benefits.
NVIDIA's long-expected successor to its Pascal GPU architecture for gaming and professional graphics is here. In August, the GPU developer pulled the covers off Turing, which one could argue succeeds not one but both of its preceding generations of graphics processing units: 2016's Pascal and 2017's Volta. In the process, the company confirmed several of the more expected 3D graphics advancements for its next flagship GPU. But it also revealed a few surprises, representing an aggressive but justified departure from past generations' decisions about how it formulates products that today are destined for a far wider spectrum of applications than those of years past.
Perhaps even more significant is the inflection point in GPU evolution that Turing marks: a unification, if not permanent and all-encompassing then at the very least meaningful, of the previously disparate and often conflicting priorities between the GPU's traditional 3D graphics markets and the hot emerging opportunities attracting the company's attention. With Turing, NVIDIA confirms two realizations very much reflected in the GPU's DNA: one, that machine learning is now a valid and justified tool to enhance 3D visual computing; and two, that the time is ripe to begin the long-awaited transition from 3D raster graphics to the ultimate in rendering, real-time ray-tracing.
Turing, RTX, and NGX
NVIDIA is finding more ways to leverage machine learning to improve performance and quality for traditional 3D graphics.
It's been a while since NVIDIA shaped new GPU architectures and technology strictly for the benefit of traditional raster-based 3D graphics that CAD applications and users have primarily relied upon. Over the past decade, NVIDIA GPUs have pushed well beyond that core space and into high-performance computation ("compute"), autonomous vehicles, robotics, supercomputing, and now, front-and-center, machine learning. And each new generation has walked a careful balance, supporting new applications without handicapping the GPU for its bread-and-butter 3D graphics markets.
With Turing, NVIDIA made many of the more conventional improvements to its fundamental 3D graphics programmable shader engine, the Streaming Multiprocessor (SM), especially in terms of critical resources like chip registers and cache, and dialed up supporting infrastructure including external memory bandwidth — all good things that contribute to faster, higher-quality interactive 3D graphics crucial to improving the CAD experience and productivity. But those tweaks represent the more expected steps along the tried-and-true GPU evolution path, taking on cost and complexity for features and performance the company is pretty darn sure ISVs and end users alike will value in the near term, if not immediately. More noteworthy than the more conventional 3D graphics features Turing added is what it didn't subtract from the company's previous compute/artificial intelligence (AI)–focused GPU, Volta. With Turing, NVIDIA architects not only didn't strip out Volta's Tensor Cores, they improved on them — and doubled down on the pursuit of real-time rendering by boosting ray-trace-specific acceleration.
Tensor Cores accelerated AI for ray-tracing, and now they speed conventional 3D raster graphics to boot.
Volta's most noteworthy advancement was the inclusion of Tensor Cores: new hardware engines, of significant incremental chip cost (i.e., transistors/silicon area), that accelerate processing of deep neural networks (DNNs), the lifeblood of machine learning applications. Now, given Volta's primary focus on high-performance computing rather than 3D graphics, the choice to take on the silicon cost of Tensor Cores was certainly novel, but not particularly contentious.
But unlike Volta and its compute focus, Turing is a graphics-first architecture, so NVIDIA's decision to keep Tensor Cores in Turing raises a very pertinent question: Why would the company dedicate significant cost in a graphics-focused GPU to a feature that doesn't directly benefit graphics? Well, the answer is most interesting and fortunate: That old premise is no longer true, and NVIDIA is now finding compelling ways to leverage machine learning to improve the quality and performance of 3D imagery.
As discussed in detail in a previous column, "What Does NVIDIA's Ray Tracing News Mean for the CAD Market?" NVIDIA figured out a way to leverage deep learning to significantly improve the performance of ray-traced 3D rendering. Specifically, RTX software uses the Tensor Core hardware to run a DNN within the ray-tracer, accelerating image "convergence" by decreasing the computational load in the latter stages of rendering. Once the image converges into something the DNN can recognize, AI fills in the remaining rays/pixels, intelligently de-noising the image and wrapping up the time-consuming rendering process far faster than is possible via exhaustive, full-resolution ray processing.
RTX technology (right side) accelerates ray-tracing through AI-accelerated de-noising, compared with the same number of rays without de-noising (left). (Source: NVIDIA)
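The de-noising idea described above can be sketched in a few lines. This is a toy illustration, not NVIDIA's pipeline: the "scene," the noise model, and the use of a simple box filter in place of the trained DNN are all assumptions made for demonstration. The principle it shows is the real one, though: a cheap, low-sample render is noisy, and an inexpensive cleanup pass gets it much closer to the fully converged image than the raw samples alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth radiance for a toy 64x64 "scene": a smooth gradient.
x = np.linspace(0.0, 1.0, 64)
truth = np.outer(x, x)

def render(samples_per_pixel):
    """Monte Carlo render: each sample is the true radiance plus shot
    noise, so fewer samples per pixel means a noisier image."""
    noise = rng.normal(0.0, 0.2, size=(samples_per_pixel,) + truth.shape)
    return (truth + noise).mean(axis=0)

def denoise(img):
    """Stand-in for the learned de-noiser: a 3x3 box filter.
    In RTX, a trained DNN running on the Tensor Cores plays this role."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy:65 + dy, 1 + dx:65 + dx]
    return out / 9.0

def rmse(a, b):
    return float(np.sqrt(((a - b) ** 2).mean()))

noisy = render(samples_per_pixel=4)   # cheap, low-sample pass
cleaned = denoise(noisy)              # cleanup pass on the noisy result

# The de-noised low-sample image lands closer to ground truth than
# the raw low-sample image: fewer rays, comparable final quality.
print(rmse(noisy, truth), rmse(cleaned, truth))
```

The trade the real system makes is the same one this sketch makes: spend a little compute on reconstruction to avoid spending a lot of compute on additional rays.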
No doubt enthused by the successful synergy of machine learning and graphics in ray-trace processing, NVIDIA researchers began exploring other ways to extract more visual processing goodness out of its GPUs' AI prowess. Building on the use of DNNs for ray-trace de-noising, NVIDIA unveiled NGX technology, comprising an expanded set of DNN-driven image-enhancement features.
The most relevant and compelling example of using Turing/NGX to enhance conventional 3D graphics is Deep Learning Super Sampling (DLSS). Essentially, DLSS relies on a Tensor Core–accelerated DNN that replaces the usual brute-force pixel super-sampling with intelligent choices based on the scene geometry, drawing both on the current frame and on interframe temporal changes. The benefit is that quality improves at the same performance level, or (likely more interesting for most applications, because resolution-dependent quality is pretty darn good at this point) performance increases significantly at the same quality level. And NVIDIA surely sees NGX today not as a fixed set of features, but as an evolving and expanding toolbox of DNNs that can further harness machine learning for the benefit of NVIDIA's traditional visual markets as time goes on.
AI-enabled DLSS anti-aliasing: There's great value in any GPU feature that can deliver the same quality in fewer cycles. (Source: NVIDIA)
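The economics behind DLSS-style reconstruction are easy to see in numbers. The sketch below is an assumption-laden stand-in, not NVIDIA's network: it shades a half-resolution frame and uses a trivial nearest-neighbor upscale where DLSS would apply its trained DNN. What it demonstrates is the budget shift: presenting 1920×1080 while shading 960×540 means a quarter of the per-pixel shading work per frame.

```python
import numpy as np

# DLSS idea in miniature: shade fewer pixels, then reconstruct the
# full-resolution frame. A trained DNN does the reconstruction in
# DLSS; a nearest-neighbor upscale stands in for it here.

low_res = (960, 540)      # pixels actually shaded (assumed example)
target  = (1920, 1080)    # pixels presented on screen

shaded = low_res[0] * low_res[1]
presented = target[0] * target[1]
print(presented / shaded)   # 4.0 -> 4x fewer shading ops per frame

def upscale2x(img):
    """Placeholder for the learned reconstruction: duplicate each
    shaded pixel into a 2x2 block of the output frame."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

frame = np.random.default_rng(1).random((540, 960))  # toy shaded frame
full = upscale2x(frame)                              # presented frame
```

The quality of a real DLSS frame comes from the DNN making far smarter per-pixel choices than this duplication does; the performance headroom comes from the arithmetic above.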
And NVIDIA Doubles Down with RT Cores
Not only did NVIDIA preserve those Tensor Cores when creating Turing, it took the further step of adding multiple instances (one per SM) of an entirely new core design: the RT Core. Specifically, the RT Core takes on a critical ray-tracing computing task, one that, when executed on previous GPUs' SMs, proved cumbersome, inefficient, and time-intensive. Determining whether a ray (shot from a viewport out into the scene) actually intersects an object (and which triangle on that object's surface) is one of those tasks that a traditional raster-based 3D shader wasn't designed to do, and therefore doesn't do particularly well. With Turing, that job is now left to the RT Cores, freeing up the SMs to spend cycles instead on the 3D shader processing they're more adept at executing.
Turing's Streaming Multiprocessor with RT Core. (Source: NVIDIA)
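To make the RT Core's job concrete, here is the classic ray/triangle intersection test (the Möller–Trumbore algorithm) in software. To be clear, NVIDIA hasn't said the RT Core implements this exact algorithm; this is simply the standard formulation of the task the column describes, and it shows why a shader core is a poor fit: every ray-versus-triangle query costs a pile of cross products, dot products, and branches, repeated across millions of rays and triangles.

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Möller–Trumbore ray/triangle intersection: the kind of work
    Turing offloads from the SMs to its RT Cores.
    Returns the hit distance t along the ray, or None on a miss."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                 # ray parallel to triangle plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det        # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det       # distance along the ray
    return t if t > eps else None

# A ray shot down the z-axis toward a triangle in the z=5 plane.
origin = np.array([0.25, 0.25, 0.0])
direction = np.array([0.0, 0.0, 1.0])
tri = [np.array([0.0, 0.0, 5.0]),
       np.array([1.0, 0.0, 5.0]),
       np.array([0.0, 1.0, 5.0])]
print(ray_triangle_intersect(origin, direction, *tri))  # 5.0
```

In a full tracer this inner test sits beneath an acceleration-structure traversal that prunes most triangles per ray; dedicating fixed-function hardware to that whole query is what buys the RT Core its throughput.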
What does the internal micro-architecture of the RT Core look like? Well, NVIDIA hasn't exposed the guts, but given its task — a lot of 3D geometry processing — there's little doubt some high-performance vector and matrix floating-point units form its foundation. What NVIDIA has disclosed is the performance of the RT Core: 10 GigaRays/second. Now, in absolute terms, that's a hard number to assess — like triangles/second in the rasterization world — as it all depends on the workload of each ray measured. The more relevant, apples-to-apples comparison is that rate relative to Pascal's 1.1 GigaRays/second, presumably with the same workload per ray. Given that, Turing is packing a nearly 10X (roughly 9X, by the raw numbers) performance improvement in processing a crucial and demanding portion of the ray-tracing pipeline.
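The speedup above follows directly from the two quoted rates:

```python
# NVIDIA's quoted ray-cast rates (GigaRays/second), assuming the
# same per-ray workload on both architectures.
turing_gigarays = 10.0
pascal_gigarays = 1.1

speedup = turing_gigarays / pascal_gigarays
print(round(speedup, 1))  # 9.1 -> roughly an order of magnitude
```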