Summary: In early 2018, Nvidia showcased real-time ray tracing using their DGX Station supercomputer and the DXR API. The Turing architecture, introduced in late 2018, brought significant improvements that enabled real-time ray tracing. The RTX AIBs featured fixed-function RT cores for ray tracing acceleration. Nvidia also introduced DLSS, leveraging AI to improve gaming performance. Turing GPUs incorporated tensor cores for matrix-math processing. Texture-Space Shading allowed for efficient caching and sampling of shading results. The Turing architecture laid the foundation for dedicated ray tracing and tensor engines, full mesh-shading capabilities, and advancements in real-time rendering. Nvidia's DLSS 3.0 further improved performance, enabling 4K real-time ray tracing at respectable frame rates.
Ray tracing is the complex process of generating the light rays that illuminate a visible scene, along with the bounces and reflections of those rays. Millions, if not billions, of rays are generated, and their luminosity and color must be calculated for every interaction they experience. It is a horrendous calculation workload, and it is aggravated by resolution because the number of visible pixels grows as the product X × Y.
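To make the scaling concrete, here is a back-of-the-envelope estimate; the samples-per-pixel and bounce counts are illustrative assumptions, not figures from the text:

```python
# Back-of-the-envelope ray budget: rays per frame grow with the
# pixel count (X * Y), the samples per pixel, and the bounces per sample.
def rays_per_frame(width, height, samples_per_pixel=4, bounces=3):
    # Each camera ray can spawn a secondary ray at every bounce.
    return width * height * samples_per_pixel * (1 + bounces)

for name, (w, h) in {"1080p": (1920, 1080),
                     "1440p": (2560, 1440),
                     "4K":    (3840, 2160)}.items():
    print(f"{name}: {rays_per_frame(w, h):,} rays per frame")
```

Even with these modest assumptions, the 4K case works out to roughly 133 million rays per frame, or nearly eight billion rays per second at 60 frames per second.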
Nvidia introduced their GeForce RTX and Quadro RTX AIBs at SIGGRAPH in Vancouver, BC, in August 2018.
When the GeForce RTX 20 series AIBs were introduced in September 2018, Nvidia promoted DLSS as a prominent feature. However, its application was initially limited to a few video games, such as Battlefield V and Metro Exodus, because the algorithm required individual training for each game, and the outcomes were generally not as impressive as basic resolution upscaling.
In the context of geometric objects, a bounding volume hierarchy (BVH) is a tree structure used to organize and categorize the objects. The objects themselves form the leaves of the tree, each enveloped by a bounding volume. These nodes are then grouped into small sets and enclosed within larger bounding volumes. The grouping continues recursively, with nested nodes enclosed within increasingly larger bounding volumes, ultimately forming a tree in which a single bounding volume at the root encompasses the entire hierarchy. Bounding volume hierarchies find application in collision detection (interference detection) and in ray tracing.
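A minimal sketch of the structure, using a naive bottom-up pairing rather than a production build heuristic such as a surface-area split:

```python
# Minimal axis-aligned BVH sketch: leaves hold objects, interior
# nodes hold a bounding box that encloses both children.
from dataclasses import dataclass

@dataclass
class AABB:
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

    @staticmethod
    def union(a, b):
        # Smallest box enclosing both input boxes.
        return AABB(tuple(map(min, a.lo, b.lo)),
                    tuple(map(max, a.hi, b.hi)))

@dataclass
class Node:
    box: AABB
    left: "Node" = None   # None for leaves
    right: "Node" = None
    obj: object = None    # the enclosed object, for leaves

def build(leaves):
    # Naive bottom-up build: pair nodes until one root remains.
    nodes = leaves[:]
    while len(nodes) > 1:
        a, b = nodes.pop(), nodes.pop()
        nodes.insert(0, Node(AABB.union(a.box, b.box), a, b))
    return nodes[0]

leaves = [Node(AABB((i, 0, 0), (i + 1, 1, 1)), obj=f"tri{i}") for i in range(4)]
root = build(leaves)
```

A ray query then descends from the root, skipping any subtree whose bounding volume the ray misses, which turns a brute-force test against every object into a roughly logarithmic traversal.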
Solving the ray-tracing algorithm was step one. The other, maybe larger, problem was managing the screen resolution.
In 2017, Nvidia was enjoying the uptake of GPUs in the data center for compute acceleration. At the same time, AI training was getting into full swing, and as fate would have it, AI training was (and is) an ideal candidate for parallel processing, the GPU's major strength.
But AI needed another type of recursive processing, one built on matrix math. Matrix math can be done on a GPU's shaders, but they are not the most efficient engine for it. Other approaches tried by other firms were DSPs and dedicated matrix-math processors constructed in FPGAs. DSPs were (and are) used successfully in mobile devices such as smartphones, and Qualcomm is a leader in that area. FPGAs are functionally effective but not fast. A new, dedicated matrix-math processor was needed. Google was one of the first to offer such a device, which it named a tensor processor. In mathematics, "a tensor is an algebraic object that describes a multilinear relationship between sets of algebraic objects related to a vector space." But the straightforward way to think about it is that a tensor processor does 3D processing in real time, visualized as the diagram in Figure 2.
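The operation a tensor core accelerates is a small fused matrix multiply-accumulate, D = A × B + C, applied to little tiles of a larger matrix. A plain-Python sketch of the arithmetic (the hardware executes this on small tiles in parallel; the loop here only shows what is computed):

```python
# D = A @ B + C on small square matrices: the fused
# multiply-accumulate at the heart of a tensor core.
def matmul_accumulate(A, B, C):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
print(matmul_accumulate(A, B, C))  # [[20, 22], [43, 51]]
```

Training and inference for neural networks reduce largely to enormous numbers of exactly these small multiply-accumulates, which is why a dedicated engine pays off.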
So now the Nvidia Turing GPU, employed on the new RTX 2080 AIB (and Quadro RTX AIB), had two new ASIC-like math engines: RT cores for ray tracing and tensor cores for matrix math. The Turing GPU was the first general-purpose consumer AI processor.
One of the other features of the new GPU was Deep Learning Super Sampling (DLSS). Nvidia claimed tensor cores could be used to improve gaming performance through DLSS: "DLSS leverages a deep neural network to extract multidimensional features of the rendered scene and intelligently combine details from multiple frames to construct a high-quality final image. This allows Turing GPUs to use half the samples for rendering and AI to fill in the information to create the final image." The result, said Nvidia, is an image of similar quality to traditional rendering (which typically relies on temporal AA in most of today's latest games), but with higher performance. Nvidia said DLSS was an easy integration for developers, and at the time (fall 2018), developers announced that 25 games would have DLSS support.
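The broad recipe (render with fewer samples, then reconstruct the full-resolution image with help from previous frames) can be caricatured in a few lines of Python. This is a naive nearest-neighbor upscale plus a history blend, emphatically not Nvidia's trained network; the 2× factor and the `alpha` blend weight are illustrative assumptions:

```python
# Toy reconstruction: upscale a half-resolution frame and blend it
# with the previous full-resolution output (temporal accumulation).
# DLSS replaces both steps with a trained neural network.
def upscale2x(img):
    # Nearest-neighbor 2x upscale of a 2D list of pixel values.
    out = []
    for row in img:
        wide = [p for p in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def reconstruct(half_res_frame, history, alpha=0.1):
    up = upscale2x(half_res_frame)
    # Exponential moving average with the previous output.
    return [[alpha * u + (1 - alpha) * h for u, h in zip(ur, hr)]
            for ur, hr in zip(up, history)]
```

The point of the sketch is only the data flow: half the shading work per frame, with the missing information recovered from temporal history.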
The powerful GPU, which already had thousands of SIMD processors, video codec processors, audio processors, display and raster-operation processors, memory managers, and I/O processors, now got two new processors: RT and tensor. The Turing GPU was a tour de force.
The Turing architecture introduced important enhancements to the core execution data paths. Previous shader workloads consisted of a mix of floating-point (FP) arithmetic instructions, such as floating add (FADD) or floating multiply-add (FMAD), along with simpler instructions such as integer adds for addressing and data retrieval, floating-point compares, or min/max for result processing. In prior shader architectures, the floating-point math data path sat idle whenever one of those non-FP-math instructions was executed. Turing addressed that limitation by incorporating a second, parallel execution unit alongside each CUDA core, so integer instructions execute concurrently with the floating-point math.
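A toy issue model shows why that matters. Assume, purely for illustration, that every instruction takes one cycle and that the stream contains 36 integer instructions per 100 FP instructions (an illustrative mix, not a measured one); with one shared pipe the two classes serialize, while with parallel pipes the integer work hides behind the FP work:

```python
# Toy issue model: one instruction per pipe per cycle.
# With a single shared pipe, FP and INT instructions serialize;
# with separate FP and INT pipes they can issue concurrently.
def cycles(fp_ops, int_ops, dual_pipe):
    if dual_pipe:
        return max(fp_ops, int_ops)  # INT work hides behind FP work
    return fp_ops + int_ops          # everything shares one pipe

fp, integer = 100, 36  # illustrative instruction mix
print("single pipe:", cycles(fp, integer, False))  # 136 cycles
print("dual pipe:  ", cycles(fp, integer, True))   # 100 cycles
```

Under these assumptions the dual-path design finishes the same work in roughly three-quarters of the cycles.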
Within the new architecture, the primary enablers for Turing’s improvement in graphics performance were a new GPU processor (streaming multiprocessor—SM) architecture with improved shader execution efficiency and a new memory system architecture that includes support for the latest GDDR6 memory technology.
The Turing GPU broke records for superlative numbers, with 18.6 billion transistors (fabricated on TSMC's 12 nm FinFET process), 14.2 TFLOPS of peak single-precision (FP32) performance, and two new measurements: 113.8 Tensor TFLOPS and 10 Giga Rays/sec.
"Turing had 4608 shaders (CUDA cores) capable of mesh shading." Mesh shading advanced the geometry-processing architecture "by offering a new shader model for the vertex, tessellation, and geometry shading stages of the graphics pipeline, supporting more flexible and efficient approaches for the computation of geometry."
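The 14.2 TFLOPS figure quoted above and the 4608-core count are consistent with each other: peak FP32 throughput is cores × 2 FLOPs per clock (a fused multiply-add counts as two) × clock rate. A quick check, assuming a boost clock of roughly 1.54 GHz (our assumption; no clock is quoted above):

```python
cuda_cores = 4608
flops_per_core_per_clock = 2    # fused multiply-add = 2 FLOPs
boost_clock_hz = 1.54e9         # assumed boost clock (~1.54 GHz)
peak_tflops = cuda_cores * flops_per_core_per_clock * boost_clock_hz / 1e12
print(f"{peak_tflops:.1f} TFLOPS")  # ~14.2 TFLOPS, matching the quoted figure
```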
Another new feature introduced with the Turing architecture was variable-rate shading (VRS). VRS offers dynamic control over the frequency of shading in graphics rendering: developers can adjust the shading rate from once per sixteen pixels up to eight times per pixel. VRS enables more efficient shading by reducing unnecessary work in areas of the screen "where full-resolution shading would not visibly enhance image quality. Various categories of VRS-based algorithms have already been identified, including Content Adaptive Shading (which adjusts shading based on content level of detail), Motion Adaptive Shading (which varies shading based on the rate of content motion), and Foveated Rendering (which considers lens resolution and eye position, particularly in VR applications)."
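To make the content-adaptive variant concrete, here is a toy sketch: shade a block once when it is nearly uniform, per pixel otherwise. The 4×4 block size and the flatness threshold are arbitrary illustrations, and `shade` stands in for whatever per-pixel work the renderer does:

```python
# Content-adaptive shading sketch: shade a 4x4 block once when the
# block is nearly uniform, otherwise shade every pixel in it.
def shade_variable_rate(luma, shade, block=4, threshold=0.05):
    h, w = len(luma), len(luma[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            vals = [luma[y][x] for y in ys for x in xs]
            if max(vals) - min(vals) < threshold:
                v = shade(bx, by)          # flat block: one shade, reused
                for y in ys:
                    for x in xs:
                        out[y][x] = v
            else:
                for y in ys:
                    for x in xs:
                        out[y][x] = shade(x, y)  # detailed block: full rate
    return out
```

In flat regions the shader runs once per sixteen pixels instead of sixteen times, which is exactly the savings the hardware feature exposes.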
Texture-Space Shading was yet another new development in the Turing architecture. “With texture-space shading, objects are shaded in a private coordinate space (a texture space) that is saved to memory, and pixel shaders sample from that space rather than evaluating results directly. With the ability to cache shading results in memory and reuse/resample them, developers can eliminate duplicate shading work or use different sampling approaches that improve quality.”
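In miniature, the caching idea is memoization keyed by texel rather than by screen pixel. A toy sketch, with a dictionary standing in for the texture and `shade_texel` as a stand-in for the expensive shading work:

```python
# Toy texture-space shading: shade once per texel, cache the result,
# and let every screen pixel that maps to that texel reuse it.
shading_cache = {}  # texel (u, v) -> shaded color

def shade_texel(u, v):
    # Stand-in for an expensive shading computation.
    return (u * 0.1 + v * 0.2) % 1.0

def sample(u, v):
    if (u, v) not in shading_cache:       # shade only on a cache miss
        shading_cache[(u, v)] = shade_texel(u, v)
    return shading_cache[(u, v)]          # reuse across pixels and frames
```

Because the cache persists, results can also be reused across frames or sampled at a different rate than the screen resolution, which is where the quality and performance flexibility comes from.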
The GPU was big: 754 mm² and 18.6 billion transistors, a new high in both dimensions. It was also a heavy user of power, sucking up to 280 W.
But Turing was the foundation of a new generation of GPUs with dedicated ray tracing and tensor engines and full mesh-shading capabilities. Nvidia paid the price of being first and had to encourage and support game developers to embrace the new rendering approach. Most of them welcomed it, but game development takes time, so major titles didn't show up for a year or more. AMD held back on implementing RT acceleration, but when Intel entered the market, it offered RT acceleration and its XeSS scaling capability with the launch of its Xe GPUs in September 2020. AMD launched their first hardware RT-accelerated GPU, the Radeon RX 6000, in October 2020. By then Nvidia was on its second generation and preparing its third.
In mid-2023, Nvidia was the undisputed leader in real-time ray tracing and was shipping DLSS 3.0, which improved performance even more and allowed 4K real-time ray tracing (RTRT) at respectable frame rates.
You can find out more about ray tracing technology, history, and applications in Dr. Peddie's book, Ray Tracing: A Tool for All.
Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.