Multi-GPU gaming on chiplets: architecture and features for PC

Multi-GPUs by chiplets are just around the corner and although we will first see them in the form of HPC cards, and therefore outside the gaming market, we have long known that the evolution is towards the construction of graphics cards based on Multi- GPUs per chiplets. But what do they bring compared to a conventional monolithic GPU? Read on to find out.

The architecture that we discuss in this article is not yet available on the market, it has not even been presented, but it is the product of an analysis of the advances produced in recent years, as well as of the different patents on Multi-GPU chiplets that both AMD, NVIDIA and Intel have been publishing in the last two years. That is why we have decided to take that information and synthesize so that you have an idea of how these types of GPUs work and what graphic problems they come to solve.

Traditional 3D rendering with multiple GPUs

Using multiple graphics cards to combine their power to render each frame in 3D video games is not new, since the Voodoo 2 by 3dfx it is possible to divide the rendering work, totally or partially, between several graphics cards. The most common way of doing it is Alternate Frame Rendering, where the CPU sends the screen list of each frame alternately to each GPU. For example, GPU 1 handles frames 1, 3, 5, 7, while GPU 2 handles frames 2, 4, 6, 8, etc.

There is another way to render a scene in 3D, which is Split Frame Rendering, which consists of several GPUs rendering a single scene and dividing the work, but with the following nuances: a GPU is the master GPU that reads the list of screen and handles the rest. The first stages of the pipeline, prior to rasterization, are carried out exclusively on the first GPU, as for rasterization and the later stages are carried out equally on each GPU.

Split Frame Rendering seems an equitable way to distribute the work, however, now we will see what are the problems that this method entails and with what limitations it is.

The limitations of Split Frame Rendering and the potential solution

Each GPU contains 2 collections of DMA drives, the first pair can simultaneously read or write data in the system RAM through the PCI Express port, but in many graphics cards with Crossfire or SLI support there is another collection of DMA drives, the which allow access to the VRAM of the other graph. Of course, at the speed of the PCI Express port, which is a real bottleneck.

Ideally, all GPUs working together would have the same VRAM memory well in common, but this is not the case. So the data is duplicated as many times as the number of graphics cards involved in rendering, which is grossly inefficient. To this we have to add the way in which graphics cards work when rendering 3D graphics in real time, which has caused the configuration with multiple graphics cards to be no longer used.

Tile Caching on a Multi-GPU by chiplets

The Tile Caching concept began to be used from NVIDIA’s Maxwell architecture and AMD’s Vega architecture, it is about taking some concepts from rendering by tiles, but with the difference that instead of rendering each tile in a separate memory and only writing it to the VRAM when it is finished is done on the second level cache. The advantage of this is that it saves on the energy cost of some graphics operations, but the disadvantage is that it depends on the amount of top-level cache that is on the GPU.

The problem is that a cache does not work like a conventional memory and at any moment and without program control a cache line can be sent to the next level of the memory hierarchy. What if we decide to apply the same functionality to a chiplet-based GPU? Well, this is where the additional cache level comes in. Under the new paradigm, the last level cache of each GPU is ignored as memory for Tile Caching and the last level cache of the Multi-GPU is now used, which would be found on a separate chip.

The LCC on a Multi-GPU by chiplets

The latest-level cache for chiplet-based Multi-GPUs brings together a number of common characteristics that are independent of who the manufacturer is, so the following list of characteristics applies to any GPU of this type, regardless of the manufacturer .

It is not found in any of the GPUs, but is external to them and therefore is on a separate chip.
It uses an interposer with a very high speed interface such as a silicon bridge or TSV interconnects to communicate with the L2 cache of each GPU.
The high bandwidth required does not allow conventional interconnections and is therefore only possible in a 2.5DIC configuration.
The chiplet where the last level cache is located not only stores said memory, but is also where the entire VRAM access mechanism is located, which in this way is decoupled from the rendering engine.
Its bandwidth is much higher than that of HBM memory, which is why it makes use of more advanced 3D interconnection technologies, which allow much higher bandwidths.
In addition, like any last-level cache, it has the ability to give consistency to all the elements that are clients of it.

Thanks to this cache, each GPU is prevented from having its own VRAM well in order to have a shared one, which greatly reduces the multiplicity of data and eliminates bottlenecks that are the product of communication in a conventional multi-GPU.

Master and subordinate GPUs

In a graphics card based on a Multi-GPU by chiplets, the same configuration still exists as in a conventional Multi-GPU when creating the display list. Where a single list is created, which receives the first GPU that is responsible for managing the rest of GPUs, but the big difference is that the LLC chiplet that we have discussed in the previous section allows the first GPU to coordinate and send tasks to the rest of multi-GPU processing units per chiplets.

An alternative solution is that all the chiplets of the Multi-GPU will completely lack the Command Processor and this is in the same circuitry as where the LCC chiplet is located as orchestra conductor and taking advantage of all the existing communication infrastructure to send the different instruction threads to different parts of the GPU.

In the second case we would not have a master GPU and the rest as subordinates, but the entire 2.5D integrated circuit would be a single GPU, but instead of being monolithic it would be composed of several chiplets.

Its importance for Ray Tracing

One of the most important points for the future is Ray Tracing, which to work requires the system to create a spatial data structure on the information of the objects in order to represent the transport of light. It has been shown that if said structure is close to the processor, the acceleration suffered by Ray Tracing is important.

Of course, this structure is complex and takes up a lot of memory. This is why having a large LLC cache will be extremely important in the future. And this is the reason why the LLC cache is going to be in a separate chiplet. To have the highest possible capacity and make that data structure as close to the GPU as possible.

Today much of the slowness in Ray Tracing is due to the fact that much of the data is in the VRAM and there is a huge latency in its access. Keep in mind that the LLC cache in a Multi-GPU would have the advantages not only in bandwidth, but also in latency of a cache. Furthermore, its large size and the data compression techniques being developed in the labs of Intel, AMD and NVIDIA will make the BVH structures used for acceleration to be stored within the “internal” memory of the GPU.