Command Processors on GPUs, and How They Affect Performance

A GPU is an extremely complex processor: a heterogeneous system made up of several different types of units that must be coordinated to produce a coherent result. In this article we describe the command processor, the part of the GPU in charge of that coordination.

Regardless of architecture or brand, every GPU has one central component in common: the command processor, the unit that automatically manages the operation of the dozens of different units on the GPU.

What is a command processor?

The command processor of a GPU is a microcontroller in charge of reading the command list (also called the display list) generated by the CPU. To do so, it uses the GPU’s own DMA unit to access not the VRAM but the system’s main RAM, where the command list is stored. After finding the list in RAM, it copies it to the microcontroller’s internal memory.
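
The fetch step described above can be sketched in a few lines of Python. This is only an illustrative model, not how any real GPU firmware is written: the Command class, the opcode names, the dma_copy helper and the 0x1000 address are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    opcode: str        # hypothetical opcode name, not a real GPU encoding
    args: tuple = ()


def dma_copy(system_ram, address, count):
    """Stand-in for the DMA unit: read `count` commands starting at `address`."""
    return [system_ram[address + i] for i in range(count)]


class CommandProcessor:
    def __init__(self):
        self.internal_memory = []  # the microcontroller's local copy of the list

    def fetch(self, system_ram, address, count):
        # The list lives in system RAM, not VRAM, so it is pulled in via DMA.
        self.internal_memory = dma_copy(system_ram, address, count)

    def run(self):
        # Walk the local copy in order; a real processor would drive GPU units here.
        return [cmd.opcode for cmd in self.internal_memory]


# System RAM modeled as address -> command; the CPU wrote the list at 0x1000.
system_ram = {
    0x1000: Command("SET_STATE", ("blend", "off")),
    0x1001: Command("DRAW", (3,)),
    0x1002: Command("PRESENT"),
}

cp = CommandProcessor()
cp.fetch(system_ram, 0x1000, 3)
executed = cp.run()  # opcodes in the order the CPU wrote them
```

The point of the model is the separation of roles: the CPU only writes the list to RAM, and the command processor works from its own local copy afterwards.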

The command list includes all the instructions that the different units of a GPU have to execute to render an image, in either 2D or 3D. With the arrival of DirectX 11 on the PC came the so-called Compute Shaders: shader programs that are not tied to the graphics pipeline and that allow the GPU to be used for algorithms at which the CPU is less efficient.

Nowadays a GPU is not only used to render impressive graphics for video games; it has many other uses and serves several different markets. The evolution of graphics cards towards these markets has gone hand in hand with the evolution of the command processor and its capabilities.

What does asynchronous computing mean?

First of all, it should be clarified that Compute Shaders are also used within the graphics pipeline, especially in pre-processing and post-processing of the image. For example, they are used to calculate lighting in deferred rendering. In those cases the execution of the Compute Shaders depends on the execution of the rest of the graphics pipeline, so we say that it is synchronous. But there are also tasks that benefit from the GPU and are not part of rendering the scene, and these run asynchronously.

To visualize this better, consider two situations:

In the first one we are making bread but find that we lack flour, so we ask someone to go and get it. This means that we cannot continue with the bread while we wait for the flour to arrive.
The second situation follows from the first: since we cannot make bread, we decide to wash the dishes, something that we can do at any time and that has nothing to do with the bread.

GPU designers realized that every GPU had bubbles in its execution: short periods of time in which some parts of the GPU did nothing. That is why, a few years ago, they decided to implement asynchronous computing and to collaborate in the development of APIs that make use of it, such as DirectX 12 and Vulkan.
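
As a toy illustration of that idea, the sketch below models a graphics timeline as a list of time slots, where None marks a bubble, and fills the bubbles with unrelated compute tasks. The stage names, task names and the one-task-per-slot granularity are assumptions chosen for the example; real hardware schedules at a much finer level.

```python
def fill_bubbles(graphics_timeline, compute_tasks):
    """Fill idle slots (None) with pending asynchronous compute tasks."""
    pending = list(compute_tasks)
    filled = []
    for slot in graphics_timeline:
        if slot is None and pending:
            filled.append(pending.pop(0))  # a bubble: do the "dishes" here
        else:
            filled.append(slot)            # graphics work keeps its slot
    return filled, pending


def utilization(timeline):
    """Fraction of slots in which the GPU is doing something."""
    return sum(slot is not None for slot in timeline) / len(timeline)


# None marks a bubble where the graphics pipeline is waiting on something.
graphics = ["vertex", None, "raster", "pixel", None, None, "post"]
timeline, leftover = fill_bubbles(graphics, ["physics", "audio"])

before = utilization(graphics)  # 4 of 7 slots busy
after = utilization(timeline)   # 6 of 7 slots busy
```

The graphics work never moves; only the otherwise wasted slots are used, which is the washing-the-dishes situation from the analogy.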

What are command lists?

Today, the CPU itself is in charge of building the different command lists, either on a single core or on multiple cores that create them in parallel. In video games, one core is usually assigned to create the graphics list, which is much more complex than the others and usually comes from a single ring buffer in memory. The command lists for computing are much simpler: they ask the shader units to solve a specific problem and return the solution.

Command lists for computing usually come as several different lists, which can be processed simultaneously with each other and with the graphics list. The reason is that they are asynchronous and do not depend on each other to function. This independence makes it possible to take advantage of parts of the GPU that would otherwise sit idle.
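
A minimal sketch of what this independence buys, assuming a simple round-robin policy that is not taken from any particular GPU: because no list depends on another, commands from different lists can be interleaved freely as long as the order inside each list is kept. The queue and command names are hypothetical.

```python
from collections import deque


def interleave(command_lists):
    """Round-robin over independent lists, one command per list per pass."""
    queues = {name: deque(cmds) for name, cmds in command_lists.items()}
    order = []
    while any(queues.values()):
        for name, queue in queues.items():
            if queue:
                order.append((name, queue.popleft()))
    return order


command_lists = {
    "graphics": ["draw0", "draw1", "draw2"],
    "compute0": ["c0"],
    "compute1": ["k0", "k1"],
}
order = interleave(command_lists)


def per_list(order, name):
    """Commands of one list, in the order they were issued."""
    return [cmd for owner, cmd in order if owner == name]
```

Whatever interleaving is chosen, per_list(order, "graphics") still yields the draws in their original order, which is the only constraint independent lists impose.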

The other type of commands are those related to accessing the system’s RAM or the VRAM, and they are executed in both computing and graphics. In graphics, memory operations are done exclusively in VRAM, while in computing mode data can be imported from or exported to both RAM and VRAM, since in some cases the GPU is responding to a computation request from the CPU.

Graphics APIs and command processors

Originally the graphics list and the compute list were managed together, which was very inefficient. It was not until the advent of GPUs with separate command processors for graphics and computing, able to operate both synchronously and asynchronously with each other, that GPUs could handle several different command lists in parallel.

Command lists are stored in what are called ring buffers. Each command processor is assigned a range of memory addresses; when it reaches the last address it can access, it starts again from the first, as if it were going around in circles. That is why it is called a ring buffer, and why we have represented them as small rings in the diagram above.
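
The wrap-around behaviour can be sketched as follows. This is a generic ring buffer, not the layout of any specific GPU: the RingBuffer class, its pointer names and the idea that the CPU simply waits when the ring is full are assumptions for illustration.

```python
class RingBuffer:
    """Fixed-size command ring: the CPU writes, the command processor reads."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_ptr = 0  # advanced by the CPU (producer)
        self.read_ptr = 0   # advanced by the command processor (consumer)
        self.count = 0      # commands written but not yet read

    def push(self, command):
        if self.count == len(self.slots):
            return False  # ring is full: the CPU has to wait
        self.slots[self.write_ptr] = command
        # Past the last slot, the pointer wraps back to slot 0.
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None  # nothing to execute
        command = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)
        self.count -= 1
        return command


ring = RingBuffer(3)
for name in ["a", "b", "c"]:
    ring.push(name)
rejected = ring.push("d")  # False: all three slots are in use
first = ring.pop()         # "a" leaves, freeing slot 0
ring.push("d")             # "d" is written into slot 0: the ring has wrapped
```

The modulo operation on each pointer is what makes the memory behave like a ring: neither the writer nor the reader ever runs off the end of the assigned range.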

Types of Command Processors

There are different types of command processors, each with its own purpose. Which type a graphics card uses depends on the market it is aimed at:

Graphics only: no longer used today. In the past there was only one command processor and it handled graphics exclusively.
With an intelligent scheduler: when several command lists are managed in parallel, specifically for computing, it is generally the system’s own CPU that has to coordinate their execution. A command processor with an intelligent scheduler can reorder the command list in real time without CPU intervention.
Compute only: used in scientific and high-performance computing. These GPUs cannot generate graphics because they either lack a graphics command processor or have it disabled. This is the case of the CDNA GPUs in AMD Instinct, various NVIDIA Tesla cards and other graphics cards aimed at computing.
Virtualized: used in data centers, especially for cloud computing. They can handle several graphics command lists at the same time, each independent of the others. Each list corresponds to a virtual machine running a different operating system for a different remote user.

Interaction of the command processor with the rest of the GPU

The command processor does not execute any program itself; it is a great organizer responsible for distributing tasks among the units available at any given moment. The graphics command processor has access not only to the shader units of the GPU but also to the fixed function units. A compute command processor, on the other hand, has access only to the shader units, and its way of operating is different.

How do the different units coordinate with each other? Each fixed function unit and shader unit has a kind of mailbox that can send and receive messages, so data moves in two directions:

When exporting data, the shader unit can export to a lower level of the cache, to a fixed function unit, to another shader unit or even to the memory assigned to it, be it system RAM or VRAM.
When importing data, it is the command processor and the sending unit that are responsible for delivering the data to the shader unit. Periodically, the command processor fills the data and instruction caches of each shader unit with the tasks it will have to perform, since shader units cannot fetch instructions on their own the way a CPU does.

The list of instructions and data that the command processor sends to each unit ends with a final command that tells the unit where to export the data once it has finished calculating it. Which units receive the data and instructions to be processed, and where the results are sent, is up to the command processor, which performs the task without us having to worry about it.
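
To make the import and export flow concrete, here is a small sketch under simplifying assumptions: each unit of work is a hypothetical WorkPacket carrying a program, its data and an export destination, and the command processor hands packets to shader units in round-robin order. None of these names or policies come from a real GPU.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorkPacket:
    program: Callable[[int], int]  # stands in for the shader instructions
    data: list                     # input data loaded into the unit
    export_to: str                 # final command: where the result must go


def dispatch(packets, shader_units=2):
    """Assign packets to shader units and route each result to its destination."""
    destinations = {}  # destination name -> list of results exported there
    assignments = []   # which shader unit ran each packet
    for index, packet in enumerate(packets):
        unit = index % shader_units  # assumed round-robin assignment
        result = [packet.program(x) for x in packet.data]
        destinations.setdefault(packet.export_to, []).append(result)
        assignments.append(unit)
    return destinations, assignments


packets = [
    WorkPacket(lambda x: x * 2, [1, 2, 3], "vram"),
    WorkPacket(lambda x: x + 1, [10], "fixed_function"),
    WorkPacket(lambda x: x * x, [4], "vram"),
]
destinations, assignments = dispatch(packets)
```

The shader units in this model never decide anything themselves: what they run, on which data and where the output goes is all contained in the packet handed to them.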