Load-Store Units: What They Are and What They Do in CPUs and GPUs


Among the essential pieces of any architecture are the Load-Store units, which are responsible for executing the memory-related instructions on both the CPU and the GPU. If you want to know what these units do and how they work, explained in a simple and accessible way, keep reading.

Communication between the CPU and memory is important. Here at HardZone we have written several articles explaining the different elements involved, and now it is the turn of the Load-Store units, which are essential in any architecture, both CPU and GPU.

What are Load-Store units?

Load-Store Unit

It is an execution unit in a CPU. Execution units are the ones used to resolve an instruction once it has been decoded. Let us recall in passing the other types of execution units:

  • ALUs: units responsible for executing the different types of arithmetic operations. They can work on a single number (scalar), a list of numbers (vector), or even a matrix.
  • Jump (branch) units: these units handle the jump instructions in the code, which make execution move to another part of the program in memory.

The Load/Store units, on the other hand, are in charge of executing the instructions that access the system's RAM, whether to read or to write. There is no single L/S unit; rather, there are two types of units that work in parallel and manage access to data.

The simplest description of their operation is as follows: a Load unit is in charge of copying information from RAM to the registers, and a Store unit does the same in the opposite direction. To function, these units have their own internal memory, where they store the memory requests for each instruction.
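The behavior described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `LoadStoreModel`, the four registers `r0` to `r3`, and the single shared request queue are all hypothetical choices for illustration, not a description of any real CPU.

```python
from collections import deque


class LoadStoreModel:
    """Toy model: RAM is a list of words, registers are a dict (both hypothetical)."""

    def __init__(self, ram_size=16):
        self.ram = [0] * ram_size
        self.regs = {f"r{i}": 0 for i in range(4)}
        # Internal memory of the unit: the pending memory requests.
        self.pending = deque()

    def load(self, reg, addr):
        """Queue a request to copy RAM[addr] into a register."""
        self.pending.append(("load", reg, addr))

    def store(self, reg, addr):
        """Queue a request to copy a register into RAM[addr]."""
        self.pending.append(("store", reg, addr))

    def drain(self):
        """Resolve the queued requests in program order."""
        while self.pending:
            op, reg, addr = self.pending.popleft()
            if op == "load":
                self.regs[reg] = self.ram[addr]   # RAM -> register
            else:
                self.ram[addr] = self.regs[reg]   # register -> RAM


lsu = LoadStoreModel()
lsu.ram[3] = 42
lsu.load("r0", 3)    # r0 <- RAM[3]
lsu.store("r0", 7)   # RAM[7] <- r0
lsu.drain()
print(lsu.regs["r0"], lsu.ram[7])  # prints "42 42"
```

Real hardware usually keeps separate queues for loads and stores and may resolve requests out of order; the single in-order queue here is a simplification.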

Where are the Load-Store units located?

Load-Store location

The first thing we might think is that the Load/Store units sit as close to the memory as possible. However, although their job is to move data between the RAM and the registers, they do not have direct access to the RAM. That task falls to another mechanism, which we already covered in "This is how the CPU accesses the RAM memory so quickly", where we discuss how the CPU's memory interface communicates with the RAM.

In its simplest conception, the Load/Store units communicate with the interface that connects the processor to the RAM, specifically with the MAR and MDR registers. They are the only units allowed to manipulate those registers and to transfer the data to the other registers for the execution of certain instructions.

Therefore, the Load/Store units are not located in the part closest to the memory. They sit halfway between the registers of the different execution units and the memory interface, which in every processor is found on the perimeter of the chip.
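As a rough illustration of the role of MAR (memory address register) and MDR (memory data register), the following sketch models a load and a store that only touch memory through those two registers. The class `MemoryInterface` and the functions `load` and `store` are hypothetical names, and the model ignores timing, buses and caches entirely.

```python
class MemoryInterface:
    """Hypothetical memory interface exposing only MAR and MDR."""

    def __init__(self, size=8):
        self.cells = [0] * size
        self.mar = 0   # address of the cell being accessed
        self.mdr = 0   # data read from or to be written to that cell

    def read(self):
        self.mdr = self.cells[self.mar]

    def write(self):
        self.cells[self.mar] = self.mdr


def load(mem, regs, dst, addr):
    """Load unit: place the address in MAR, read, copy MDR into a register."""
    mem.mar = addr
    mem.read()
    regs[dst] = mem.mdr


def store(mem, regs, src, addr):
    """Store unit: place address and data in MAR/MDR, then write."""
    mem.mar = addr
    mem.mdr = regs[src]
    mem.write()


mem = MemoryInterface()
regs = {"r0": 5, "r1": 0}
store(mem, regs, "r0", 2)   # memory cell 2 <- r0
load(mem, regs, "r1", 2)    # r1 <- memory cell 2
print(regs["r1"], mem.cells[2])  # prints "5 5"
```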

Adding a cache hierarchy

Split first-level cache

The cache is simply memory internal to the processor that holds copies of the data closest to where code execution currently is. Each level further from the core has greater storage capacity, but it is slower and has higher latency. Conversely, each level closer to the core contains only a portion of the level beyond it, but it is faster and has lower latency.

In current CPUs, every level stores instructions and data in the same memory except one: the first-level (L1) cache, which is split into one cache for instructions and another for data. Load/Store units never interact with the instruction cache, only with the data cache.

LSU Data Cache

When the Load units within each core need data, the first thing they do is "ask" the data cache whether it holds the information for a given memory address. The operation is read-only, so if the data is found it is copied from the cache to the corresponding register. If it is not found at one cache level, the search continues level by level. Think of it as someone looking for a document in a pyramid-shaped office building, where each level has more files to search through.

Store units, on the other hand, are a bit more complex. They also look up a memory address in the cache, but because they modify the data stored there, a coherence mechanism is needed to propagate the change for that memory address throughout the cache hierarchy and to RAM itself.
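The level-by-level search can be sketched as follows. Each cache level is modeled as a plain dictionary from address to value, ordered from fastest to slowest. The function name `load_through_hierarchy` is hypothetical, and copying the value into the faster levels after a miss is an assumption of this sketch (the text above does not specify a fill policy); eviction is not modeled at all.

```python
def load_through_hierarchy(addr, levels, ram):
    """Return (value, source) for addr, searching L1 first, RAM last."""
    hit_index = len(levels)  # assume a miss at every level until proven otherwise
    for index, cache in enumerate(levels):
        if addr in cache:
            hit_index = index
            break

    if hit_index < len(levels):
        value, source = levels[hit_index][addr], f"L{hit_index + 1}"
    else:
        value, source = ram[addr], "RAM"

    # Assumption: every faster level that missed receives a copy.
    for cache in levels[:hit_index]:
        cache[addr] = value
    return value, source


l1, l2, l3 = {}, {}, {0x40: 7}
ram = {0x40: 7, 0x80: 9}
levels = [l1, l2, l3]

first = load_through_hierarchy(0x40, levels, ram)    # found in L3
second = load_through_hierarchy(0x40, levels, ram)   # now found in L1
third = load_through_hierarchy(0x80, levels, ram)    # only in RAM
print(first, second, third)  # prints "(7, 'L3') (7, 'L1') (9, 'RAM')"
```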
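One very simplified way to picture this is a write-through scheme, in which a store updates every cache level that holds a copy of the address, as well as RAM. This is an illustrative assumption, not the mechanism of any particular processor: real CPUs commonly use write-back caches together with coherence protocols such as MESI, which this sketch does not model. The function name `store_write_through` is hypothetical.

```python
def store_write_through(addr, value, levels, ram):
    """Update every level holding addr, then RAM; return the places updated."""
    updated = []
    for index, cache in enumerate(levels):
        if addr in cache:              # only levels that hold a copy are touched
            cache[addr] = value
            updated.append(f"L{index + 1}")
    ram[addr] = value                  # RAM always receives the new value
    updated.append("RAM")
    return updated


l1, l2, l3 = {0x10: 1}, {}, {0x10: 1}
ram = {0x10: 1}
result = store_write_through(0x10, 99, [l1, l2, l3], ram)
print(result)  # prints "['L1', 'L3', 'RAM']"
```

After the call no level holds a stale copy of address `0x10`, which is the property a coherence mechanism has to guarantee.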

RISC = Load-Store?


Now that we know what Load/Store units do, some historical context is in order: they are not the only way in which a CPU can access the system's RAM to load and store data.

The Load-Store concept is tied to RISC-type register and instruction sets, where the instruction set is reduced. One way to reduce it is to split memory access out of the other instructions into instructions of its own: since many instructions would otherwise share a similar memory access process, that part is handed over to the Load/Store units.

The consequences are well known: programs for CISC instruction sets end up with a more compact, smaller binary, while programs for RISC instruction sets are larger. Keep in mind that in the early days of computing RAM was very expensive and scarce, so it was important to keep the binary code as small as possible. Today all x86 CPUs are Post-RISC: when decoding x86 instructions, they break them into a series of micro-instructions that allow the CPU to work internally as if it were a RISC CPU.
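The idea of decoding one memory-operand instruction into RISC-like steps can be sketched as below. The instruction encoding is entirely hypothetical: an instruction is a tuple `(op, dst, src)`, a destination given as an integer stands for a memory address, and a string stands for a register. Real x86 micro-instructions are internal to each CPU design and do not look like this.

```python
def decode(instr):
    """Split an 'add [addr], reg' style instruction into load/op/store steps."""
    op, dst, src = instr
    if isinstance(dst, int):  # memory destination: needs the Load/Store units
        return [
            ("load", "tmp", dst),    # tmp <- RAM[dst]
            (op, "tmp", src),        # tmp <- tmp op src (registers only)
            ("store", "tmp", dst),   # RAM[dst] <- tmp
        ]
    return [instr]  # register-only instruction: already RISC-like


def run(micro_ops, regs, ram):
    """Execute the micro-instructions produced by decode()."""
    for op, a, b in micro_ops:
        if op == "load":
            regs[a] = ram[b]
        elif op == "store":
            ram[b] = regs[a]
        elif op == "add":
            regs[a] = regs[a] + regs[b]
        else:
            raise ValueError(f"unknown operation: {op}")


regs = {"r0": 5, "tmp": 0}
ram = {100: 10}
micro_ops = decode(("add", 100, "r0"))   # one compact instruction in the binary
run(micro_ops, regs, ram)
print(len(micro_ops), ram[100])  # prints "3 15"
```

The single instruction in the binary becomes three internal steps, which is the trade-off between code size and simplicity of execution that the paragraph above describes.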

LSUs on GPUs


Yes, GPUs also have Load/Store units. They are found in the Compute Units and are in charge of fetching the data that the ALUs of each Compute Unit operate on. Remember that AMD's Compute Units, Intel's Sub-Slices and NVIDIA's Streaming Multiprocessors are essentially different names for the same thing: the GPU cores where its programs, known colloquially as shaders, run.

The ALUs of a Compute Unit tend to work at the register level most of the time, which means the instruction already has the data it operates on at hand. Some instructions, however, refer to data that is not in the registers, so it has to be looked up through the caches.

The data lookup works the same way as in CPUs: first it checks the data cache of each Compute Unit, then it works its way down until it reaches the end of the memory hierarchy that the GPU can access. This is essential when accessing large data such as texture maps.

Fixed-function units on GPUs and Load-Store units

NVIDIA Turing SM diagram

Some of the units located in the Compute Units use the Load-Store units to communicate with the rest of the GPU. These units are not ALUs but independent fixed-function units, or accelerators. Today two types of units make use of the Load/Store units in a GPU:

  • Texture filtering units
  • The unit in charge of calculating the intersection of rays in Ray Tracing

These units need to access the data cache to obtain the data they take as input parameters to perform their function. The number of Load/Store units in a Compute Unit varies, but it is usually 16 or more, since there are 4 texture units and each requires 4 data values to perform a bilinear filter (4 × 4 = 16).
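To see where the figure of 4 values per texture unit comes from, here is a minimal sketch of a bilinear filter that records every texel it fetches. The function `bilinear_sample`, the `fetch_log` list, and the tiny 2 × 2 texture are hypothetical, and coordinates are assumed to be non-negative and expressed in texels. Real texture units also handle formats, wrapping modes and mipmaps, none of which is modeled here.

```python
def bilinear_sample(texture, u, v, fetch_log):
    """Blend the 4 texels surrounding (u, v); log each texel fetch."""
    x0, y0 = int(u), int(v)
    x1 = min(x0 + 1, len(texture[0]) - 1)   # clamp at the right edge
    y1 = min(y0 + 1, len(texture) - 1)      # clamp at the bottom edge
    fx, fy = u - x0, v - y0                 # fractional position inside the cell

    def fetch(x, y):
        fetch_log.append((x, y))            # one load request per texel
        return texture[y][x]

    t00, t10 = fetch(x0, y0), fetch(x1, y0)
    t01, t11 = fetch(x0, y1), fetch(x1, y1)

    top = t00 * (1 - fx) + t10 * fx
    bottom = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy


texture = [[0.0, 10.0],
           [20.0, 30.0]]
log = []
# Four samples stand in for the 4 texture units working at once.
results = [bilinear_sample(texture, 0.5, 0.5, log) for _ in range(4)]
print(results[0], len(log))  # prints "15.0 16"
```

Each sample needs 4 loads, so 4 samples issued together need 16, which matches the count given above.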

In the same way, the data of the nodes of the BVH trees is stored in the different cache levels. In some specific cases, such as NVIDIA GPUs, Ray Tracing units have an internal LSU that reads from the RT Core’s own L0 cache.