What Architecture Do CPUs for AI Like Tensor Cores or NPUs Use?

Processors dedicated to artificial intelligence (AI) have been gaining ground in recent years, albeit under different names. We have seen them appear in the form of Google's Tensor Processing Unit (TPU), the Tensor Cores of NVIDIA GPUs, or the various Neural Processing Units (NPUs) of different brands. But they all have one point in common: they are systolic arrays. In this article we will explain how these very specific processors work.

With the arrival of artificial intelligence, in recent years different CPU manufacturers and designers have told us about different types of units to perform this function. What if we told you that all these names are really just different commercial labels for the same type of unit?


The Basic CPU for AI: The Systolic Array

Systolic arrays are the basis for understanding how processors for AI work. They consist of a chain, or array, of processing elements, each directly connected to other processing elements through an interface that allows them to communicate with one another in an orderly way.

The first element in the chain is the one that receives the first data and therefore has contact with the I/O interface; this interface can be a memory, another processor for which the systolic array acts as a coprocessor, or even another systolic array. At the other end, the last element in the array is the one that communicates with whatever the systolic array is connected to and writes back the result of the entire joint operation.

Image: AI systolic array

Unlike non-systolic processors, where data is not transmitted directly between the different elements but always passes through the registers, in a systolic system the data is transmitted directly from one processing element, or cell, to the nearest processing elements.

The advantage of all systolic systems is that communication between the processing elements is faster than the chain processing element → register → processing element → register, and so on.

They are called systolic because each interconnected element performs its corresponding operation in a clock cycle and "pumps" the result to the neighboring cells or processing elements, much as a heart pumps blood during systole.
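This "pumping" can be illustrated with a minimal simulation. The sketch below is my own illustration, not any vendor's design: each cell of a one-dimensional chain holds a weight and an input value, and a partial sum is pumped left to right, one cell per clock cycle, until the last cell writes back the result.

```python
# Illustrative sketch of a 1-D systolic chain computing a dot product.
# Each cell stores (w_i, x_i); the partial sum is "pumped" through the
# chain, advancing one cell per simulated clock cycle.

def systolic_dot(weights, xs):
    cells = list(zip(weights, xs))  # cell i holds (w_i, x_i)
    acc = 0                         # partial sum entering the first cell
    for w, x in cells:
        acc = acc + w * x           # one multiply-accumulate per cycle
        # acc is pumped to the neighboring cell for the next cycle
    return acc                      # the last cell writes the result back

print(systolic_dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

In real hardware the chain is pipelined: once a partial sum leaves cell 0, a new computation can enter behind it, so several results are in flight at once.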

Systolic Matrices and Tensors

In the same way, we can also connect the processing elements in a grid and obtain a systolic matrix, whose diagram you can see below:

Image: AI systolic matrix

We can even have a three-dimensional configuration that we call a Tensor.

Image: AI tensor processor

The operation in all of them is the same; the difference is that in matrix and tensor systems we can move the data not only horizontally but also vertically, and even diagonally, in order to perform different types of operations in parallel.
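The horizontal-plus-vertical data movement is exactly what makes matrix multiplication natural on these grids. The following cycle-by-cycle simulation is a simplified sketch of one common organization (output-stationary, with the usual skewed inputs), not a model of any specific chip: rows of A flow rightward, columns of B flow downward, and cell (i, j) accumulates its share of the product.

```python
# Sketch of an output-stationary 2-D systolic array multiplying two
# n x n matrices. With the standard input skew, cell (i, j) sees
# operands a[i][k] and b[k][j] together at cycle i + j + k.

def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]        # one accumulator per cell
    for cycle in range(3 * n - 2):         # last MAC happens at 3(n-1)
        for i in range(n):
            for j in range(n):
                k = cycle - i - j          # which operands reach (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]  # one MAC per cell per cycle
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

Note that an n x n product finishes in O(n) cycles instead of the O(n³) sequential multiply-adds, because every cell works in parallel on every cycle.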

Where does the name Tensor come from?

Tensor Core

A regular three-dimensional array is called a tensor; the name, however, is applied commercially to all these processors, whether their internal organization is a matrix or a true tensor.

Processing element (PE)

The processing elements are usually ALUs with the ability to perform an addition and a multiplication in the same step (a multiply-accumulate, or MAC), but other elements can also serve as processing elements, up to full cores; it is even possible to place one systolic processor inside another.
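As a hypothetical sketch (the class name and interface are my own, not taken from any real design), a processing element can be modeled as a MAC unit with an internal accumulator register that forwards its operands to its neighbors each cycle:

```python
# Hypothetical model of a processing element (PE): a multiply-accumulate
# unit with a local accumulator register.

class ProcessingElement:
    def __init__(self):
        self.acc = 0.0              # local accumulator register

    def step(self, a, b):
        """One clock cycle: multiply and add in a single MAC operation,
        then return the operands so they can be pumped to the neighbors."""
        self.acc += a * b
        return a, b                 # operands flow onward unchanged

pe = ProcessingElement()
for a, b in [(1, 2), (3, 4)]:
    pe.step(a, b)
print(pe.acc)  # 1*2 + 3*4 = 14.0
```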

Utility of systolic systems

Although these processors have become famous for accelerating artificial intelligence algorithms, they have other uses, such as:

  • Image filters (interpolation).
  • Pattern searching.
  • Correlation.
  • Polynomial evaluation.
  • Fourier transforms.
  • Matrix multiplication.
  • Etc.
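One of the listed uses, polynomial evaluation, maps naturally onto a systolic chain: each cell performs one Horner step (multiply by x, then add its stored coefficient) before pumping the value to the next cell. The sketch below is illustrative only.

```python
# Illustrative sketch: polynomial evaluation on a systolic chain, one
# cell per coefficient, each cell doing a single Horner step per cycle.

def systolic_poly(coeffs, x):
    """Evaluate c[0]*x^(n-1) + ... + c[n-1], highest power first."""
    acc = 0
    for c in coeffs:        # each iteration is one cell's clock cycle
        acc = acc * x + c   # Horner step: multiply, add, pump onward
    return acc

print(systolic_poly([2, 0, 1], 3))  # 2*3^2 + 0*3 + 1 = 19
```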

For example, the texture units of GPUs, although they are fixed-function units, are actually configured as systolic arrays. They are not programmable, since their functionality is hard-wired, but this shows that the usefulness of systolic arrays is not limited to AI.

As for AI, their adoption is due to the fact that matrix multiplication is very slow even on the SIMD units found in GPUs or in the CPUs themselves (AVX, SSE ...), so a special type of unit is needed to perform that operation as quickly as possible; hence the adoption of systolic arrays within different processors to speed up AI.