Intel AMX: How Intel Will Use Artificial Intelligence in Its Processors


There is no doubt that specialized units for artificial intelligence have become one of the most important pieces of hardware, especially in the market for post-PC devices, where every SoC includes a unit of this type. That has not been the case on the PC, but the Intel AMX extensions could completely change the situation.

At present, if we have a PC, the only way to get a specialized AI unit is to buy separate hardware: either a GPU from the NVIDIA RTX family or an FPGA card mounted in a PCI Express slot.

The Intel GNA, a precedent


Intel currently ships a built-in unit called the GNA (Gaussian & Neural Accelerator) that can run some AI-based algorithms, but not in the same way as a systolic array, since the GNA is a coprocessor with a SIMD configuration. Intel also sells FPGA-based solutions and, with its Intel Xe GPUs, promises to integrate units in the style of NVIDIA's Tensor Cores.

But what we are talking about here is integrating this type of unit directly into the CPU, so that a greater number of applications can take advantage of it.

An answer to Apple’s M1


One of the advantages of Apple's M1 is not that the ARM register and instruction set is more energy efficient, but that for certain applications and functions its Neural Engine is extremely efficient.

These types of units have become key in the smartphone and tablet market because they allow very complex tasks to be carried out in a short time and with very few resources, which has left PC CPUs lagging behind in that regard.

Intel AMX

Just as the SIMD units brought with them new x86 instructions, the implementation of matrix or tensor units brings with it a new type of instruction, called AMX or Advanced Matrix Extensions, which will be implemented for the first time in the Intel Xeon Sapphire Rapids architecture.

The extension adds two elements: on the one hand, a two-dimensional register file made up of registers called "tiles", and on the other, a series of accelerators capable of operating on those tiles. These accelerators share access to memory coherently with the rest of the CPU and can work interleaved with, and in parallel to, the other x86 execution units.

The accelerator is called Tile Matrix Multiply, or TMUL. It is a systolic array in the form of a mesh of ALUs capable of performing an FMA (fused multiply-add) operation in a single cycle, using as its operands the tile registers we discussed in the previous paragraph.

In AMD's patents, a comparable unit is called the Data Parallel Cluster, and it sits inside each processor core. Although Intel is going to implement TMUL for the first time in Sapphire Rapids, there is no doubt that we will see it in the rest of Intel's processors in the future.