Apple M1: Architecture, News and Features

Ever since Apple announced that it was abandoning Intel CPUs in its Macintosh computers in favor of processors of its own design, the so-called Apple Silicon, the web has been filled with apocalyptic messages about the end of the x86 architecture and the supposed superiority of architectures based on the ARM ISA. But what does Apple’s M1 architecture actually look like, and how does it compare to that of a PC?

On this website we have discussed all kinds of processors, but usually ones compatible with the x86 set of registers and instructions. Given the controversy of recent months around Apple’s M1, we have decided to dedicate an article to its architecture.

Apple M1 Architecture

The Apple M1 is not a CPU, it is a SoC

Apple M1

The first thing to keep in mind is that the Apple M1 is not a CPU like those from Intel or AMD, but a complete SoC that, in addition to the CPU, includes a series of specialized units of different kinds and purposes, which are the following:

  • CPU, which will be the one that we will discuss later in this article.
  • GPU, which processes graphics.
  • Image processing unit or ISP.
  • Digital Signal Processor or DSP, which is used to decompress music files as well as for very complex mathematical operations.
  • Neural Processing Unit, a processor dedicated to AI.
  • Video Encoder and Decoder for playing and storing movies.
  • Data encryption units for security.
  • I/O units that manage external peripherals as well as the information that is sent to them.
  • A large last-level cache, essential for unified memory, called the System Level Cache.

Discussing all of these units would take a book, so we are going to focus exclusively on the CPU in order to answer the question of how it performs relative to the CPUs found in PCs.

When there is no variety in hardware it is easier to optimize programs

PC interior

One of the things that differentiates the PC from other platforms is that each component exists in a thousand different products, which ends up creating an incredible number of configurations. With Apple computers starting with the M1, by contrast, all of the hardware except the RAM and the storage sits on the Apple SoC.

What does this allow? It basically allows applications to be optimized for a single configuration, much as happens on a console, which lives for years on the market and ends up running highly optimized code even five years after its release. On the PC, on the other hand, the freedom of choice in components means that little can be optimized for specific hardware.

On a PC, when we execute a program everything goes to the CPU, yet there are often parts of the code that would run better on units far more specialized than the CPU. The enormous variety of hardware in the PC, however, makes optimizing code to use those other units to accelerate programs a Sisyphean task.

Unified memory

Apple M1+RAM

One of Apple’s secret weapons against the PC is unified memory. First of all we must clarify what this means: unified memory does not simply mean that the different elements share the same memory at a physical level, but that all the elements of the SoC see memory in the same way.

That is, when the GPU modifies a memory address, that data is immediately visible to the rest of the Apple M1’s elements at the same address. On PCs and other architectures that lack unified memory, DMA units must instead copy data from the RAM space assigned to one unit into that of another, which adds latency when executing code and limits how closely the different units can collaborate.

So, thanks to the M1’s unified memory, macOS developers can choose to run parts of their code on whichever unit resolves it faster than the CPU.
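The difference between the two memory models can be sketched with a toy analogy in Python. This is purely illustrative, not real hardware behavior: in the unified model every "unit" aliases the same buffer, while in the copy-based model data must be explicitly copied between per-unit address spaces, which is the role the DMA units play.

```python
# Toy analogy (not real hardware): unified memory vs. copy-based memory.

def unified_model():
    shared = [0] * 4       # one buffer in one shared address space
    cpu_view = shared      # every "unit" aliases the same memory
    gpu_view = shared
    gpu_view[0] = 42       # the GPU writes...
    return cpu_view[0]     # ...and the CPU sees it at once, no copy

def copy_model():
    cpu_ram = [0] * 4      # separate address spaces per unit
    gpu_ram = [0] * 4
    gpu_ram[0] = 42        # the GPU writes to its own space
    cpu_ram[:] = gpu_ram   # an explicit "DMA" copy adds latency
    return cpu_ram[0]

print(unified_model(), copy_model())  # both read 42, but the second
                                      # needed an extra copy step
```

Both paths arrive at the same value; the point is that the copy-based model pays for an extra transfer every time two units want to work on the same data.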

The high-performance CPU of the Apple M1: Firestorm

ARM Laptop

Despite presenting itself as a single multicore CPU, the Apple M1 actually uses two different types of core: on the one hand, low-power but lower-performance cores called Icestorm, and on the other, high-performance but less energy-efficient cores called Firestorm. It is the latter we are going to deal with, since they are the ones with which Apple stands up to high-performance x86 designs.

There are four Firestorm cores in total in the Apple M1, and they are the ones with which Apple has decided to stand up to high-performance PC processors. To understand the reason for their performance, we first have to cover a topic that applies to all CPUs.

Decoders on CPUs out of order

CPU NPU Render

In the first phase of the second stage of the instruction cycle, instructions are converted into microinstructions, which are much simpler and easier to implement in silicon. A microinstruction is not a complete instruction in itself, since it does not represent a full action on its own; rather, several of them in combination form the more complex instructions.

Therefore, internally, no CPU executes the program binary as-is; every one of them first transforms instructions into sets of microinstructions. But it does not end there: in a contemporary processor, execution is out of order, which means the program is not executed in the written sequence, but in the order in which execution units become available.
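The instruction-to-microinstruction split can be sketched as follows. The opcodes and the three-micro-op split are hypothetical, chosen only to illustrate the idea: an x86-style instruction that operates directly on memory breaks into several simpler steps, while a simple register-to-register operation maps almost one-to-one.

```python
# Hypothetical decoder sketch: complex instructions split into
# simpler micro-ops; simple instructions pass through unchanged.

def decode(instruction):
    op, dst, src = instruction
    if op == "add" and dst.startswith("["):    # add [mem], reg
        return [("load",  "tmp", dst),         # 1. read the memory operand
                ("add",   "tmp", src),         # 2. do the arithmetic
                ("store", dst,   "tmp")]       # 3. write the result back
    return [instruction]                       # simple op: one micro-op

print(decode(("add", "[0x100]", "rax")))  # three micro-ops
print(decode(("add", "rbx",     "rax")))  # one micro-op
```

Real decoders emit hardware-specific micro-op formats, of course; the point is only that one architectural instruction may fan out into several internal operations.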

So the first thing the decoder does, once it has converted the instructions into microinstructions, is place them in what we call the reorder buffer. There they sit in a list that records both which execution units become available and each instruction’s position in the original program order. This way the program runs more efficiently, instructions do not have to wait for a specific execution unit to be free, and the results are still written back in the correct program order.
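The key property of the reorder buffer, that micro-ops may *finish* in any order but their results are *committed* in program order, can be shown with a minimal sketch. This is a simplification; a real reorder buffer tracks far more state per entry.

```python
# Sketch of in-order retirement from a reorder buffer (ROB):
# completions arrive out of order, results commit in program order.

def retire_in_order(finished_out_of_order):
    rob = {}          # slot index (program order) -> result
    retired = []
    head = 0          # oldest not-yet-retired slot
    for slot, result in finished_out_of_order:
        rob[slot] = result        # completion can arrive in any order
        while head in rob:        # retire only from the head, in order
            retired.append(rob.pop(head))
            head += 1
    return retired

# Instruction 2 finishes first, but still commits last:
print(retire_in_order([(2, "c"), (0, "a"), (1, "b")]))  # ['a', 'b', 'c']
```

Committing in order is what keeps out-of-order execution invisible to the program: whatever happened inside the pipeline, memory and registers end up as the sequential program demands.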

The secret weapon of the Apple M1’s Firestorm cores: its decoder

Clock speed

The instruction decoding stage is the second stage of the instruction cycle. In any processor that works in parallel, the decoder needs to be able to process several instructions at the same time and send them to the appropriate execution units to be solved.

The M1’s advantage? A decoder capable of handling 8 simultaneous instructions, which makes it the widest processor in this regard: it can feed a greater number of instructions in parallel to its execution units. The reason Apple has been able to do this lies in the nature of the ARM instruction set compared to x86, especially when it comes to decoding.

ARM x86

ARM instructions have the advantage of a fixed size: in the binary code, every fixed-width group of bits is one instruction. x86 instructions, on the other hand, have a variable size, which means the code has to go through several decoding steps before it becomes microinstructions. The consequences? The hardware dedicated to decoding not only ends up taking much more area and consuming more power, but fewer instructions can be decoded simultaneously within the same silicon budget.
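Why fixed-size instructions are so much easier to decode in parallel can be seen with a small boundary-finding sketch (the byte widths here are illustrative, though 4 bytes is indeed the AArch64 instruction size): with fixed-width encoding, the start of instruction i is simply i × width, so eight decoders can each grab their slot independently; with variable-length encoding, each instruction’s start is only known after working through all the instructions before it.

```python
# Illustrative: locating instruction boundaries under the two encodings.

def fixed_boundaries(code_len, width=4):
    # Each start is computable independently -> trivially parallel.
    return [i * width for i in range(code_len // width)]

def variable_boundaries(lengths):
    # Each start depends on the lengths of all prior instructions
    # -> an inherently sequential scan.
    starts, pos = [], 0
    for n in lengths:
        starts.append(pos)
        pos += n
    return starts

print(fixed_boundaries(32))                  # [0, 4, 8, ..., 28]
print(variable_boundaries([1, 3, 2, 6, 4]))  # [0, 1, 4, 6, 12]
```

Real x86 decoders work around this with length predecoding and extra logic, which is exactly the area and power cost the article describes.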

And here we come to the great advantage of the M1. How many full decoders do Intel and AMD CPUs have? On average four, just half as many. This gives the M1’s Firestorm cores the ability to decode twice as many instructions per cycle as Intel and AMD CPUs.

Apple M1 vs Intel and AMD

Apple M1 vs Intel vs AMD

Decoding twice as many instructions does not mean resolving twice as much work: the drawback of ARM-based cores is that they require a greater number of instructions, and therefore clock cycles, to execute the same program. An x86 core of the same width would be considerably more powerful than an ARM one, but it would require far more transistors and an extremely complex, and large, processor.

Over time both AMD and Intel will increase the IPC of their processors, but they are limited by the complexity of the x86 instruction set and its decoder. It is not that they cannot build an x86 CPU with eight decoders; it is that such a chip would be too large to be commercially viable, so they have to wait for new process nodes before increasing the IPC per core.