More cores doesn’t mean that the computer is more powerful

The fact of choosing a processor with more cores often does not translate into a performance increase to the same degree when using certain programs. Why does this phenomenon occur and, therefore, what are the causes of it? We explain it to you in detail.

One of the reasons to use newer versions of programs over time is that they are designed to take better advantage of processors with higher core counts. Let’s not forget that with the passage of time the number of these in the CPUs is increasing. However, why doesn’t performance increase in programs at par?

Programs never scale with the number of cores

It is important to take into account that the programs that are executed do not have the ability to divide their active processes or tasks at any given moment, according to the number of execution threads that we have in our CPU. More than anything due to the fact that this division is explicit in the program code, that is, it is the product of the programmer’s skill and the application’s design.

Actually, what is relevant when coding a program is not to optimize it in order to use the largest number of cores possible, but rather for latency. Understanding the latter as the time it takes a processor to complete a task measured in units of time. And it is that the performance of a CPU consists of completing the most tasks in the shortest possible time. Which will depend on your architecture first and your clock speed second.

However, what interests us regarding latency is knowing how many tasks it can finish in a given period, which is the workload and this will depend on the situation and the way in which the programs have been written. In other words, performance depends not only on the hardware, but on how well or poorly the software has been written.

Division of labor in several cores

Now, if we increase the number of cores in a system, then it becomes possible to break the work into pieces and have it completed much easier. This is where the T/N formula comes in, where T is the number of tasks to perform and N is the number of execution threads that the system can execute. Obviously, we could load the maximum number of jobs on few cores and brute force them to fix them. The problem is that this measure is counterproductive because it benefits the most modern CPUs, which have higher performance individually on each core.

However, dividing the work between different cores is additional work that is usually given to a core that acts as a conductor and has to do the following tasks:

You have to create processes and task lists and have good control of it at all times.
They must know how to predict at all times when a task begins and ends, including the time it takes to finish one and start another.
The different kernels must have the ability to send a signal to the main kernel to know when a process starts and ends.

This solution was adopted by SONY, Toshiba and IBM in the Cell Broadband Engine, the central processor of the PS3 where a master core was in charge of directing the rest. Although much further back it was adopted by the Atari Jaguar. For PS4 SONY did not repeat this model again and nobody has implemented it on the PC because it is a nightmare, however, it is the most efficient way to divide the work.

Not everything can run on multiple cores

If we ask ourselves if we can divide any task into subtasks to distribute in a greater number of cores indefinitely, the answer is no. Specifically, we have to classify tasks into three different types:

Those that can be totally parallelized and, therefore, divided between the different cores that the central processor has.
Tasks that can be run partially in parallel.
Parts of the code that cannot be executed in parallel.

In the first case, T/N is applied to 100%, in the second case, we already enter the so-called Amdahl’s Law where the acceleration due to increasing the number of cores is partial and in the third case we simply need all the power of a single core for that task,

What differentiates the CPU from the GPU in multithreading

Here we come to a differential point, every GPU or graphics chip has a control unit that is responsible for reading the command lists and distributing them among the different GPU cores and even among the different units. This is a hardware-level implementation of the previous case and works perfectly in any configuration where you want to saturate, as long as there is work, and therefore keep as many cores busy as possible. However, we must understand that the concept thread of execution in a GPU always corresponds to a corresponding data and its list of instructions. That is, a pixel, vertex or any data.

Which makes them easy to parallelize. That is, if we wanted to fry an egg, the process in the CPU would be to fry the egg, which would be totally sequential. On the other hand, in the graphics chip, a task would simply be to heat oil or add an egg to the pan. All of this would not speed up frying one egg, but several, which is why GPUs are better for tasks such as calculating millions of polygons or pixels at the same time, but not for sequential tasks.