The Dies of NVIDIA and AMD GPUs Are Not Flat. Why?

Many of us have wondered how two cards with exactly the same GPU model can reach different temperatures and noise levels. Shouldn’t they behave identically at the same room temperature and relative humidity? In theory, yes, but in practice many factors influence the final figures, and one of them is the curvature of AMD and NVIDIA dies. Aren’t they perfectly flat, being made of silicon?

With the increased heat output and power consumption of both the AMD RX 6000 and NVIDIA RTX 3000 GPUs, manufacturers are having a lot of trouble keeping temperatures and noise low and stable. One would assume that the precision of chip manufacturing rules out errors as crude as a curved die, but the reality is quite different: the larger the die, the wider the tolerance that is allowed.

NVIDIA RTX 3000 with wider tolerances than Turing

There are no real reports yet on the new RX 6000 because the cards are too new, but the size of AMD’s dies suggests the problems will be very similar, mainly because TSMC and Samsung do not differ much in how they manufacture wafers.

We tend to focus on the precision of the manufacturing process and overlook everything else. Whichever of these two foundries makes the chip, it then has to be mounted in what is called the “package”.

The process is simple: the die is soldered to the package, and that is where the problems start. As we have said, the silicon itself is not perfect, only nearly so. The wafer is cut with extremely high precision, but attaching the die to the package introduces height differences.

How large are these differences? Laughably small for most people, but enough to drive the engineers at the AIB partners up the wall. It must be stressed that this is not exclusive to the RTX 3000 or RX 6000; it happens on any chip, processors included. On these latest graphics cards the differences are simply more noticeable because of the size of the die.

To measure the curvature of a chip, a reference temperature of 20 ºC is set and a 3D scanner produces a topographic map of the surface. The question is how much distance there is between the lowest point and the highest.
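As a rough illustration of that measurement, the sketch below computes the distance between the lowest and highest points of a scanned surface. The function name `peak_to_valley` and the 3x3 height map are hypothetical and chosen only for illustration; a real 3D scan would contain far more points and would normally be levelled before being evaluated.

```python
def peak_to_valley(height_map_mm):
    """Return the distance (mm) between the lowest and highest scanned points."""
    heights = [h for row in height_map_mm for h in row]
    return max(heights) - min(heights)


# Hypothetical height map in millimetres (illustrative values, not real scan data).
# The corners sit higher than the centre, as on a die that is bowed at the corners.
scan = [
    [0.05, 0.02, 0.04],
    [0.02, 0.00, 0.02],
    [0.04, 0.02, 0.05],
]

curvature_mm = peak_to_valley(scan)
print(f"Peak-to-valley curvature: {curvature_mm:.3f} mm")
```

A single number like this hides where the high points are, which is why the topographic map itself matters to whoever designs the heatsink.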

Both NVIDIA and AMD allow wide tolerances

In the example at hand, the curvature of the die is 0.068 mm, but you can find chips with larger deviations that are still within what NVIDIA considers acceptable for the GA102 (the chip in the image).

The image below shows a curvature of no less than 0.08 mm, which is still within specification but is a challenge for anyone designing water blocks or heatsinks. The problem is that the deformation generally occurs at the corners, which leaves a much larger gap between the heatsink and certain parts of the chip.

In the example below you can clearly see how the die touches the heatsink, represented by the black rectangle and assumed to be perfectly flat, at one corner. If the heatsink were level, that would leave more than double the distance at the opposite corner: 0.156 mm.

As if this were not enough, NVIDIA specifies a tolerance range of up to 0.3 mm at different points on the PCB and chip. That is an enormous gap, and one that cannot be allowed to occur if we want even minimal contact.

Moreover, when a whole PCB is analyzed, it becomes clear that height differences are everywhere and there are almost no perfectly flat areas. In the image below, only the areas painted green indicate a difference of 0.00 mm, that is, perfect flatness.

The rest ranges from very concave to very convex, with some truly worrying points, including on the chip itself.

PCB and mounting problems

Let’s imagine a gap of 0.2 or 0.3 mm in different parts of the PCB or chip. What is the collateral damage? The heatsink and the PCB will be this far apart at some point, so tightening the screws will bend the PCB there. The solution? It is only palliative, and there is not much else to do: use thermal pads and a very dense thermal paste.

It is the cheapest and simplest way to correct the problem, but it means heat is transferred less effectively because of the poor contact. The thicker the thermal pads, the worse they perform, and the hotter the components they cool will run. For this reason, manufacturers no longer rely on a single heatsink for all components and instead use separate pieces alongside the main heatsink.
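To see why thicker pads perform worse, the sketch below applies the standard one-dimensional conduction relation, in which the temperature drop across a pad is the power times thickness divided by conductivity times area. Every number in it (5 W of heat, a 100 mm² pad, 5 W/(m·K) conductivity) is an assumption picked for illustration, not a figure from NVIDIA, AMD, or any pad manufacturer, and the function name is hypothetical. It also ignores contact resistance at the two surfaces, which makes real results worse.

```python
def pad_temperature_rise(power_w, thickness_mm, conductivity_w_mk, area_mm2):
    """Temperature drop (K) across a pad, using dT = P * t / (k * A)."""
    thickness_m = thickness_mm / 1000.0        # mm -> m
    area_m2 = area_mm2 / 1_000_000.0           # mm^2 -> m^2
    resistance_k_per_w = thickness_m / (conductivity_w_mk * area_m2)
    return power_w * resistance_k_per_w


# Assumed values for illustration only: 5 W through a 100 mm^2 pad at 5 W/(m*K).
for thickness in (0.5, 1.0, 2.0):
    rise = pad_temperature_rise(5.0, thickness, 5.0, 100.0)
    print(f"{thickness} mm pad: about {rise:.1f} K across the pad")
```

Under these assumptions the temperature drop scales linearly with thickness, so doubling the gap that the pad has to fill doubles the penalty.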

A midplate, for example, is an intermediate plate that covers the VRAM chips and in some cases the MOSFETs, leaving the GPU to the four screws of the main heatsink block and allowing a better fit. Unfortunately, this is not possible with water blocks, for example, where everything is a solid single piece, and the ever-widening tolerance range drives liquid cooling engineers up the wall.

A notorious case of this problem involved ASUS and AMD: the ASUS cards ran quite hot simply because the brand followed AMD’s height recommendations. We all know the result: new washers to increase mounting pressure, since ASUS had designed everything to AMD’s tolerances, but in practice the heatsink barely made contact and GPU temperatures skyrocketed.

Another case is Hitachi’s 45-watt graphite thermal pad on the Radeon VII, where the height differences with the HBM2 stacks made thermal paste impractical.

CPUs are somewhat similar: the IHS is concave on one side and convex on the other, the dies are mostly soldered, and the tolerances are therefore wider.