
In Haswell, 4.8w of the 22.1w used by the core went to the decoders for integer/ALU instructions[0]. According to this[1] analysis of the entire Ubuntu repository, 89% of all instructions are accounted for by just 12 distinct instructions (all integer/ALU).

From this we can infer that for most normal workloads, almost 22% of the Haswell core power was used in the decoder. As decoders have gotten wider and more complex in recent designs, I see no reason why this wouldn't be just as true for today's CPUs.
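
The arithmetic behind that estimate, using the figures above:

    # decoder share of Haswell core power on int/ALU work (figures from [0])
    decoder_w, core_w = 4.8, 22.1
    print(f"{decoder_w / core_w:.1%}")   # -> ~21.7% of core power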

[0] https://www.usenix.org/system/files/conference/cooldc16/cool...

[1] https://oscarlab.github.io/papers/instrpop-systor19.pdf


menaerus
Misconstrued arguments.

The first paper says that they measured 3% for the floating-point workload and 10% for the integer workload.

> Based on Figure 3, the power consumption of the instruction decoders is very small compared with the other components. Only 3% of the total package power is consumed by the instruction decoding pipeline in this case.

and

> As a result, the instruction decoders end up consuming 10% of the total package power in benchmark #2.

Then the paper goes on to say that the benchmark was completely synthetic - nothing like your extrapolation that the results would apply to the code in the entire Ubuntu repository.

> Nevertheless, we would like to point out that this benchmark is completely synthetic.

And finally the paper says that the typical power draw is expected to be much lower in real-world scenarios. Their microbenchmark is essentially measuring 10% as an upper bound on instruction decoder power draw at close to the highest IPC achievable on that machine - Haswell is a 4-wide decode machine, and 3.86 IPC essentially never happens on real-world code, as they also acknowledge:

> Real applications typically do not reach IPC counts as high as this. Thus, the power consumption of the instruction decoders is likely less than 10% for real applications.

If anything can be concluded from this paper, it is that the power draw of the instruction decoders is 3% when the IPC is 1.67, and that much more closely resembles the IPC figures of real-world programs.

hajile OP
There's a lot to break down here: FP/SIMD vs int/ALU, package vs core power, percentages vs absolute totals, branching vs non-branching code, average IPC, etc.

Let's start with package vs core power. Package power is a terrible metric for core efficiency. The CPU cores on an M3 Max peak at around 50w under a power virus, which is around 40-50% of total package power. The base M3's CPU cores peak at around 21w with a total package power of somewhere around 25-30w, or 70-85% of package power.

Would you then assert that the M3 Max cores are TWICE as power efficient as the M3 cores? Of course not. They are the exact same CPU core design.

Package power changes based on the design and target market of the chip. Core power is the ONLY useful metric here. That number indicates that the decoders use between 8% and 22% of total core power, and this is going to be essentially true whether you are looking at a 30w TDP part or a 300w TDP part.

This ties directly into how percentages vs absolute totals can hide the truth. At 4.8w of decoder power per core, an 8-core Haswell-E 5960X would use 38.4w of its 140w TDP on decoding (or a whopping 27.4% of TDP package power, if you still believe that is relevant). On an 18-core server variant, this would theoretically be 86.4w out of a 165w TDP package power. Even if we cut down to the 1.8w you say is reasonable, that's still 14.4w for the 8-core and 32.4w for the 18-core (still 19.6% of package power).
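
To make the arithmetic explicit, here's a quick sketch using those per-core decoder figures and the stock TDPs quoted above (it ignores turbo behavior, power gating, and uncore draw, so treat it as a rough upper-bound illustration):

    # rough scaling of per-core decoder power up to whole-package TDP
    for decoder_w in (4.8, 1.8):                    # measured int/ALU figure, and the lower 1.8w estimate
        for cores, tdp_w in ((8, 140), (18, 165)):  # 5960X-class and 18-core server-class parts
            total_w = decoder_w * cores
            print(f"{cores} cores x {decoder_w}w = {total_w:.1f}w "
                  f"({total_w / tdp_w:.1%} of {tdp_w}w TDP)")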

Not only is this a significant percentage, but it is also significant in absolute watts. Quoting 3% is just an attempt to hide the truth of an ugly situation. Even if the 3% were true, chip companies spend massive amounts of money for less than 1% power savings, so it would still be important.

Next, let's discuss FP/SIMD vs int/ALU. SIMD takes more execution power than the ALU. This makes sense if you just look at a die shot. All 4 ALUs together are something like 10-20% the size of the SIMD units. When you turn a SIMD unit on, it sucks a lot of power. This is why SIMD throttles so often, and Haswell is no exception. The ALUs are executing 2.3x more instructions while using 2.1x more power (which means the chip is aggressively power-gating most of the SIMD execution hardware).

Notice the cache differences. SIMD is taxing the L1 cache more (4.8w vs 3.8w) and massively taxing the L2/L3 cache (11.2w vs 0.1w). The chip is hitting its power limits and downclocking the CPU core so it can redirect power to the caches. We see this in the FP code using only 4.9w for execution despite the much larger SIMD units, while the ALU code used 10.4w. I'd also note that the power curve matters here: power doesn't scale linearly with clockspeed, so reducing the clockspeed has multiplicative effects on reducing decoder power.
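
A simplified illustration of that power curve (dynamic CMOS power scales roughly with f*V^2, and voltage has to rise with frequency, so power falls much faster than clockspeed; the voltage/frequency points below are made up purely for illustration):

    # toy dynamic-power model: P ~ f * V^2 (V/f points are illustrative, not Haswell data)
    points = ((3.5, 1.10), (3.0, 1.00), (2.5, 0.90))
    base = points[0][0] * points[0][1] ** 2
    for f_ghz, v in points:
        rel_power = (f_ghz * v ** 2) / base
        print(f"{f_ghz} GHz @ {v:.2f} V -> {rel_power:.0%} of peak dynamic power")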

If we compute the decode/execution power ratios, we get 0.37 for SIMD and 0.46 for ALU, which shows that even in this ideal situation, the decoders' relative power draw isn't as low as you are led to believe.
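
For reference, those ratios fall out of the per-workload figures above (1.8w of decode power for the FP run, 4.8w for the integer run):

    # decode power / execution power for each workload
    print(round(1.8 / 4.9, 2), round(4.8 / 10.4, 2))   # -> 0.37 0.46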

Finally, there are 4 ALU ports, but only 2 SIMD ports. In practice, this means that half of the decoders will simply not turn on in this test, or will turn on just long enough to race far ahead of execution filling the uop cache, then turn off.

If the core were not downclocking and there were 4 SIMD ports, the decoder power consumption would be higher than 1.8w.

You are basically correct about average IPC, but wrong about its impact. The SPECint suite averages around 1.6-1.8 instructions/clock on Haswell[0] and is representative of most code out there (which is why ARM designers' focus on very wide chips with very high IPC matters).

What it misses is branches. The CPU can't wait until a branch resolves to start decoding. The branch predictor basically pre-fetches cache lines into the I-cache. The decoders then take the next cache blocks and decode them into uop cache lines. If a branch happens approximately every 5 instructions, cache lines are usually 64 bytes, and the average x86 instruction is about 4.25 bytes long, then you can surmise that both sides of most local branches wind up being decoded even though only one is actually taken. This means that the IPC of the decoders is higher than the IPC of the ALUs.
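
Rough numbers behind that, taking those averages at face value:

    # instructions and branches per fetched 64-byte cache line
    insns_per_line = 64 / 4.25                 # ~15 instructions per line
    branches_per_line = insns_per_line / 5     # ~3 branches per line
    print(round(insns_per_line), round(branches_per_line))   # -> 15 3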

In all cases though, it can be stated pretty clearly that x86 decode isn't "free" and has a significant resource cost attached, in both relative and absolute terms.

[0] https://tosiron.com/papers/2018/SPEC2017_ISPASS18.pdf
