There's a lot to break down here: FP/SIMD vs int/ALU, package vs core power, percentage vs total, branching vs branchless code, average IPC, etc.
Let's start with package vs core power. Package power is a terrible metric for core efficiency. The CPU cores on an M3 Max peak at around 50w under a power virus, which is around 40-50% of total package power. The plain M3's CPU cores peak at around 21w with a total package power of somewhere around 25-30w, or 70-85% of package power.
Would you then assert that the M3 Max cores are TWICE as power efficient as the M3 cores? Of course not. They are the exact same CPU core design.
Package power changes based on the design and target market of the chip. Core power is the ONLY useful metric here. That number indicates that decoders use between 8% and 22% of total core power and this is going to be essentially true whether you are in a 30w TDP or a 300w TDP.
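To make the package-vs-core arithmetic concrete, here's a minimal Python sketch using the rough figures above; the exact package power values are assumptions picked from the stated ranges:

    # Core power as a fraction of package power for the figures quoted above.
    # Package power values are assumptions within the stated ranges.
    chips = {
        "M3":     {"core_w": 21.0, "package_w": 27.0},   # package ~25-30w
        "M3 Max": {"core_w": 50.0, "package_w": 110.0},  # cores ~40-50% of package
    }

    for name, c in chips.items():
        print(f"{name}: cores are {c['core_w'] / c['package_w']:.0%} of package power")

    # Same core design, very different "percent of package" numbers, which is
    # why core power is the meaningful denominator for decoder overhead.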
This ties directly into how quoting a percentage instead of a total hides the truth. At 4.8w of decoder power per core, an 8-core Haswell 5960X would spend 38.4w of its 140w TDP on decoding (or a whopping 27.4% of TDP package power, if you still believe that is relevant). On an 18-core server variant, this would theoretically be 86.4w out of a 165w TDP. Even if we cut down to the 1.8w you say is reasonable, that's still 14.4w for the 8-core and 32.4w for the 18-core (still 19.6% of package power).
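Here's the same back-of-the-envelope scaling as a quick sketch, assuming the per-core decoder power stays constant across SKUs (a simplification, since server parts clock lower):

    # Aggregate decoder power at different per-core estimates and core counts.
    decode_per_core_w = [4.8, 1.8]   # high estimate vs. the "reasonable" 1.8w figure
    parts = {"8-core 5960X": (8, 140), "18-core server": (18, 165)}  # (cores, TDP in w)

    for name, (cores, tdp_w) in parts.items():
        for per_core_w in decode_per_core_w:
            total_w = per_core_w * cores
            print(f"{name}: {per_core_w}w/core -> {total_w:.1f}w decode "
                  f"({total_w / tdp_w:.1%} of TDP)")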
Not only is this a significant percentage, but it is also significant in absolute watts. Quoting 3% is just an attempt to hide the truth of an ugly situation. Even if the 3% were true, chip companies spend massive amounts of money for less than 1% power savings, so it would still be important.
Next, let's discuss FP/SIMD vs int/ALU. SIMD takes more execution power than the ALUs. This makes sense if you just look at a die shot: all 4 ALUs together are something like 10-20% the size of the SIMD units. When you turn a SIMD unit on, it sucks a lot of power, which is why SIMD throttles so often, and Haswell is no exception. In this test, the ALUs execute 2.3x more instructions while using 2.1x more power, which means most of the SIMD execution hardware must be aggressively power-gated.
Notice the cache differences. SIMD taxes the L1 cache more (4.8w vs 3.8w) and massively taxes the L2/L3 cache (11.2w vs 0.1w). The chip is hitting its power limits and downclocking the CPU core so it can redirect power to the caches. We see this in the FP code using 4.9w with the larger SIMD units while the ALU code used 10.4w. I'd also note that the power curve matters here: power doesn't scale linearly with clock speed, so reducing the clock speed has multiplicative effects on reducing decoder power.
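On the power-curve point, a minimal sketch of the standard dynamic-power relationship (P ≈ C·V²·f); the linear voltage/frequency relationship and the coefficients are illustrative assumptions, not Haswell measurements:

    # Dynamic power P = C * V^2 * f. If voltage scales roughly with frequency
    # along the DVFS curve, power grows close to f^3, so a modest downclock
    # frees up a disproportionate amount of power (for the caches, here).
    def dynamic_power(freq_ghz, c=1.0, v_per_ghz=0.3):
        v = v_per_ghz * freq_ghz      # assumed linear V-f relationship
        return c * v * v * freq_ghz

    drop = 1 - dynamic_power(3.2) / dynamic_power(4.0)
    print(f"A 20% downclock cuts dynamic power by ~{drop:.0%}")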
If we compute the decode/execution power ratios, we get 0.37 for SIMD and 0.46 for ALU, which shows that even in this ideal situation, the relative power draw isn't as good as you are led to believe.
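The ratio computation itself, pairing the 1.8w decoder figure with the SIMD run and the 4.8w figure with the ALU run as the numbers above imply:

    # Decode-to-execution power ratios from the figures quoted above.
    cases = {"SIMD/FP": (1.8, 4.9), "int/ALU": (4.8, 10.4)}  # (decode w, execute w)
    for name, (decode_w, execute_w) in cases.items():
        print(f"{name}: decode/execute = {decode_w / execute_w:.2f}")
    # -> roughly 0.37 for SIMD and 0.46 for ALU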
Finally, there are 4 ALU ports but only 2 SIMD ports. In practice, this means that half of the decoders will simply not turn on in this test, or will turn on just long enough to race far ahead in the uop cache and then turn off.
If the core were not downclocking and there were 4 SIMD ports, the decoder power consumption would be higher than 1.8w.
You are basically correct about average IPC, but wrong about its impact. The SPECint suite averages around 1.6-1.8 instructions per clock on Haswell[0] and is representative of most code out there (which is why ARM designers' focus on very wide chips with very high IPC is important).
What it misses is branches. The CPU can't wait until a branch happens to start decoding. The branch predictor effectively prefetches cache lines into the I-cache, and the decoders then take the next cache blocks and decode them into uop cache lines. If a branch happens approximately every 5th instruction, cache lines are usually 64 bytes, and average x86 instructions are 4.25 bytes long, then a cache line holds roughly 15 instructions and about 3 branches, so both sides of most local branches wind up being decoded even though only one is used. This means that the IPC of the decoders is higher than the IPC of the ALUs.
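The arithmetic behind that, as a quick sketch (the 4.25-byte average instruction length and one branch per ~5 instructions are the assumptions stated above):

    # Rough count of branches per decoded cache line.
    cache_line_bytes = 64
    avg_insn_bytes = 4.25        # average x86 instruction length
    branch_every_n = 5           # assumed branch frequency

    insns_per_line = cache_line_bytes / avg_insn_bytes   # ~15 instructions
    branches_per_line = insns_per_line / branch_every_n  # ~3 branches
    print(f"~{insns_per_line:.0f} instructions, ~{branches_per_line:.0f} branches per cache line")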
In all cases though, it can be stated pretty clearly that x86 decode isn't "free" and has a significant resource cost attached both in relative and absolute terms.
[0] https://tosiron.com/papers/2018/SPEC2017_ISPASS18.pdf