Even though this was the case for the most part during the entire history of PPC Macs (I owned two during these years)
https://www.intel.com/content/www/us/en/developer/articles/t...
Their claim that ARM decoders are just as complex wasn't true then and is even less true now. ARM reduced decoder size by 75% from the A710 to the A715 by dropping legacy 32-bit support. Considering that x86 is far more complex than 32-bit ARM, the difference between an x86 and an ARM decoder implementation is massive.
They abuse the decoder power paper (and that paper also draws a conclusion its own data doesn't support). The data shows that some 22% of total core power goes to the decoders on integer/ALU workloads. Since 89% of all instructions across the entire Ubuntu repos are just 12 integer/ALU instructions, we can infer that the power cost of the decoder is significant (I'd consider nearly a quarter of the total power budget significant in any case).
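A quick back-of-envelope makes the inference concrete (just a sketch; weighting by instruction mix and the ~8% decoder share for FP/SIMD workloads are my own assumptions, the latter taken from the FP numbers discussed further down):

    # Rough weighted estimate of the decoder's share of core power.
    int_mix = 0.89            # fraction of instructions that are simple int/ALU (Ubuntu repo figure)
    decoder_share_int = 0.22  # decoder share of core power on int/ALU workloads (paper)
    decoder_share_fp = 0.08   # decoder share on FP/SIMD workloads (assumed)

    weighted = int_mix * decoder_share_int + (1 - int_mix) * decoder_share_fp
    print(f"estimated decoder share of core power: {weighted:.1%}")  # ~20.5%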
The x86 decoder situation has gotten worse with Golden Cove (with 6 decoders) being infamous for its power draw and AMD fearing power draw so much that they opted for a super-complex dual 4-wide decoder setup. If the decoder power didn't matter, they'd be doing 10-wide decoders like the ARM designers.
The claim that ARM uses uops too is somewhere between a red herring and a false equivalency. ARM uops are certainly less complex to create (otherwise they'd have kept the uop cache around), and ARM instructions being inherently less complex means that uop encoding is also going to be simpler for a given uarch compared to x86.
They then have an argument that proves too much when they say ARM has bloat too. If bloat doesn't matter, why did ARM make an entirely new ISA that ditches backward compatibility? Why take any risk to their ecosystem if there's no reward?
They also skip over the fact that objectively bad design exists. NOBODY out there defends branch delay slots. They are universally considered an active impediment to high-performance designs, with ISAs like MIPS going so far as to create duplicate instructions without branch delay slots in order to speed things up. You can't concede that the ISA definitely matters here while also arguing that the ISA never makes any difference at all.
The "all ISAs get bloated over time" is sheer ignorance. x86 has roots going back to the early 1970s before we'd figured out computing. All the basics of CPU design are now stable and haven't really changed in 30+ years. x86 has x87 which has 80-bits because IEEE 754 didn't exist yet. Modern ISAs aren't repeating that mistake. x86 having 8 registers isn't a mistake they are going to make. Neither is 15 different 128-bit SIMD extensions or any of the many other bloated mess-ups x86 has made over the last 50+ years. There may be mistakes, but they are almost certainly going to be on fringe things. In the meantime, the core instructions will continue to be superior to x86 forever.
They also fail to address implementation complexity. Some of the weirdness of x86, like its tighter memory timing rules, gets dragged through the entire system and complicates everything. If this results in just 10% higher cost and 10% longer development time, a RISC company could develop a chip for $5.4B over 4.5 years instead of $6B over 5 years. That represents a massive savings and a much lower opportunity cost, and it gives a compounding head start over their x86 competitor that can be used to either hit the market sooner or make even larger performance jumps each generation.
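Written out (a sketch assuming a hypothetical $6B, 5-year x86-class program and a flat 10% reduction in both cost and schedule):

    # Hypothetical program costs; the flat 10% reduction is the assumption from the argument above.
    x86_cost, x86_years = 6.0e9, 5.0
    risc_cost, risc_years = x86_cost * 0.9, x86_years * 0.9

    print(f"RISC program: ${risc_cost / 1e9:.1f}B over {risc_years:.1f} years")
    print(f"per-generation savings: ${(x86_cost - risc_cost) / 1e9:.1f}B and {x86_years - risc_years:.1f} years")
    # That half-year head start compounds: each generation can ship earlier or spend
    # the extra time on a bigger performance jump.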
Finally, optimizing something like RISC-V code is inherently easier and faster than optimizing x86 code because there is less weirdness to work around. RISC-V basically has one way to do a given thing, and it will always be the optimized way, while x86 often has several ways to do the same thing, each with tradeoffs that only make sense in particular scenarios.
As to PPC, Apple didn't sell enough laptops to pay for Motorola to put enough money into the designs to stay competitive.
Today, Apple MacBooks plus iPhones move nearly 220M chips per year. For comparison, total laptop sales last year were around 260M. If Apple had Motorola make a chip today, Motorola would have the money to build a PPC chip that could compete with and surpass what x86 offers.
This hasn't been true for decades. Mainframes are fast because they have proprietary architectures that are purpose-built for high throughput and redundancy, not because they're RISC. The pre-eminent mainframe architecture these days (z/Architecture) is categorized as CISC.
Processors are insanely complicated these days. Branch prediction, instruction decoding, micro-ops, reordering, speculative execution, cache tiering strategies... I could go on and on but you get the idea. It's no longer as obvious as "RISC -> orthogonal addressing and short instructions -> speed".
From this we can infer that for most normal workloads, almost 22% of the Haswell core power was used in the decoder. As decoders have gotten wider and more complex in recent designs, I see no reason why this wouldn't be just as true for today's CPUs.
[0] https://www.usenix.org/system/files/conference/cooldc16/cool...
The first paper says they measured 3% for the floating-point workload and 10% for the integer workload.
> Based on Figure 3, the power consumption of the instruction decoders is very small compared with the other components. Only 3% of the total package power is consumed by the instruction decoding pipeline in this case.
and
> As a result, the instruction decoders end up consuming 10% of the total package power in benchmark #2.
Then the paper goes on to say that the benchmark was completely synthetic - nothing close to your extrapolation that the results would hold if repeated across the entire Ubuntu repositories.
> Nevertheless, we would like to point out that this benchmark is completely synthetic.
And finally, the paper says that typical power draw is expected to be much lower in real-world scenarios. Their microbenchmark is basically measuring 10% as an upper bound on instruction-decoder power draw at close to the highest IPC achievable on that machine - Haswell is a 4-wide decode machine, and 3.86 IPC never happens on real-world code, as they also acknowledge:
> Real applications typically do not reach IPC counts as high as this. Thus, the power consumption of the instruction decoders is likely less than 10% for real applications.
If anything can be concluded from this paper, it is that the power draw of the instruction decoders is 3% when the IPC is 1.67, and that much more closely resembles the IPC of real-world programs.
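For a rough sense of scale (my own linear interpolation between the paper's two data points, not something the paper itself claims), assuming a typical real-world IPC of around 2:

    # Interpolate decoder share of package power between the paper's measured points.
    ipc_lo, share_lo = 1.67, 0.03   # FP workload measurement
    ipc_hi, share_hi = 3.86, 0.10   # synthetic integer workload measurement
    typical_ipc = 2.0               # assumed "real-world" IPC

    share = share_lo + (share_hi - share_lo) * (typical_ipc - ipc_lo) / (ipc_hi - ipc_lo)
    print(f"estimated decoder share at {typical_ipc} IPC: {share:.1%}")  # ~4.1%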
Let's start with package vs core power. Package power is a terrible metric for core efficiency. The CPU cores on an M3 Max peak at around 50w under a power virus, which is around 40-50% of total package power. The plain M3's CPU cores peak at around 21w against a total package power of somewhere around 25-30w, i.e. 70-85% of package power.
Would you then assert that the M3 Max cores are TWICE as power efficient as the M3 cores? Of course not. They are the exact same CPU core design.
Package power changes based on the design and target market of the chip. Core power is the ONLY useful metric here. That number indicates that decoders use between 8% and 22% of total core power, and this holds essentially true whether you're in a 30w TDP part or a 300w TDP part.
This ties directly into how quoting percentages instead of totals hides the truth. At 4.8w of decoder power per core, an 8-core Haswell 5960X would spend 38.4w of its 140w TDP on decoding (a whopping 27.4% of TDP package power, if you still believe that's the relevant metric). On an 18-core server variant, that would theoretically be 86.4w out of a 165w TDP. Even if we cut down to the 1.8w you say is reasonable, that's still 14.4w for the 8-core and 32.4w for the 18-core (still 19.6% of package power).
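Here's that scaling written out (a sketch using the per-core decoder figures being debated in this thread):

    # Per-core decoder watts scaled across core counts vs. package TDP.
    for per_core_decode_w in (4.8, 1.8):
        for cores, tdp_w in ((8, 140), (18, 165)):
            total = per_core_decode_w * cores
            print(f"{per_core_decode_w}w/core x {cores} cores = {total:.1f}w "
                  f"({total / tdp_w:.1%} of {tdp_w}w TDP)")
    # 4.8w/core: 38.4w (27.4% of 140w) and 86.4w (52.4% of 165w)
    # 1.8w/core: 14.4w (10.3% of 140w) and 32.4w (19.6% of 165w)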
Not only is this a significant percentage, but it is also significant in absolute watts. Quoting 3% is just an attempt to hide the truth of an ugly situation. Even if the 3% were true, chip companies spend massive amounts of money for less than 1% power savings, so it would still be important.
Next, let's discuss FP/SIMD vs int/ALU. SIMD takes more execution power than the ALUs. This makes sense if you just look at a die shot: all 4 ALUs together are something like 10-20% the size of the SIMD units. When a SIMD unit turns on, it sucks a lot of power, which is why SIMD throttles so often, and Haswell is no exception. Here the ALUs are executing 2.3x more instructions while using 2.1x more power (which implies the core is aggressively power-gating most of the SIMD execution units).
Notice the cache differences. SIMD is taxing the L1 cache more (4.8w vs 3.8w) and massively taxing the L2/L3 cache (11.2w vs 0.1w). The chip is hitting its power limits and downclocking the CPU core so it can redirect power to the caches. We see this in the FP code using 4.9w in the larger SIMD units while the ALU code used 10.4w. I'd also note that the power curve matters here: power doesn't scale linearly with clockspeed, so reducing the clockspeed has a multiplicative effect on reducing decoder power.
If we compute the decode-to-execution power ratios, we get .37 for SIMD and .46 for ALU, which shows that even in this ideal situation, the relative power draw isn't as good as you are led to believe.
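The ratio arithmetic, spelled out (same per-core figures as above):

    # Decode vs. execution power, per core, from the figures quoted in this thread.
    decode_fp_w, exec_fp_w = 1.8, 4.9     # FP/SIMD run
    decode_int_w, exec_int_w = 4.8, 10.4  # int/ALU run

    print(f"decode/execute ratio, SIMD workload: {decode_fp_w / exec_fp_w:.2f}")   # ~0.37
    print(f"decode/execute ratio, ALU workload:  {decode_int_w / exec_int_w:.2f}")  # ~0.46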
Finally, there are 4 ALU ports but only 2 SIMD ports. In practice, this means that half of the decoders either don't turn on at all in this test or run just long enough to race ahead into the uop cache and then turn off.
If the core were not downclocking and there were 4 SIMD ports, the decoder power consumption would be higher than 1.8w.
You are basically correct about average IPC, but wrong about its impact. The SpecInt suite averages around 1.6-1.8 instructions/clock on Haswell[0] and is representative of most code out there (which is why the ARM designers' focus on very wide chips with very high IPC matters).
What it misses is branches. The CPU can't wait until a branch resolves to start decoding. The branch predictor basically prefetches cache lines into the I-cache, and the decoders then take the next cache blocks and decode them into uop cache lines. If a branch happens roughly every 5th instruction, cache lines are usually 64 bytes, and the average x86 instruction is 4.25 bytes long, then you can surmise that both sides of most local branches wind up being decoded even though only one side is used. This means that the IPC of the decoders is higher than the IPC of the ALUs.
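The cache-line math behind that claim (a sketch; the averages are the assumptions stated above, not measurements):

    # How many instructions and branches land in a single fetched cache line.
    cache_line_bytes = 64
    avg_insn_bytes = 4.25   # assumed average x86 instruction length
    branch_every_n = 5      # assumed branch frequency

    insns_per_line = cache_line_bytes / avg_insn_bytes   # ~15 instructions
    branches_per_line = insns_per_line / branch_every_n  # ~3 branches
    print(f"~{insns_per_line:.0f} instructions, ~{branches_per_line:.0f} branches per 64-byte line")
    # With ~3 branches per fetched line, the decoders routinely decode past branch
    # targets on paths that never retire, so decoder throughput exceeds ALU IPC.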
In all cases though, it can be stated pretty clearly that x86 decode isn't "free" and has a significant resource cost attached both in relative and absolute terms.
ARM is great. Those M-series Macs are the only machines I could buy used and put Linux on.