Even though this was the case for the most part during the entire history of PPC Macs (I owned two during these years)
https://www.intel.com/content/www/us/en/developer/articles/t...
Their claim that ARM decoders are just as complex wasn't true then and is even less true now. ARM reduced decoder size by 75% from the A710 to the A715 by dropping legacy 32-bit support. Considering that x86 is far more complex than 32-bit ARM, the difference between an x86 and an ARM decoder implementation is massive.
They abuse the decoder power paper (and that paper also draws a conclusion its own data doesn't support). The data shows that some 22% of total core power goes to the decoders on integer/ALU workloads. Since 89% of all instructions across the entire Ubuntu repos are just 12 integer/ALU instructions, we can infer that the power cost of the decoder is significant (I'd consider nearly a quarter of the total power budget significant in any case).
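A quick back-of-envelope makes the inference concrete (just a sketch; weighting by instruction mix and the ~8% decoder share for FP/SIMD workloads are my own assumptions, the latter taken from the FP numbers discussed further down):

    # Rough weighted estimate of the decoder's share of core power.
    int_mix = 0.89            # fraction of instructions that are simple int/ALU (Ubuntu repo figure)
    decoder_share_int = 0.22  # decoder share of core power on int/ALU workloads (paper)
    decoder_share_fp = 0.08   # decoder share on FP/SIMD workloads (assumed)

    weighted = int_mix * decoder_share_int + (1 - int_mix) * decoder_share_fp
    print(f"estimated decoder share of core power: {weighted:.1%}")  # ~20.5%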
The x86 decoder situation has gotten worse with Golden Cove (with 6 decoders) being infamous for its power draw and AMD fearing power draw so much that they opted for a super-complex dual 4-wide decoder setup. If the decoder power didn't matter, they'd be doing 10-wide decoders like the ARM designers.
The claim that ARM uses uops too is somewhere between a red herring and a false equivalency. ARM uops are certainly less complex to create (otherwise they'd have kept the uop cache around), and ARM instructions being inherently less complex means that uop encoding is also going to be simpler for a given uarch compared to x86.
They then have an argument that proves too much when they say ARM has bloat too. If bloat doesn't matter, why did ARM make an entirely new ISA that ditches backward compatibility? Why take any risk to their ecosystem if there's no reward?
They also skip over the fact that objectively bad design exists. NOBODY out there defends branch delay slots. They are universally considered an active impediment to high-performance designs, with ISAs like MIPS going so far as to create duplicate instructions without branch delay slots in order to speed things up. You can't concede that the ISA definitely matters here while also arguing that the ISA never makes any difference at all.
The "all ISAs get bloated over time" is sheer ignorance. x86 has roots going back to the early 1970s before we'd figured out computing. All the basics of CPU design are now stable and haven't really changed in 30+ years. x86 has x87 which has 80-bits because IEEE 754 didn't exist yet. Modern ISAs aren't repeating that mistake. x86 having 8 registers isn't a mistake they are going to make. Neither is 15 different 128-bit SIMD extensions or any of the many other bloated mess-ups x86 has made over the last 50+ years. There may be mistakes, but they are almost certainly going to be on fringe things. In the meantime, the core instructions will continue to be superior to x86 forever.
They also fail to address implementation complexity. Some of the weirdness of x86, like its tighter memory timing rules, gets dragged through the entire system and complicates everything. If this results in just 10% higher cost and 10% longer development time, a RISC company could develop a chip for $5.4B over 4.5 years instead of $6B over 5 years. That represents a massive savings and a much lower opportunity cost, and it gives a compounding head start over their x86 competitor that can be used to either hit the market sooner or make even larger performance jumps each generation.
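Written out (a sketch assuming a hypothetical $6B, 5-year x86-class program and a flat 10% reduction in both cost and schedule):

    # Hypothetical program costs; the flat 10% reduction is the assumption from the argument above.
    x86_cost, x86_years = 6.0e9, 5.0
    risc_cost, risc_years = x86_cost * 0.9, x86_years * 0.9

    print(f"RISC program: ${risc_cost / 1e9:.1f}B over {risc_years:.1f} years")
    print(f"per-generation savings: ${(x86_cost - risc_cost) / 1e9:.1f}B and {x86_years - risc_years:.1f} years")
    # That half-year head start compounds: each generation can ship earlier or spend
    # the extra time on a bigger performance jump.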
Finally, optimizing something like RISC-V code is inherently easier and faster than optimizing x86 code because there is less weirdness to work around. RISC-V basically has one way to do a given thing, and it will always be the optimized way, while x86 often has several ways to do the same thing, each with tradeoffs that only make sense in particular scenarios.
As to PPC, Apple didn't sell enough laptops to pay for Motorola to put enough money into the designs to stay competitive.
Today, Apple MacBooks plus iPhones move nearly 220M chips per year. For comparison, total laptop sales last year were around 260M. If Apple had Motorola make a chip today, Motorola would have the money to build a PPC chip that could compete with and surpass what x86 offers.
This hasn't been true for decades. Mainframes are fast because they have proprietary architectures that are purpose-built for high throughput and redundancy, not because they're RISC. The pre-eminent mainframe architecture these days (z/Architecture) is categorized as CISC.
Processors are insanely complicated these days. Branch prediction, instruction decoding, micro-ops, reordering, speculative execution, cache tiering strategies... I could go on and on but you get the idea. It's no longer as obvious as "RISC -> orthogonal addressing and short instructions -> speed".
From this we can infer that for most normal workloads, almost 22% of the Haswell core power was used in the decoder. As decoders have gotten wider and more complex in recent designs, I see no reason why this wouldn't be just as true for today's CPUs.
[0] https://www.usenix.org/system/files/conference/cooldc16/cool...
The first paper says they measured 3% for the floating-point workload and 10% for the integer workload.
> Based on Figure 3, the power consumption of the instruction decoders is very small compared with the other components. Only 3% of the total package power is consumed by the instruction decoding pipeline in this case.
and
> As a result, the instruction decoders end up consuming 10% of the total package power in benchmark #2.
Then the paper goes on to say that the benchmark was completely synthetic - nothing close to your extrapolation that the results would hold if repeated across the entire Ubuntu repositories.
> Nevertheless, we would like to point out that this benchmark is completely synthetic.
And finally, the paper says that typical power draw is expected to be much lower in real-world scenarios. Their microbenchmark is basically measuring 10% as an upper bound on instruction-decoder power draw at close to the highest IPC achievable on that machine - Haswell is a 4-wide decode machine, and 3.86 IPC never happens on real-world code, as they also acknowledge:
> Real applications typically do not reach IPC counts as high as this. Thus, the power consumption of the instruction decoders is likely less than 10% for real applications.
If anything can be concluded from this paper, it is that the power draw of the instruction decoders is 3% when the IPC is 1.67, and that much more closely resembles the IPC of real-world programs.
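For a rough sense of scale (my own linear interpolation between the paper's two data points, not something the paper itself claims), assuming a typical real-world IPC of around 2:

    # Interpolate decoder share of package power between the paper's measured points.
    ipc_lo, share_lo = 1.67, 0.03   # FP workload measurement
    ipc_hi, share_hi = 3.86, 0.10   # synthetic integer workload measurement
    typical_ipc = 2.0               # assumed "real-world" IPC

    share = share_lo + (share_hi - share_lo) * (typical_ipc - ipc_lo) / (ipc_hi - ipc_lo)
    print(f"estimated decoder share at {typical_ipc} IPC: {share:.1%}")  # ~4.1%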
Let's start with package vs core power. Package power is a terrible metric for core efficiency. The CPU cores on an M3 Max peak at around 50w under a power virus, which is around 40-50% of total package power. The plain M3's CPU cores peak at around 21w against a total package power of somewhere around 25-30w, i.e. 70-85% of package power.
Would you then assert that the M3 Max cores are TWICE as power efficient as the M3 cores? Of course not. They are the exact same CPU core design.
Package power changes based on the design and target market of the chip. Core power is the ONLY useful metric here. That number indicates that decoders use between 8% and 22% of total core power, and this holds essentially true whether you're in a 30w TDP part or a 300w TDP part.
This ties directly into how quoting percentages instead of totals hides the truth. At 4.8w of decoder power per core, an 8-core Haswell 5960X would spend 38.4w of its 140w TDP on decoding (a whopping 27.4% of TDP package power, if you still believe that's the relevant metric). On an 18-core server variant, that would theoretically be 86.4w out of a 165w TDP. Even if we cut down to the 1.8w you say is reasonable, that's still 14.4w for the 8-core and 32.4w for the 18-core (still 19.6% of package power).
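Here's that scaling written out (a sketch using the per-core decoder figures being debated in this thread):

    # Per-core decoder watts scaled across core counts vs. package TDP.
    for per_core_decode_w in (4.8, 1.8):
        for cores, tdp_w in ((8, 140), (18, 165)):
            total = per_core_decode_w * cores
            print(f"{per_core_decode_w}w/core x {cores} cores = {total:.1f}w "
                  f"({total / tdp_w:.1%} of {tdp_w}w TDP)")
    # 4.8w/core: 38.4w (27.4% of 140w) and 86.4w (52.4% of 165w)
    # 1.8w/core: 14.4w (10.3% of 140w) and 32.4w (19.6% of 165w)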
Not only is this a significant percentage, but it is also significant in absolute watts. Quoting 3% is just an attempt to hide the truth of an ugly situation. Even if the 3% were true, chip companies spend massive amounts of money for less than 1% power savings, so it would still be important.
Next, let's discuss FP/SIMD vs int/ALU. SIMD takes more execution power than the ALUs. This makes sense if you just look at a die shot: all 4 ALUs together are something like 10-20% the size of the SIMD units. When a SIMD unit turns on, it sucks a lot of power, which is why SIMD throttles so often, and Haswell is no exception. Here the ALUs are executing 2.3x more instructions while using 2.1x more power (which implies the core is aggressively power-gating most of the SIMD execution units).
Notice the cache differences. SIMD is taxing the L1 cache more (4.8w vs 3.8w) and massively taxing the L2/L3 cache (11.2w vs 0.1w). The chip is hitting its power limits and downclocking the CPU core so it can redirect power to the caches. We see this in the FP code using 4.9w in the larger SIMD units while the ALU code used 10.4w. I'd also note that the power curve matters here: power doesn't scale linearly with clockspeed, so reducing the clockspeed has a multiplicative effect on reducing decoder power.
If we compute the decode-to-execution power ratios, we get .37 for SIMD and .46 for ALU, which shows that even in this ideal situation, the relative power draw isn't as good as you are led to believe.
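The ratio arithmetic, spelled out (same per-core figures as above):

    # Decode vs. execution power, per core, from the figures quoted in this thread.
    decode_fp_w, exec_fp_w = 1.8, 4.9     # FP/SIMD run
    decode_int_w, exec_int_w = 4.8, 10.4  # int/ALU run

    print(f"decode/execute ratio, SIMD workload: {decode_fp_w / exec_fp_w:.2f}")   # ~0.37
    print(f"decode/execute ratio, ALU workload:  {decode_int_w / exec_int_w:.2f}")  # ~0.46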
Finally, there are 4 ALU ports but only 2 SIMD ports. In practice, this means that half of the decoders either don't turn on at all in this test or run just long enough to race ahead into the uop cache and then turn off.
If the core were not downclocking and there were 4 SIMD ports, the decoder power consumption would be higher than 1.8w.
You are basically correct about average IPC, but wrong about its impact. The SpecInt suite averages around 1.6-1.8 instructions/clock on Haswell[0] and is representative of most code out there (which is why the ARM designers' focus on very wide chips with very high IPC matters).
What it misses is branches. The CPU can't wait until a branch resolves to start decoding. The branch predictor basically prefetches cache lines into the I-cache, and the decoders then take the next cache blocks and decode them into uop cache lines. If a branch happens roughly every 5th instruction, cache lines are usually 64 bytes, and the average x86 instruction is 4.25 bytes long, then you can surmise that both sides of most local branches wind up being decoded even though only one side is used. This means that the IPC of the decoders is higher than the IPC of the ALUs.
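The cache-line math behind that claim (a sketch; the averages are the assumptions stated above, not measurements):

    # How many instructions and branches land in a single fetched cache line.
    cache_line_bytes = 64
    avg_insn_bytes = 4.25   # assumed average x86 instruction length
    branch_every_n = 5      # assumed branch frequency

    insns_per_line = cache_line_bytes / avg_insn_bytes   # ~15 instructions
    branches_per_line = insns_per_line / branch_every_n  # ~3 branches
    print(f"~{insns_per_line:.0f} instructions, ~{branches_per_line:.0f} branches per 64-byte line")
    # With ~3 branches per fetched line, the decoders routinely decode past branch
    # targets on paths that never retire, so decoder throughput exceeds ALU IPC.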
In all cases though, it can be stated pretty clearly that x86 decode isn't "free" and has a significant resource cost attached both in relative and absolute terms.
ARM is great. Those M-series Macs are the only machines I could buy used and put Linux on.