In these designs, the actual memory controller that talks to the RAM is part of an internal fabric, and the fabric link between the core and the memory controller is (technically) your upper limit.
For both Intel and AMD, the width of that fabric link is sized in proportion to the expected performance of the cores behind it: the theoretical throughput of a core's load/store units stays in roughly the same ratio to its link whether it's a big core or a little core. (For a rough sense of scale, with numbers that are illustrative rather than from any one spec sheet: a Zen 4 CCD's link is commonly quoted at 32 bytes/cycle of read bandwidth at fabric clock, i.e. about 64 GB/s at FCLK = 2000 MHz, which is well under the ~96 GB/s that dual-channel DDR5-6000 can supply.)
Also, note: the maximum throughput of the load/store units is your actual upper limit, period. Some CPUs historically never achieved their theoretical maximum because those units were never engaged optimally; sometimes that's because certain ports on the load/store units are only reachable by certain instructions (often because they're effectively reserved for SIMD; this is why memcpy implementations often use SSE/AVX, precisely to exploit that).
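To make that concrete, here's a minimal sketch of the pattern (not any real libc's memcpy, which adds alignment handling, non-temporal stores for large copies, and runtime feature dispatch): move the bulk of the data through 32-byte AVX loads and stores so the wide SIMD load/store ports actually get used. Assumes an AVX-capable x86 part and compiling with -mavx:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: copy via 32-byte AVX loads/stores so the wide
 * SIMD load/store ports do the work instead of narrow scalar ops. */
static void copy_avx(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (n >= 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, v);
        s += 32;
        d += 32;
        n -= 32;
    }
    memcpy(d, s, n);   /* scalar tail for the last <32 bytes */
}
```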
That said, load/store performance usually approaches the core's theoretical L2 bandwidth, which is more than any core can generally get out of its fabric link. Ergo, the fabric link is usually what governs the numbers you see in situations like this.
On both Intel's and AMD's cluster designs, the memory controller serving a core cluster needs anywhere from 2 to 4 cores saturating their links before it reaches peak throughput. And sibling threads on the same core compete for that core's link, so it isn't mere thread count that gets you there, but actual core saturation.
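If you want to see both effects on your own machine, a bandwidth microbenchmark that pins N threads to distinct cores and streams through buffers far larger than L3 will show aggregate bandwidth climbing until the memory controller saturates, typically at 2-4 cores. A minimal sketch, assuming Linux and pthreads, and assuming logical CPUs 0..N-1 are distinct physical cores (on many topologies SMT siblings interleave, so adjust the pinning; deliberately pointing two threads at siblings of one core should instead show them splitting that core's link):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES   (256UL << 20)  /* 256 MiB per thread: far bigger than L3 */
#define PASSES      8
#define MAX_THREADS 64

static uint64_t *bufs[MAX_THREADS];
static _Atomic uint64_t sink;      /* keeps the read loop from being deleted */

static void *reader(void *arg)
{
    long id = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)id, &set);        /* thread id -> core id: machine-specific */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    const uint64_t *buf = bufs[id];
    uint64_t sum = 0;
    for (int p = 0; p < PASSES; p++)
        for (size_t i = 0; i < BUF_BYTES / 8; i++)
            sum += buf[i];         /* pure streaming reads */
    sink += sum;
    return NULL;
}

int main(int argc, char **argv)
{
    int n = argc > 1 ? atoi(argv[1]) : 1;
    if (n > MAX_THREADS) n = MAX_THREADS;
    pthread_t th[MAX_THREADS];

    /* Allocate and fault pages in before the timed region. */
    for (int i = 0; i < n; i++) {
        bufs[i] = malloc(BUF_BYTES);
        for (size_t j = 0; j < BUF_BYTES / 8; j++) bufs[i][j] = j;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++) pthread_create(&th[i], NULL, reader, (void *)i);
    for (int i = 0; i < n; i++)  pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d core(s): %.1f GB/s aggregate read\n",
           n, (double)n * PASSES * BUF_BYTES / 1e9 / secs);
    return 0;
}
```

Run it with 1, 2, 4, ... threads and watch where the aggregate curve flattens; that knee is the memory controller (or the cluster's fabric link) topping out, not the cores.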
On a toy benchmark like the one proposed in the linked article, a single process piping data to another, whether both processes land on the same big core as hyperthread siblings, or on two sibling little cores in the same cluster serviced by the same memory controller, the upper limit should approximate full use of memory bandwidth; though in some cases, on some architectures, it will actually approximate L3 bandwidth, which is higher.
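For reference, the naive baseline of that experiment (nothing like the article's tuned version, just plain write/read through a pipe) is a few lines of C; run it under e.g. taskset -c 0,1 to confine the pair to whichever two logical CPUs you want to compare, keeping in mind which pairs are SMT siblings on your topology:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (256 * 1024)        /* per-syscall buffer size */
#define TOTAL (8UL << 30)         /* push 8 GiB through the pipe */

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) { perror("pipe"); return 1; }

    if (fork() == 0) {            /* child: the writer half */
        close(fds[0]);
        char *buf = calloc(1, CHUNK);
        for (uint64_t sent = 0; sent < TOTAL; ) {
            ssize_t w = write(fds[1], buf, CHUNK);
            if (w < 0) { perror("write"); return 1; }
            sent += (uint64_t)w;
        }
        return 0;                 /* exiting closes the write end */
    }

    close(fds[1]);                /* parent: the reader half */
    char *buf = malloc(CHUNK);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    uint64_t got = 0;
    ssize_t r;
    while ((r = read(fds[0], buf, CHUNK)) > 0)
        got += (uint64_t)r;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s through the pipe\n", got / 1e9 / secs);
    wait(NULL);
    return 0;
}
```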
Also, as a side note: little cores aren't little. For a little more silicon and a little less power, two little cores approximate one big core with two threads executing optimally. That holds even for Intel's surprisingly good small-core design, and is very much true of Zen 4c. As in, I could buy a "whoops, all little cores" CPU of sufficient size for my desktop and still be happy (or possibly even happier).