Going back to AnandTech's SPEC testing of the M1 Max, I think there's a clue there as to the wildly divergent performance results across different benchmarks: reliance on integer vs. floating point math, as well as memory bandwidth. In AT's testing, the M1 Max (8P+2E, 10t) outperformed the 8c16t Ryzen 7 5800X by ~4.7% in the SPECint suite (53.38 vs. 50.98 points), but delivered only 64% of the 5950X's performance (83.13 points). In SPECfp, on the other hand, the M1 Max outperformed the 5800X by 72.1% (81.07 vs. 47.10 points) and even trounced the 5950X by 25.9% (64.39 points). Apple's scores drop somewhat if the Icestorm efficiency cores are excluded, to 48.57 and 75.67 points, which puts it behind the 5800X in integer but still ahead of the 5950X in floating point. AT notes that "The fp2017 suite has more workloads that are more memory-bound". In one SPECfp nT subtest, the M1 Max beats the Ryzen 9 5980HS (mobile, 35W, 8c16t) by a staggering 4.8 times. It's an outlier, but it illustrates what can happen in a memory-bound edge case.
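For clarity, here's a quick Python sketch reproducing those percentages from the quoted scores (the variable names are mine, the scores are AT's):

```python
# SPEC2017 estimated scores as quoted above (AnandTech's numbers)
m1max_int, r5800x_int, r5950x_int = 53.38, 50.98, 83.13
m1max_fp,  r5800x_fp,  r5950x_fp  = 81.07, 47.10, 64.39

print(f"int: M1 Max vs 5800X: {m1max_int / r5800x_int - 1:+.1%}")      # ~+4.7%
print(f"int: M1 Max vs 5950X: {m1max_int / r5950x_int:.0%} of 5950X")  # ~64%
print(f"fp:  M1 Max vs 5800X: {m1max_fp / r5800x_fp - 1:+.1%}")        # ~+72.1%
print(f"fp:  M1 Max vs 5950X: {m1max_fp / r5950x_fp - 1:+.1%}")        # ~+25.9%
```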
Also worth noting: AT measured per-core memory bandwidth on the M1 to be significantly lower than the total bandwidth of the memory interface. This is also true for other architectures (no single core can saturate the full interface), but it makes direct bandwidth comparisons troublesome.
Still, this tells us several things:
- The Apple Firestorm cores (1t each) are ever so slightly behind AMD's Zen 3 cores (2t each) in integer workloads; core for core, the two are nearly matched.
- The Apple Firestorm cores have a massive advantage over Zen 3 in floating point workloads, at least those present in the SPEC suite, delivering more than 2x the performance per core, with the caveat that these workloads are more memory-bound.
- nT scaling is quite different across the architectures and workloads: the 5950X scales 10.87x from 1t to 32t in int and 5.3x in fp. The 5800X is sadly not in the 1t chart, but should be marginally slower than the 5950X (200-300 MHz lower boost). If the two were identical at 1t, the 5800X's scaling would be 6.7x and 3.9x from 1t to 16t, so its real figures are likely a tad higher. The M1 Max scales by 7.1x and 6.3x from 1t to 10t (see the sketch after this list for the arithmetic). Thread scaling comparisons are made difficult by Apple's big.LITTLE architecture and AMD's SMT, but it does seem that the higher memory bandwidth helps Apple scale better with additional cores to some extent (though this could also be affected by many other factors, including software). Or, formulated the other way around: AMD's MSDT platform is significantly held back by memory bandwidth.
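To make those scaling factors somewhat comparable, here's a rough Python sketch normalizing them per hardware thread. It's a deliberately crude metric (it ignores boost-clock behavior, SMT yield, and the E-cores' lower throughput), but it shows the pattern:

```python
# Quoted nT/1T scaling factors, normalized per hardware thread.
# Crude on purpose: ignores boost clocks, SMT yield, and E-core throughput.
cases = [
    ("5950X int", 10.87, 32),
    ("5950X fp",   5.30, 32),
    ("M1 Max int", 7.10, 10),
    ("M1 Max fp",  6.30, 10),
]
for name, factor, threads in cases:
    print(f"{name:>10}: {factor / threads:.2f}x of 1t perf retained per thread")
```

By this blunt measure, the M1 Max retains roughly twice as much of its 1t performance per thread as the 5950X in int, and nearly four times as much in fp, which is at least consistent with the bandwidth story.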
How does this affect the discussions here? It's well documented that Cinebench doesn't care much about memory bandwidth. It likes higher IF and memory clocks on AMD, but it scales poorly with additional memory channels, which suggests latency matters more to it than bandwidth. Other workloads are quite different, but those tend to be quite specialized. Essentially no consumer workloads are particularly bandwidth-limited; most care more about compute power or latency. (A rough way to probe bandwidth sensitivity yourself is sketched below.)
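If you want a feel for raw memory bandwidth on your own machine, a STREAM-style triad is the classic probe. A minimal Python/NumPy sketch; the array size and byte accounting are rough assumptions, not a rigorous benchmark:

```python
import time
import numpy as np

# STREAM-style triad: c = a + 3.0 * b over arrays far larger than any cache,
# so runtime is dominated by DRAM traffic rather than compute.
N = 100_000_000          # ~0.8 GB per float64 array; shrink if RAM is tight
a = np.ones(N)
b = np.ones(N)
c = np.empty(N)

t0 = time.perf_counter()
np.multiply(b, 3.0, out=c)   # c = 3.0 * b (reads b, writes c)
c += a                       # c += a      (reads c and a, writes c)
elapsed = time.perf_counter() - t0

# Rough byte accounting: 2 arrays touched in the first pass, 3 in the second
bytes_moved = 5 * N * 8
print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s effective bandwidth")
```

Run it single-threaded, then as several parallel copies, and you'll see the per-core vs. whole-chip bandwidth gap AT describes.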
Does this mean Cinebench is a more, or less, reliable benchmark? Depends what you're looking for. It's clear that Apple has designed the larger M1 chips for bandwidth-hungry applications (though the large GPU sharing that interface needs the bandwidth too, of course). So if your workloads are bound by memory bandwidth, the M1 Ultra and its siblings are likely to deliver staggeringly good performance and efficiency. If not? Then it's likely merely competitive with rival CPUs with a similar number of cores (not threads), though YMMV depending on the workload.
Does this mean Apple lied? Again, no. There's no doubt that the M1 Ultra is the most powerful chip ever made for a PC. Not as a CPU, not as a GPU, but in sum.
Being a single SoC has some great advantages, particularly in efficiency (near-zero interconnect power, far less embodied energy in duplicated componentry), but also to some extent in latency-sensitive performance scenarios. But as you say, it's also inflexible, and those advantages don't necessarily apply to all workloads. Both approaches have distinct pros and cons. You can't get past the fact that a tightly integrated package will always be more efficient than a collection of discrete components (as long as they're otherwise comparable in terms of architectural efficiency). There's a reason a 5W smartphone delivers a lot more than 1/20th the performance of a 100W laptop. That doesn't invalidate the value or performance of the laptop; after all, it delivers a degree of performance impossible in a smartphone.
As for that FP64 comparison: according to this source, the M1 GPU architecture has 1/4-rate FP64, so it should deliver ~2.6 TF/s of FP64 (assuming the 10.4 TF FP32 numbers online are accurate). For comparison, consumer Ampere has 1/32-rate FP64 and delivers ~1.21 TF/s of FP64. Not that this necessarily matters much: the reason Ampere performs poorly in FP64 is that double precision floating point math has become ever more of a niche application, and Nvidia thus doesn't prioritize it at all outside of their datacenter GA100 chips, which have 1:2 FP64 (the A100 80GB SXM4 delivers ~9.746 TF/s of FP64). As such, it seems Apple cares more about FP64 than Nvidia does outside the datacenter, likely indicating that pro applications common on macOS lean on it a bit more than most PC applications do (though it might also just be a small but profitable subset of apps, with Apple catering to a niche that pays well).
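The arithmetic behind those figures is just peak FP32 throughput times the architecture's FP64 execution-rate ratio. A quick sketch; the FP32 peaks are the commonly cited numbers, taken here as assumptions:

```python
# FP64 peak = FP32 peak x the architecture's FP64 execution-rate ratio.
# FP32 peaks below are commonly cited figures, treated as assumptions.
def fp64_tflops(fp32_tflops: float, fp64_ratio: float) -> float:
    return fp32_tflops * fp64_ratio

print(f"M1 Max GPU:      {fp64_tflops(10.4, 1/4):.2f} TF/s")   # ~2.60
print(f"Consumer Ampere: {fp64_tflops(38.7, 1/32):.2f} TF/s")  # ~1.21 (3090-class FP32 peak)
print(f"A100 (GA100):    {fp64_tflops(19.5, 1/2):.2f} TF/s")   # ~9.75
```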