
Intel "Raptor Lake" is a 24-core (8 Big + 16 Little) Processor

Aquinus

Resident Wat-man
Quote:
You are really grasping at straws here.
Doing the same work without AVX (or other SIMD) would usually require >20x the instructions, and you want to offset that extra power by running the core at a very low clock speed, probably making it about 100x slower. This is not a very realistic usage scenario.
The fact remains that AVX is more power efficient.
Once again, you're assuming the vector is completely occupied, which is a bad assumption. I'm not saying you're wrong, I'm saying your premise is bad. You're also assuming that the increased time to execute is going to harm performance when you have no idea if it's the bottleneck for the application. If it's running on a low-power core, it's probably not the bottleneck; otherwise it wouldn't be running there. The reality is that the case you're describing is the case that would already be hitting the high-power cores.
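For anyone following along, here is a minimal sketch of what both sides are describing, using AVX intrinsics in C (the function names, the multiple-of-8 assumption, and the -mavx flag are illustrative, not from either post). One 256-bit add processes eight floats at once, which is where the instruction-count savings come from:

```c
#include <immintrin.h>  // AVX intrinsics (compile with -mavx)

// Scalar version: one load and one add retired per element.
float sum_scalar(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

// AVX version: one 256-bit add covers 8 floats, so roughly
// 1/8th the arithmetic instructions for the same work.
// Assumes n is a multiple of 8 to keep the sketch short; real
// code needs a scalar tail for the remainder, which is exactly
// where partial occupancy shows up.
float sum_avx(const float *a, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);  // spill the 8 partial sums
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + lanes[4] + lanes[5] + lanes[6] + lanes[7];
}
```

Both points in the exchange are visible here: the quoted ">20x" claim is about loops like sum_avx replacing many scalar instructions, while the occupancy objection is about what happens when n is small or ragged and the full 256-bit datapath is driven for mostly-empty vectors.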
Quote:
I assume you are still talking in the context of auto-vectorizing here.
Your assumptions here about saturating the vector units are fundamentally flawed. Auto-vectorization only happens when the data is dense and the operations in a loop easily translate to AVX operations. It's not like the compiler will take random FPU operations and stuff them together in vectors.

Auto-vectorization will not hurt your efficiency or performance, but there are some considerations:
- Sometimes the gains are negligible, because the code is too bloated, the data isn't dense, and/or the operations inside the loops aren't simple enough (see the sketch after this quote).
- If FMA is enabled, the results will no longer be bit-identical to non-FMA code, which may or may not be a problem.
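A minimal sketch of the distinction drawn above, in C (the function names are the editor's illustration, not from the thread). The first loop is dense, independent, element-wise work that GCC and Clang will auto-vectorize at -O3 (or with -ftree-vectorize); the second carries a dependency between iterations, so the compiler leaves it scalar. You can confirm which loops vectorized with -fopt-info-vec on GCC or -Rpass=loop-vectorize on Clang:

```c
// Dense, independent work: each element is computed the same
// way with no cross-iteration dependency, so the compiler can
// pack 8 float iterations into one AVX operation. restrict
// tells it y and x don't alias, removing the last obstacle.
void saxpy(float *restrict y, const float *restrict x,
           float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// Loop-carried dependency: iteration i needs the value just
// written in iteration i-1, so iterations cannot be packed
// into a vector and the loop stays scalar.
void recurrence(float *x, int n) {
    for (int i = 1; i < n; i++)
        x[i] = 0.5f * x[i - 1] + x[i];
}
```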
Agreed on FMA. If you vectorize an operation, though, there is a power penalty from driving the full width of AVX when you're not using the entire thing. Even at half occupancy, in situations where time isn't the limiting factor, a smaller, slim core is likely going to use less power. I think you underestimate how many more transistors it takes to implement AVX and the cache backing to support it. Just because auto-vectorization runs doesn't mean it's always a perfect situation where you'll get 100% occupancy. Sure, performance might not get worse, but power consumption can.
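On the FMA point both posters agree on, a short sketch of why FMA results stop being bit-identical (the values here are the editor's illustration): a fused multiply-add rounds once instead of twice, skipping the intermediate rounding that the separate multiply and add perform.

```c
#include <math.h>   // fma()
#include <stdio.h>

// Build with -ffp-contract=off so the compiler doesn't fuse
// the "separate" expression into an FMA on its own.
int main(void) {
    double x = 0.1;  // 0.1 is not exactly representable in binary

    // Separate multiply and add: x * 10.0 is rounded to exactly
    // 1.0 before the subtraction, so the result is 0.0.
    double separate = x * 10.0 - 1.0;

    // Fused multiply-add: the exact product is carried into the
    // addition, so the representation error of 0.1 survives.
    double fused = fma(x, 10.0, -1.0);

    printf("separate = %g\n", separate);  // 0
    printf("fused    = %g\n", fused);     // ~5.55112e-17
    return 0;
}
```

Neither answer is wrong (the fused one is actually closer to the exact result), but as the quote notes, anything that compares outputs byte-for-byte across builds will see them differ.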
Quote:
ARM achieves efficiency with special instructions to accelerate specific workloads, and yes, an ASIC will beat SIMD in efficiency, but SIMD is general purpose.
...and Apple specifically made the choice to make slim cores that couldn't do everything for that case. It's not like the slim cores are ASIC circuits; that hardware is going to be in the high-power cores. Once again, you're focusing on full-load, high-power situations. Outside of that context, it doesn't make sense to have that kind of bulk in the slim cores, because you're negating the advantage they have and you might as well just have a bunch of high-power cores instead. But the reality is that most CPUs aren't running full tilt most of the time, and there are efficiency benefits to be had from that.
Quote:
Hell, I learned a fair bit about AVX from you nerds arguing. Keep it up, education is good.
Yes it is, but so is properly understanding the problem being solved.
 