Intel "Raptor Lake" is a 24-core (8 Big + 16 Little) Processor

Joined
Dec 14, 2013
Messages
2,744 (0.68/day)
Location
Alabama
Processor Ryzen 2600
Motherboard X470 Tachi Ultimate
Cooling AM3+ Wraith CPU cooler
Memory C.R.S.
Video Card(s) GTX 970
Software Linux Peppermint 10
Benchmark Scores Never high enough
Way.... too.... many.... lakes
They're just trying to make a splash or sixteen.

TBH I'm not sure how this is going to work out.
With the luck they've had in recent memory, I'm starting to think that if this doesn't do well it could blow up in their face like Bulldozer did for AMD. Intel can survive it more readily, but at the same time every bit hurts just as much as it would help.

If Intel doesn't want to keep falling, everyone here knows what they need to do, but the big question is "how".
I'm not seeing this as being the way. It's more of a hybrid core setup that may have some advantages, but I'm thinking it's going to have more disadvantages than advantages.

However:
That doesn't mean it won't work either. This could well be a surprisingly good chip, so let's see what happens before actually labeling it a failure.
 
Joined
Sep 14, 2020
Messages
602 (0.38/day)
Location
Greece
System Name Office / HP Prodesk 490 G3 MT (ex-office)
Processor Intel 13700 (90° limit) / Intel i7-6700
Motherboard Asus TUF Gaming H770 Pro / HP 805F H170
Cooling Noctua NH-U14S / Stock
Memory G. Skill Trident XMP 2x16gb DDR5 6400MHz cl32 / Samsung 2x8gb 2133MHz DDR4
Video Card(s) Asus RTX 3060 Ti Dual OC GDDR6X / Zotac GTX 1650 GDDR6 OC
Storage Samsung 2tb 980 PRO MZ / Samsung SSD 1TB 860 EVO + WD blue HDD 1TB (WD10EZEX)
Display(s) Eizo FlexScan EV2455 - 1920x1200 / Panasonic TX-32LS490E 32'' LED 1920x1080
Case Nanoxia Deep Silence 8 Pro / HP microtower
Audio Device(s) On board
Power Supply Seasonic Prime PX750 / OEM 300W bronze
Mouse MS cheap wired / Logitech cheap wired m90
Keyboard MS cheap wired / HP cheap wired
Software W11 / W7 Pro ->10 Pro
Well we are talking about the 8/16 part in the OP, not the single big core variants, soooo, I mean yeh, you do you, but we're talking about what might do us, who needs mooooooooar cores.
Actually, the "leak" from "Moore's Law is Dead" that we are talking about has many more details and possible configurations. I already commented on small workstations earlier.
 
Joined
Apr 24, 2020
Messages
2,745 (1.59/day)
Can you explain AVX to me in a quick summary? I have never understood it... I am guessing it only applies to workstations, not actual gamers?

A normal computer is basically a glorified calculator. If it sees the "add" instruction, then you get "x + y". If it sees the "mul" instruction, it does "x*y".

AVX512 is a set of special instructions that say things like "add 512 bits in parallel", so with 32-bit values you get "x0 + y0, x1+y1, x2+y2... x15+y15" (16 operations in one instruction). AVX512 uses more power, but it's more efficient than 16 individual add instructions.
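To make the "x0 + y0 ... x15 + y15" picture concrete, here is a minimal sketch in plain Python. It only models what one 512-bit add computes; real hardware handles all lanes at once, and real code would use compiler vectorization or intrinsics. The 32-bit element size and the name `vector_add` are assumptions for illustration.

```python
# Scalar model of one 512-bit vector add on 32-bit lanes.
# Pure-Python illustration only; real SIMD hardware does all lanes at once.

VECTOR_BITS = 512
LANE_BITS = 32                      # assumption: 32-bit elements
LANES = VECTOR_BITS // LANE_BITS    # 16 lanes


def vector_add(x, y):
    """Add two LANES-wide vectors lane by lane: x0+y0, x1+y1, ..."""
    assert len(x) == LANES and len(y) == LANES
    return [xi + yi for xi, yi in zip(x, y)]


x = list(range(16))          # x0..x15 = 0..15
y = [10] * 16                # y0..y15 = 10
result = vector_add(x, y)    # one "instruction", 16 additions
```

With 64-bit elements the same 512 bits hold only 8 lanes, so the "16 operations" figure depends on the data type.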

Because of this weird "16 different adds" thing going on, it's actually very difficult to use. Common programmers largely haven't figured out how to use it; it's just not taught in universities. A lot of professional programmers study the instruction set and learn how to do it, however. Video-game programmers are surprisingly focused on efficiency, and I've actually seen evidence that video games use these kinds of instructions (ex: Command and Conquer's source code was released recently, and there are all sorts of MMX routines, MMX being the ancestor of AVX way back in the 1990s).

AVX512 contains most arithmetic people want: add, subtract, and multiply. (No integer division!! As is common in 'high performance' instruction sets, integer division is so inefficient that it's just left out altogether. Floating-point division is there, but it's slow.) Some video games do use AVX512, but it's not very common yet.

AVX2 (256-bit, or 8 operations per instruction) was first deployed in 2013 on 4th generation Intel (Haswell), and is also fully deployed on AMD Zen. As such, AVX2 (256-bit / 8 operations) is pretty standard these days.
 

Space Lynx

Astronaut
Joined
Oct 17, 2014
Messages
17,471 (4.67/day)
Location
Kepler-186f
Processor 7800X3D -25 all core
Motherboard B650 Steel Legend
Cooling Frost Commander 140
Video Card(s) Merc 310 7900 XT @3100 core -.75v
Display(s) Agon 27" QD-OLED Glossy 240hz 1440p
Case NZXT H710 (Red/Black)
Audio Device(s) Asgard 2, Modi 3, HD58X
Power Supply Corsair RM850x Gold
so eventually they will create an AVX specific to quantum computers too, if my logic is correct... ? lol
 
Joined
Apr 24, 2020
Messages
2,745 (1.59/day)
so eventually they will create an AVX specific to quantum computers too, if my logic is correct... ? lol

Quantum is completely different. Quantum "qubits" are "entangled", causing them to be correlated with each other in ways that physics professors understand (but that almost no normal programmer understands). Some professors are studying quantum instructions just in case a practical quantum computer is ever made (in theory, it can do things like factoring the numbers behind RSA, or O(√n) searching routines), but since it's impractical and very expensive today... few people bother to learn quantum theories.

AVX is just a weird form of parallelism that's hard to use (but very, very fast and efficient).

If you have an 8-big-core processor with AVX512, you can do 16 operations per core x 8 cores == 128 operations per processor per clock tick.
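That arithmetic can be written out as a tiny sketch. It assumes 32-bit elements and one vector instruction issued per core per clock tick, which is a simplification (real cores can often issue more than one, and clocks may drop under heavy vector load). The function name is made up for illustration.

```python
def peak_lane_ops_per_cycle(cores, vector_bits=512, lane_bits=32,
                            vector_ops_per_cycle=1):
    """Upper bound on element operations per clock tick for the whole chip.

    Assumption: every core issues `vector_ops_per_cycle` full-width vector
    instructions each cycle, which real workloads rarely sustain.
    """
    lanes = vector_bits // lane_bits
    return cores * lanes * vector_ops_per_cycle


avx512_8_cores = peak_lane_ops_per_cycle(cores=8)                   # 8 * 16
avx2_8_cores = peak_lane_ops_per_cycle(cores=8, vector_bits=256)    # 8 * 8
```

The same 8 cores limited to AVX2 would top out at half that figure under the same assumptions.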

These SIMD instructions are nearly everywhere, so I personally think high-performance programmers should definitely learn how to use them. AVX512 is Intel's newest version. SVE is ARM's version. GPU programs are implicitly SIMD from the ground up.
 
Joined
Mar 21, 2016
Messages
2,508 (0.78/day)
Well we are talking about the 8/16 part in the OP, not the single big core variants, soooo, I mean yeh, you do you, but we're talking about what might do us, who needs mooooooooar cores.
Don't ya mean mooooooooooooar glue++++++++!!? :rolleyes: Raptor Lake featuring Intel's finest leading edge nanometre glue technology.
 
Joined
Feb 20, 2020
Messages
9,340 (5.22/day)
Location
Louisiana
System Name Ghetto Rigs z490|x99|Acer 17 Nitro 7840hs/ 5600c40-2x16/ 4060/ 1tb acer stock m.2/ 4tb sn850x
Processor 10900k w/Optimus Foundation | 5930k w/Black Noctua D15
Motherboard z490 Maximus XII Apex | x99 Sabertooth
Cooling oCool D5 res-combo/280 GTX/ Optimus Foundation/ gpu water block | Blk D15
Memory Trident-Z Royal 4000c16 2x16gb | Trident-Z 3200c14 4x8gb
Video Card(s) Titan Xp-water | evga 980ti gaming-w/ air
Storage 970evo+500gb & sn850x 4tb | 860 pro 256gb | Acer m.2 1tb/ sn850x 4tb| Many2.5" sata's ssd 3.5hdd's
Display(s) 1-AOC G2460PG 24"G-Sync 144Hz/ 2nd 1-ASUS VG248QE 24"/ 3rd LG 43" series
Case D450 | Cherry Entertainment center on Test bench
Audio Device(s) Built in Realtek x2 with 2-Insignia 2.0 sound bars & 1-LG sound bar
Power Supply EVGA 1000P2 with APC AX1500 | 850P2 with CyberPower-GX1325U
Mouse Redragon 901 Perdition x3
Keyboard G710+x3
Software Win-7 pro x3 and win-10 & 11pro x3
Benchmark Scores Are in the benchmark section
Hi,
Seems this should really be named aneurysms lake lol
 
Joined
Mar 10, 2010
Messages
11,880 (2.19/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Gskill Trident Z 3900cas18 32Gb in four sticks./16Gb/16GB
Video Card(s) Asus tuf RX7900XT /Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores laptop Timespy 6506
Don't ya mean mooooooooooooar glue++++++++!!? :rolleyes: Raptor Lake featuring Intel's finest leading edge nanometre glue technology.
I nearly shouted back "who doesn't love glue" but shit there's too many connotations there.
I'm a fan of glue if it brings new tech, tech glue obvs.
 
Joined
Jun 12, 2017
Messages
136 (0.05/day)
AVX is not something rare. It can be auto-generated by the compiler. If you ever turn on the -O3 optimization flag (which includes the -ftree-loop-vectorize option) in GCC, then you get auto-vectorization for all eligible loops in your code. If you write simple loops, they will be transformed into equivalent AVX/SSE code depending on your target architecture.

Normally, for educational purposes and in open source communities, -O3 is not something you see every day. But for production use, -O3 is enabled almost everywhere.
 
Joined
Jul 7, 2019
Messages
944 (0.47/day)
And people will be buying the wagon with the 8 horses and the 16 dogs, thinking they are getting 24 horses, while AMD will be trying to sell a cart with "only" 16 horses.
Missed opportunity there for it to have been 8 Clydesdales and 16 Miniature Ponies, while AMD remains consistent with 16 Clydesdales. 24 horses vs 16, but not all 24 are the same.
 
Joined
Apr 24, 2020
Messages
2,745 (1.59/day)
AVX is not something rare. It can be auto-generated by the compiler. If you ever turn on the -O3 optimization flag (which includes the -ftree-loop-vectorize option) in GCC, then you get auto-vectorization for all eligible loops in your code. If you write simple loops, they will be transformed into equivalent AVX/SSE code depending on your target architecture.

Normally, for educational purposes and in open source communities, -O3 is not something you see every day. But for production use, -O3 is enabled almost everywhere.

Autogenerated AVX is not very good in my experience.

Good AVX code requires understanding cache alignment, array-of-structs vs struct-of-arrays and other such concepts that just aren't taught. One of the most important paradigms is prefix-sum, and that's just... not taught anywhere as far as I can tell.
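For anyone who hasn't met prefix-sum, here is a rough sketch of the data-parallel version (the textbook Hillis-Steele scan), modelled in plain Python. Each pass is a whole-vector "shift and add", which is the kind of step SIMD hardware is good at. The function name is made up, and this shows the idea only, not tuned AVX code.

```python
from itertools import accumulate


def prefix_sum_simd_style(values):
    """Inclusive prefix sum using log2(n) whole-vector shift-and-add steps."""
    out = list(values)
    shift = 1
    while shift < len(out):
        # "Shift" the vector right by `shift` lanes, filling with zeros,
        # then add it to itself. Every lane is updated in the same step.
        shifted = [0] * shift + out[:-shift]
        out = [a + b for a, b in zip(out, shifted)]
        shift *= 2
    return out


data = [3, 1, 4, 1, 5, 9, 2, 6]
scan = prefix_sum_simd_style(data)

# Cross-check against the ordinary sequential running total.
assert scan == list(accumulate(data))
```

For 8 elements this takes 3 vector steps instead of 7 dependent scalar adds, at the price of doing more additions in total.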

Autogenerated AVX only really works on the simplest of SAXPY-style code that a college student can probably parallelize. It's nice that a lot of these simple cases are auto-handled by a compiler these days, but really good code requires understanding the SIMD programming model.
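For reference, SAXPY ("single-precision a times x plus y") is about as simple as a loop gets, and a sketch of it looks like this. It is written in Python purely to show the shape of the computation. In C it would be a plain `for` loop over arrays, which is the form a compiler would actually vectorize.

```python
def saxpy(a, x, y):
    """Return a*x[i] + y[i] for every i (the SAXPY kernel).

    Each output element depends only on the inputs at the same index,
    so there is nothing stopping all elements being computed in parallel.
    """
    return [a * xi + yi for xi, yi in zip(x, y)]


out = saxpy(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

The lack of any dependency between iterations is what makes this the easy case. Loops with branches, scattered memory access or running totals are where auto-vectorization tends to give up.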
 
Joined
Jun 10, 2014
Messages
3,010 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Can you explain AVX to me in a quick summary? I have never understood it... I am guessing it only applies to workstations, not actual gamers?
AVX, like other forms of SIMD, addresses data-level parallelism, which is whenever you want to do the same operation on multiple pieces of data. By performing multiple math operations in a single instruction, you don't just save those instructions, you also save "overhead" in terms of looping and moving data around in registers, plus you gain efficiency for the instruction cache. Done right, the performance gains could be massive. But doing it right usually requires hand-written low-level code.
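A quick way to see the "overhead" point is to count loop iterations, since each iteration pays for a compare, a branch and an index update on top of the math. The sketch below assumes 32-bit elements (8 lanes for AVX2, 16 for AVX512) and uses a made-up helper name.

```python
import math


def loop_iterations(n_elements, lanes):
    """Iterations needed to process n_elements, `lanes` elements at a time."""
    return math.ceil(n_elements / lanes)


n = 1024
scalar_iters = loop_iterations(n, 1)    # one element per iteration
avx2_iters = loop_iterations(n, 8)      # 256-bit vectors of 32-bit values
avx512_iters = loop_iterations(n, 16)   # 512-bit vectors of 32-bit values
```

Fewer iterations means fewer loop-control instructions to fetch and decode, which is where the instruction-cache benefit comes from.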

Think of it as hardware dedicated to a certain task, very much the same as hardware accelerated playback for say H264 on a GPU
Most of AVX is actually general purpose; only some smaller subsets address specific algorithms or things like "AI".

24c ~ that's their selling point, come on it's not like the masses will know 5950x(t?) will pulverize it in MT tasks! They'll probably go OMG 24 cores, without actually looking at what the cores are capable of :shadedshu:
Most desktops are sold through the large PC manufacturers, where "specs" sell. That's the primary motivation for this hybrid core nonsense on the desktop.

Games use AVX all the time.
Fairly few games use AVX directly, but hopefully more will soon.
 
Joined
Jun 12, 2017
Messages
136 (0.05/day)
Autogenerated AVX is not very good in my experience.

Good AVX code requires understanding cache alignment, array-of-structs vs struct-of-arrays and other such concepts that just aren't taught. One of the most important paradigms is prefix-sum, and that's just... not taught anywhere as far as I can tell.

Autogenerated AVX only really works on the simplest of SAXPY-style code that a college student can probably parallelize. It's nice that a lot of these simple cases are auto-handled by a compiler these days, but really good code requires understanding the SIMD programming model.
Yes, auto-vectorization does not do much, but it does carry around a 3% improvement in most packages (I remember seeing that figure from some Gentoo compilation tests). That's still something for free.
 
Joined
Jun 10, 2014
Messages
3,010 (0.78/day)
Yes, auto-vectorization does not do much, but it does carry around a 3% improvement in most packages (I remember seeing that figure from some Gentoo compilation tests). That's still something for free.
It's much better than that. It might be 3% on average, across all kinds of software. But with more performant code, which usually is more cache optimized, we can often see 10-15% or more, so the free gains can be quite significant.
But with that being said, auto-vectorization is still nothing compared to well-used AVX intrinsics.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Yes, auto-vectorization does not do much, but it does carry around a 3% improvement in most packages (I remember seeing that figure from some Gentoo compilation tests). That's still something for free.
Except it's not for free. AVX uses more power, so while you might gain some performance, you're getting it at the cost of additional power usage, because more circuitry simply translates to more heat. In the case of auto-vectorization, it's likely not worth the cost if you care about power efficiency. The point of the chopped down cores is to be efficient, not fast. It is more efficient to use smaller cores for things that can't be significantly improved by AVX or where the time you have to complete the computation is long enough to be processed by a slower and cut down core. Also, implementing parallelization in the code itself is likely going to be easier than letting auto-vectorization figure it out, getting intrinsics to work well, or flat out writing assembly for those cases.

So even if a task can use AVX, depending on the use case, it might be overkill. Putting AVX on power efficient cores is practically a contradiction as well because AVX is anything but efficient in terms of power consumption because it's a very wide execution unit. Same deal with cache. SRAM takes up a lot of space and makes a good amount of heat.
 
Joined
Jun 10, 2014
Messages
3,010 (0.78/day)
Except it's not for free. AVX uses more power so while you might gain some performance, you're getting it at the cost of additional power usage because more circuitry simply translates to more heat.
This is very inaccurate.
AVX does not use more power per computation, it's actually far more efficient. The only times AVX loads use "more power" is when they perform a lot more work than you could do without AVX.
As a matter of fact, in all x86-64 designs, all FPU operations are fed through AVX-/SSE-units. The difference is whether the AVX units are fed single or multiple pieces of data.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
This is very inaccurate.
AVX does not use more power per computation, it's actually far more efficient. The only times AVX loads use "more power" is when they perform a lot more work than you could do without AVX.
As a matter of fact, in all x86-64 designs, all FPU operations are fed through AVX. The difference is whether the AVX units are fed single or multiple pieces of data.
Sure, if you're filling up all of the FPUs involved in each AVX op. FPUs that aren't being fully used in the vector math still add heat, because you don't need the full width and they aren't powergated. So unless you're building it to fully load them up, you're not going to see those kinds of gains. Simply put, it's only efficient if it's been tuned to be. Auto-vectorization is really unlikely to get you that level of occupancy when running AVX ops. It's only efficient if you can fill the vector.
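The "fill the vector" point can be put into numbers with a small sketch. It assumes 16 lanes (512-bit vectors of 32-bit values) and that leftover elements still cost a full-width op. The helper name is made up, and real code may handle tails with masking or a scalar loop instead.

```python
import math


def lane_occupancy(n_elements, lanes=16):
    """Fraction of lanes doing useful work when n_elements are packed into vectors."""
    if n_elements == 0:
        return 0.0
    vector_ops = math.ceil(n_elements / lanes)
    return n_elements / (vector_ops * lanes)


full = lane_occupancy(64)      # 4 full vectors
partial = lane_occupancy(20)   # 2 vectors, second one mostly empty
tiny = lane_occupancy(3)       # 1 vector, 3 of 16 lanes used
```

Whether the idle lanes actually burn meaningful power is the part being argued about here; the sketch only shows how quickly occupancy drops for short or ragged data.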
 
Joined
Jun 10, 2014
Messages
3,010 (0.78/day)
Sure, if you're filling up all of the FPUs involved in each AVX op. FPUs that aren't being fully used in the vector math still add heat, because you don't need the full width and they aren't powergated. So unless you're building it to fully load them up, you're not going to see those kinds of gains. Simply put, it's only efficient if it's been tuned to be. Auto-vectorization is really unlikely to get you that level of occupancy when running AVX ops. It's only efficient if you can fill the vector.
The CPU will run it through the vector units (AVX units) regardless. Single FPU operations are the ones which will cause unfilled vectors, and be the least efficient. Auto vectorization will only occur when the data and operations are aligned properly, and will not get less efficiency than single FPU operations (through vector units).
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
The CPU will run it through the vector units (AVX units) regardless. Single FPU operations are the ones which will cause unfilled vectors, and be the least efficient. Auto vectorization will only occur when the data and operations are aligned properly, and will not get less efficiency than single FPU operations (through vector units).
I seriously doubt that the full width is active when you're running a normal FP op because the width of the op being asked for is static. This is a case where the rest of it is probably powergated because it's known how wide the FPU(s) needs to be, but in the case of actually using AVX, the full width of the unit is active even if you can't get full occupancy of the vector that you're doing the op on. There is a difference between the two.
 
Joined
Jun 10, 2014
Messages
3,010 (0.78/day)
I seriously doubt that the full width is active when you're running a normal FP op because the width of the op being asked for is static. This is a case where the rest of it is probably powergated because it's known how wide the FPU(s) needs to be, but in the case of actually using AVX, the full width of the unit is active even if you can't get full occupancy of the vector that you're doing the op on. There is a difference between the two.
Even if it managed to power gate based on data width, it still wouldn't change the fact that 16 cycles of single floats would consume way more power than a single iteration of 16 floats, not to mention the additional MOVs and iterations saved on top of that. So your claim of AVX being less efficient is fundamentally flawed.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
it still wouldn't change the fact that 16 cycles of single floats would consume way more power than a single iteration of 16 floats
That's where you're wrong. If you can spare the time, a slim core running at lower clock speeds with less cache will actually be more efficient in terms of power used, even if it takes more cycles to complete, because you're driving those transistors at lower speeds and lower voltages, and there are overall fewer transistors, which means less leakage. You're only correct if the workload is big, and for general tasks, they're not. You're not using these cores to play games. You're using them when you're using the browser or doing things that can afford the latency penalty.
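A back-of-the-envelope sketch of that trade-off, using the textbook approximation that switching energy per cycle scales with capacitance times voltage squared. All the numbers below are invented for illustration (a big core with 4x the switched capacitance at 1.2 V versus a slim core at 0.8 V), and leakage is ignored entirely.

```python
def dynamic_energy(rel_capacitance, voltage, cycles):
    """Relative switching energy: E ~ C * V^2 per cycle, times cycle count."""
    return rel_capacitance * voltage ** 2 * cycles


# Hypothetical figures, not measurements of any real core.
big = dynamic_energy(rel_capacitance=4.0, voltage=1.2, cycles=1)
slim_3x = dynamic_energy(rel_capacitance=1.0, voltage=0.8, cycles=3)
slim_16x = dynamic_energy(rel_capacitance=1.0, voltage=0.8, cycles=16)
```

With these made-up inputs the slim core wins if it needs 3x the cycles and loses if it needs 16x, so the answer hinges on how much extra work the missing vector width really causes for a given task.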

I'm not arguing that AVX isn't faster or more efficient under full load. I'm saying that unless you're under full load, AVX and big cores aren't efficient in comparison to slim cores tuned for power efficiency.

it still wouldn't change the fact that 16 cycles of single floats would consume way more power than a single iteration of 16 floats, not to mention the additional MOVs and iterations saved on top of that.
You must have missed the part where I said:
Sure, if you're filling up all of the FPUs involved in each AVX op. FPUs that aren't being fully used in the vector math still add heat, because you don't need the full width and they aren't powergated. So unless you're building it to fully load them up, you're not going to see those kinds of gains. Simply put, it's only efficient if it's been tuned to be. Auto-vectorization is really unlikely to get you that level of occupancy when running AVX ops. It's only efficient if you can fill the vector.
If you have a really heavy workload like that, you probably are already using the high power cores. You're assuming that the vector is always 16 values wide when that's highly unlikely unless the code has been specifically tuned to use the entire width of an AVX op. The reality is that most of the time, full AVX width isn't used unless the software is tuned for it and that work isn't going to be done unless it's necessary and will benefit from it. Either way, you're running on the assumption that AVX is going to help everything and it's not. The low power cores are to handle the other cases which happen far more frequently than hitting AVX.

With that said, if you don't believe me, just look at the power consumption numbers for Apple's M1 chip and I assure you it's not just because it's ARM and on a smaller process.
 
Joined
Mar 21, 2016
Messages
2,508 (0.78/day)
I could see Intel taking the cache structure and having two identical cores rather than big.LITTLE, where they adjust the typical cache structure between two otherwise identical chip dies. Probably the L3 would be identical, so you could get full multi-thread performance across the L3 cache. The L1 and L2 could be more varied, where one die might have a bigger L1 and smaller L2 while the other would have a smaller L1 and bigger L2. They could basically trade off L1/L2 cache sizes for cache latency between dies and then, depending on the task, pick the faster of the two, up to the core and threading limits of each die's L1/L2 caches. They might even be capable of combining and operating at the latency of the slower L1/L2 cache, though possibly limited to double the smaller cache size. That would still be better than a cache miss and falling through to the next slower cache (L2/L3), however.
[Attached image: 1623538653570.png]


To summarize: in most instances it would likely be faster, but in a few instances it could be marginally slower for the L1 and L2 cache, up to a point. It would be marginally slower on some clock cycles, but the by-product is less waste heat, which lets the chip dies turbo boost higher or for longer on the following clock cycles, so it balances out a bit. It provides both ample power-down savings and heat reduction while improving L1/L2 cache performance a reasonable amount of the time. Perhaps they could bake that into each die so the OS sees each core as identical, but the core itself is selective and only runs half the threads when and where needed, as opposed to all of them. Basically it could be like a truer form of hyperthreading on alternating clock cycles at times, or a tiny cache latency penalty in some heavily multithreaded instances, though still a lot better than big.LITTLE on performance.
 
Joined
Jul 9, 2015
Messages
3,413 (0.98/day)
System Name M3401 notebook
Processor 5600H
Motherboard NA
Memory 16GB
Video Card(s) 3050
Storage 500GB SSD
Display(s) 14" OLED screen of the laptop
Software Windows 10
Benchmark Scores 3050 scores good 15-20% lower than average, despite ASUS's claims that it has uber cooling.
arm does low power better.
Ah, so that is why it gets beaten by EPYC, or why M1 (5nm) is getting beaten by Ryzen 4000 series (meh 7nm)... oh, doh, wait... :D

Current CPUs can execute ~3 AVX or 3 normal add instructions per cycle. So the 16X data points = 16X faster.
This does not explain why one would use CPU over GPU (thousands of shaders) for it.

And in general, why do we need that big/little thing again?
ASUS gaming (!!!) laptop with (amazing) 6800m in it had 10+ hour battery life (browsing/video :), not gaming of course), Ryzen 5000 series can get close to 20 hours, so what is the point?

Keep in mind that:
[Attached image: 1623560794472.png]


 
Joined
Jun 10, 2014
Messages
3,010 (0.78/day)
That's where you're wrong. If you can spare the time, a slim core running at lower clock speeds with less cache will actually be more efficient in terms of power used, even if it takes more cycles to complete, because you're driving those transistors at lower speeds and lower voltages, and there are overall fewer transistors, which means less leakage.
You are really grasping at straws here.
Doing the same work without AVX (or other SIMD) would usually require >20x the instructions, and you want to offset that extra power required by running the core at a very low clock speed, probably making it about 100x slower, this is not a very realistic usage scenario.
The fact remains that AVX is more power efficient.

I'm not arguing that AVX isn't faster or more efficient under full load. I'm saying that unless you're under full load, AVX and big cores aren't efficient in comparison to slim cores tuned for power efficiency.
I assume you are still talking in the context of auto-vectorizing here.
Your assumptions here about saturating the vector units are fundamentally flawed. Auto-vectorizing only happens when the data is dense and the operations in a loop easily translate to AVX operations. It's not like the compiler will take random FPU operations and stuff them together in vectors.

Auto-vectorization will not hurt your efficiency or performance, but there are some considerations:
- Sometimes the gains are negligible, because the code is too bloated, the data isn't dense and/or the operations inside the loops aren't simple enough.
- If FMA is enabled, the results will no longer be bit-for-bit identical to the non-FMA build, which may or may not be a problem.
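The FMA point is easy to demonstrate. A fused multiply-add rounds once, after computing a*b+c exactly, while separate multiply and add instructions round twice. The sketch below models the fused result with exact rational arithmetic from Python's standard library rather than calling a real FMA instruction; the inputs are hand-picked so that the two roundings disagree.

```python
from fractions import Fraction


def unfused(a, b, c):
    """a*b is rounded to a double, then the add is rounded again."""
    return a * b + c


def fused(a, b, c):
    """Model of FMA: compute a*b+c exactly, round once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

two_step = unfused(a, b, c)   # product rounds to 1.0, so the sum is 0.0
one_step = fused(a, b, c)     # exact result -2**-54 survives
```

The fused answer is the more accurate one, but anything that compares outputs bit for bit (checksums, regression tests, lockstep simulations) will notice the difference.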

With that said, if you don't believe me, just look at the power consumption numbers for Apple's M1 chip and I assure you it's not just because it's ARM and on a smaller process.
ARM achieves efficiency with special instructions to accelerate specific workloads, and yes, an ASIC will beat SIMD in efficiency, but SIMD is general purpose.
 

Mussels

Freshwater Moderator
Joined
Oct 6, 2004
Messages
58,413 (7.89/day)
Location
Oystralia
System Name Rainbow Sparkles (Power efficient, <350W gaming load)
Processor Ryzen R7 5800x3D (Undervolted, 4.45GHz all core)
Motherboard Asus x570-F (BIOS Modded)
Cooling Alphacool Apex UV - Alphacool Eisblock XPX Aurora + EK Quantum ARGB 3090 w/ active backplate
Memory 2x32GB DDR4 3600 Corsair Vengeance RGB @3866 C18-22-22-22-42 TRFC704 (1.4V Hynix MJR - SoC 1.15V)
Video Card(s) Galax RTX 3090 SG 24GB: Underclocked to 1700Mhz 0.750v (375W down to 250W))
Storage 2TB WD SN850 NVME + 1TB Sasmsung 970 Pro NVME + 1TB Intel 6000P NVME USB 3.2
Display(s) Phillips 32 32M1N5800A (4k144), LG 32" (4K60) | Gigabyte G32QC (2k165) | Phillips 328m6fjrmb (2K144)
Case Fractal Design R6
Audio Device(s) Logitech G560 | Corsair Void pro RGB |Blue Yeti mic
Power Supply Fractal Ion+ 2 860W (Platinum) (This thing is God-tier. Silent and TINY)
Mouse Logitech G Pro wireless + Steelseries Prisma XL
Keyboard Razer Huntsman TE ( Sexy white keycaps)
VR HMD Oculus Rift S + Quest 2
Software Windows 11 pro x64 (Yes, it's genuinely a good OS) OpenRGB - ditch the branded bloatware!
Benchmark Scores Nyooom.
Hell, i learned a fair bit about AVX from you nerds arguing


keep it up, education is good
 