No, it proves that the scheduler was unable to saturate those CUs with a single task.
If running two tasks that compete for the same resources in parallel yields a performance increase, then some of those resources had to be idling in the first place, because the scheduler could not feed them instructions. There is no other way the gain could appear.
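To make that concrete, here is a minimal CUDA sketch of the idea (the kernel, sizes, and loop counts are made up purely for illustration): run the same two kernels back-to-back on one stream, then again on two separate streams. If the two-stream run is faster, the only possible explanation is that a single grid was leaving execution units idle.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately small, ALU-heavy kernel so that one grid alone
// cannot fill every execution unit on a large GPU.
__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 20000; ++k)
            v = v * 1.0001f + 0.0001f;
        data[i] = v;
    }
}

// Time two launches: pass the same stream twice for serial execution,
// or two different streams to let the scheduler overlap them.
static float timeTwoLaunches(float* a, float* b, int n,
                             cudaStream_t s1, cudaStream_t s2)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busyKernel<<<n / 256, 256, 0, s1>>>(a, n);
    busyKernel<<<n / 256, 256, 0, s2>>>(b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 15;   // small on purpose: only 128 blocks of 256 threads
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    float serialMs     = timeTwoLaunches(a, b, n, s1, s1);
    float concurrentMs = timeTwoLaunches(a, b, n, s1, s2);

    // Any speedup in the second number means the first run
    // was leaving shaders idle.
    printf("serial: %.2f ms  concurrent: %.2f ms\n", serialMs, concurrentMs);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```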
The difference is in how the tasks are handed out, and the whole point is to get more instructions onto idle shaders. But the two vendors take dramatically different approaches: NVidia does best with limited async compute, with instructions issued in a mostly serial fashion.
Source: https://scalibq.wordpress.com/

> So that is the way nVidia approaches multiple workloads. They have very high granularity in when they are able to switch between workloads. This approach bears similarities to time-slicing, and perhaps also SMT, as in being able to switch between contexts down to the instruction-level. This should lend itself very well for low-latency type scenarios, with a mostly serial nature. Scheduling can be done just-in-time.
>
> AMD on the other hand seems to approach it more like a ‘multi-core’ system, where you have multiple ‘asynchronous compute engines’ or ACEs (up to 8 currently), which each processes its own queues of work. This is nice for inherently parallel/concurrent workloads, but is less flexible in terms of scheduling. It’s more of a fire-and-forget approach: once you drop your workload into the queue of a given ACE, it will be executed by that ACE, regardless of what the others are doing. So scheduling seems to be more ahead-of-time (at the high level, the ACEs take care of interleaving the code at the lower level, much like how out-of-order execution works on a conventional CPU).
>
> And until we have a decent collection of software making use of this feature, it’s very difficult to say which approach will be best suited for the real-world. And even then, the situation may arise, where there are two equally valid workloads in widespread use, where one workload favours one architecture, and the other workload favours the other, so there is not a single answer to what the best architecture will be in practice.
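On the nVidia side, the closest thing an application can steer directly is CUDA stream priorities, which give a rough feel for that fine-grained switching: blocks from a higher-priority stream can be dispatched ahead of queued blocks from a lower-priority one as execution units free up. A minimal sketch (kernel names and sizes are mine, purely for illustration; how much actually interleaves depends on the GPU and driver):

```cpp
#include <cuda_runtime.h>

// A long-running bulk job and a small latency-sensitive job.
__global__ void bulkWork(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 50000; ++k)
            v = v * 1.0001f + 0.0001f;
        data[i] = v;
    }
}

__global__ void urgentWork(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int nBulk = 1 << 22, nUrgent = 1 << 12;
    float *a, *b;
    cudaMalloc(&a, nBulk * sizeof(float));
    cudaMalloc(&b, nUrgent * sizeof(float));

    // Ask the driver for its supported priority range. Note that in
    // CUDA a numerically lower value means a higher priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t bulk, urgent;
    cudaStreamCreateWithPriority(&bulk,   cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest);

    // Submit the long job first so it starts filling the machine...
    bulkWork<<<nBulk / 256, 256, 0, bulk>>>(a, nBulk);
    // ...then the urgent job: its blocks get scheduled ahead of the
    // bulk job's still-waiting blocks rather than behind all of them.
    urgentWork<<<nUrgent / 256, 256, 0, urgent>>>(b, nUrgent);

    cudaDeviceSynchronize();
    cudaStreamDestroy(bulk);
    cudaStreamDestroy(urgent);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```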
This is why NVidia cards shine so well today: current APIs send out instructions in a mostly serial fashion, where preemption works relatively well. The new APIs, however, can feed the GPU inherently parallel workloads, and that is where AMD cards shine.
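For reference, this is roughly how the new APIs expose it. In D3D12 you create separate hardware queues, and anything dropped on a compute queue can be picked up by the compute front-ends (AMD's ACEs) while the direct queue keeps feeding graphics. A bare-bones host-side sketch of just the queue setup (no command lists or fences, error handling omitted):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d12.lib")
using Microsoft::WRL::ComPtr;

int main()
{
    // Create a device on the default adapter.
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_12_0, IID_PPV_ARGS(&device));

    // DIRECT queue: accepts graphics, compute, and copy work.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // COMPUTE queue: compute and copy only. Work submitted here can run
    // concurrently with the graphics queue if the hardware supports it.
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // From here, command lists recorded for each queue execute
    // independently; fences are used where the two must synchronize.
    return 0;
}
```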
Please bear in mind I am not bashing either approach. NV cards are pure muscle, and I love it, but that muscle comes at a price. AMD's approach delivers that kind of power without the brute force, which is good for everyone and more cost-effective when utilized correctly.