
Atlas Fallen Optimization Fail: Gain 50% Additional Performance by Turning off the E-cores

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,252 (7.54/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Action RPG "Atlas Fallen" joins a long line of RPGs for you to grind into this summer, alongside Baldur's Gate 3, Diablo 4, and Starfield. We've been testing the game for our GPU performance article and found something interesting: the game isn't optimized for Intel Hybrid processors, such as the Core i9-13900K "Raptor Lake" on our test bench. The game scales across all CPU cores, which is normally a good thing, until you realize that it saturates not only all 8 P-cores but also all 16 E-cores. It ends up at under 80 FPS in busy gameplay at 1080p with a GeForce RTX 4090. Performance is "restored" only when the E-cores are disabled.

Normally, when a game saturates all of the E-cores, we don't interpret it as the game being "aware" of E-cores, but rather "unaware" of them. An ideal Hybrid-aware game should saturate the P-cores with its main workload and use the E-cores for errands such as processing the audio stack (DSPs from the game), the network stack (the game's unique multiplayer network component), physics, in-flight decompression of assets from the disk, etc., which show up in Task Manager as intermittent, irregular load. "Atlas Fallen" appears to be using the E-cores for its main worker threads, and this imposes a performance penalty, as we found out by disabling the E-cores. The penalty arises because the E-cores run at lower clock speeds than the P-cores, have much lower IPC, and are cache-starved. Frame data being processed by the P-cores ends up having to wait for data from the E-cores, which brings the overall framerate down.
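The "P-cores wait for E-cores" effect can be illustrated with a toy model. This is only a sketch under stated assumptions, not the game's actual scheduler: the function name `frame_time`, the relative core speeds (an E-core at half the speed of a P-core), and the amount of work per frame are all made up for illustration.

```python
def frame_time(work, core_speeds, shares=None):
    """Time until the slowest worker finishes its share of one frame.

    All workers are assumed to meet at a barrier at the end of the frame,
    so the frame takes as long as the slowest worker.
    """
    n = len(core_speeds)
    if shares is None:
        shares = [1.0 / n] * n  # naive split: every core gets the same amount of work
    return max(work * share / speed for share, speed in zip(shares, core_speeds))


# Assumed relative speeds, NOT measurements.
P_SPEED, E_SPEED = 1.0, 0.5
hybrid = [P_SPEED] * 8 + [E_SPEED] * 16  # 8 P-cores + 16 E-cores, SMT ignored
work = 24.0                              # arbitrary units of work per frame

# Equal split: each E-core needs twice as long as a P-core for the same chunk.
equal_split = frame_time(work, hybrid)   # about 2.0

# Share of the frame each P-core spends idle, waiting at the barrier.
p_idle_fraction = 1.0 - (work / len(hybrid) / P_SPEED) / equal_split  # about 0.5

# Split proportional to core speed: all workers finish together.
total_speed = sum(hybrid)
weighted_split = frame_time(work, hybrid, [s / total_speed for s in hybrid])  # about 1.5
```

With these assumed numbers, an equal split leaves every P-core idle for about half of each frame. The model ignores synchronization cost and cache contention, so it shows only the waiting and does not by itself reproduce the measured FPS drop.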



In the Task Manager screenshot above, the game is running in the foreground; we set Task Manager to be "always on top" so that Thread Director won't interfere with the game. Thread Director prefers to allocate the P-cores to foreground tasks, which doesn't happen here, because the developers specifically chose to put work on the E-Cores.

For comparison, we took four screenshots, with E-Cores enabled and disabled (through the BIOS). We picked a "typical average" scene instead of a worst case, which is why the FPS are a bit higher. As you can see, framerates with E-Cores enabled are pretty low (136 / 152 FPS), whereas turning off the E-Cores instantly increases performance right up to the engine's internal FPS cap (187 / 197 FPS).

With the E-cores disabled, the game is confined to what is essentially an 8-core/16-thread processor with just P-cores, which boost well above the 5.00 GHz mark, and have the full 36 MB slab of L3 cache to themselves. The framerate now shoots up to 200 FPS, which is a hard framerate limit set by the developer. Our RTX 4090 should be capable of higher framerates, and developers Deck13 Interactive should consider raising it, given that monitor refresh-rates are on the rise, and it's fairly easy to find a 240 Hz or 360 Hz monitor in the high-end segment. The game is based on the Fledge engine, and supports both DirectX 12 and Vulkan APIs. We used GeForce 536.99 WHQL in our testing. Be sure to check out our full performance review of Atlas Fallen later today.

 
Joined
Feb 18, 2005
Messages
5,847 (0.81/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) 3x AOC Q32E2N (32" 2560x1440 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G602
Keyboard Razer Pro Type Ultra
Software Windows 10 Professional x64
Do games themselves have to be optimised/aware of P- versus E-cores? I was under the impression that Intel Thread Director + the Win11 scheduling was sufficient for this, but I guess if there's a bug in either of those components it would also manifest in this regard.
 
Joined
Jun 27, 2019
Messages
2,109 (1.06/day)
Location
Hungary
System Name I don't name my systems.
Processor i5-12600KF 'stock power limits/-115mV undervolt+contact frame'
Motherboard Asus Prime B660-PLUS D4
Cooling ID-Cooling SE 224 XT ARGB V3 'CPU', 4x Be Quiet! Light Wings + 2x Arctic P12 black case fans.
Memory 4x8GB G.SKILL Ripjaws V DDR4 3200MHz
Video Card(s) Asus TuF V2 RTX 3060 Ti @1920 MHz Core/@950mV Undervolt
Storage 4 TB WD Red, 1 TB Silicon Power A55 Sata, 1 TB Kingston A2000 NVMe, 256 GB Adata Spectrix s40g NVMe
Display(s) 29" 2560x1080 75Hz / LG 29WK600-W
Case Be Quiet! Pure Base 500 FX Black
Audio Device(s) Onboard + Hama uRage SoundZ 900+USB DAC
Power Supply Seasonic CORE GM 500W 80+ Gold
Mouse Canyon Puncher GM-20
Keyboard SPC Gear GK630K Tournament 'Kailh Brown'
Software Windows 10 Pro
@btarunr
You meant Diablo 4 and not 3. :)

Well I don't have E-cores anyway.:laugh: 'looking forward to the performance test '
I'm kind of interested in the game itself since it looks fun enough to me, maybe at a later point tho. 'got enough games to play for now'
 
Joined
Apr 12, 2013
Messages
7,545 (1.77/day)
AMD right about now!

Intel Thread Director + the Win11 scheduling
OS scheduling is independent of Thread Director. I have yet to see what TD actually does and how it compares to the similar, but much better, software solution I posted in the other thread!
 

W1zzard

Administrator
Staff member
Joined
May 14, 2004
Messages
27,863 (3.71/day)
Processor Ryzen 7 5700X
Memory 48 GB
Video Card(s) RTX 4080
Storage 2x HDD RAID 1, 3x M.2 NVMe
Display(s) 30" 2560x1600 + 19" 1280x1024
Software Windows 10 64-bit
Do games themselves have to be optimised/aware of P- versus E-cores? I was under the impression that Intel Thread Director + the Win11 scheduling was sufficient for this, but I guess if there's a bug in either of those components it would also manifest in this regard.
Games don't have to be optimized; Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems the Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel.
 
Joined
Feb 18, 2005
Messages
5,847 (0.81/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) 3x AOC Q32E2N (32" 2560x1440 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G602
Keyboard Razer Pro Type Ultra
Software Windows 10 Professional x64
Games don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
Is this simply a result of bad console porting then? Given that current console games don't have to be aware of the difference between P and E cores, since said difference doesn't exist on consoles?
 
Joined
Nov 13, 2007
Messages
10,788 (1.73/day)
Location
Austin Texas
System Name stress-less
Processor 9800X3D @ 5.42GHZ
Motherboard MSI PRO B650M-A Wifi
Cooling Thermalright Phantom Spirit EVO
Memory 64GB DDR5 6000 CL30-36-36-76
Video Card(s) RTX 4090 FE
Storage 2TB WD SN850, 4TB WD SN850X
Display(s) Alienware 32" 4k 240hz OLED
Case Jonsbo Z20
Audio Device(s) Yes
Power Supply Corsair SF750
Mouse DeathadderV2 X Hyperspeed
Keyboard 65% HE Keyboard
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
Disappointing, but I guess I will have to take the 15 seconds to lasso the game.

Probably the same thing on the 7950X3D.
 

W1zzard

Administrator
Staff member
Joined
May 14, 2004
Messages
27,863 (3.71/day)
Processor Ryzen 7 5700X
Memory 48 GB
Video Card(s) RTX 4080
Storage 2x HDD RAID 1, 3x M.2 NVMe
Display(s) 30" 2560x1600 + 19" 1280x1024
Software Windows 10 64-bit
Is this simply a result of bad console porting then? Given that current console games don't have to be aware of the difference between P and E cores, since said difference doesn't exist on consoles?
Yeah, or just "bad programming", possibly also "lack of QA testing"

Disappointing, but I guess i will have to take the 15 seconds to lasso the game.
In a quick test I tried setting affinity, but it didn't have the expected result: FPS are still low. I suspect the game sees x cores on startup and spawns x workers across the available cores. If you later move the x threads onto x minus 16 cores, the workloads clash.
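A rough sketch of that suspected pattern, in Python purely for illustration (the game's real code is unknown, and the helper names `usable_cpu_count` and `threads_per_cpu` are hypothetical). It assumes a worker pool sized once at startup. `os.sched_getaffinity` exists only on some platforms such as Linux, so the helper falls back to `os.cpu_count()`, which ignores affinity.

```python
import os


def usable_cpu_count():
    """Logical CPUs this process is allowed to run on right now."""
    if hasattr(os, "sched_getaffinity"):  # not available on Windows or macOS
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1            # fallback: ignores any affinity mask


def threads_per_cpu(workers_spawned, cpus_allowed):
    """Average number of busy worker threads competing for each logical CPU."""
    return workers_spawned / cpus_allowed


# Hypothetical numbers for a Core i9-13900K:
# 32 logical CPUs = 16 P-core threads + 16 E-cores.
workers = 32  # pool sized once at startup and never resized

print(threads_per_cpu(workers, 32))  # 1.0 -> one worker per logical CPU
print(threads_per_cpu(workers, 16))  # 2.0 -> after restricting affinity to the P-cores
```

If the pool were sized from the current affinity mask, and resized when the mask changes, the worker count would follow the restriction instead of staying at the startup value.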
 


Joined
Apr 13, 2023
Messages
413 (0.69/day)
Location
Warszawa
I wonder how much has changed for E-cores in gaming in general, e.g. in the latest games: are they still a work in progress, have they started making a bigger difference, or does their impact look the same as it did when it was popular to test it (a year or two ago)?
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Games don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
So in other words, software needs to work around this issue. :rolleyes:
Well, I saw this coming when I first heard of Alder Lake.

But if you argue that games should know how many cores are "fast" for spawning threads, then I would argue spawning threads dynamically for any synchronized workload is a bad idea anyways. (it doesn't matter for async workloads though…)

So how do you propose the software should know how many fast cores there are?
Using something like the CPUID instruction on each core? Is there an opcode which says something about the performance of each core?
Or are there new features in the Windows API to query for this?
(Keep in mind that there are already hybrid designs with three different core classes for ARM designs.)
 
Joined
Nov 13, 2007
Messages
10,788 (1.73/day)
Location
Austin Texas
System Name stress-less
Processor 9800X3D @ 5.42GHZ
Motherboard MSI PRO B650M-A Wifi
Cooling Thermalright Phantom Spirit EVO
Memory 64GB DDR5 6000 CL30-36-36-76
Video Card(s) RTX 4090 FE
Storage 2TB WD SN850, 4TB WD SN850X
Display(s) Alienware 32" 4k 240hz OLED
Case Jonsbo Z20
Audio Device(s) Yes
Power Supply Corsair SF750
Mouse DeathadderV2 X Hyperspeed
Keyboard 65% HE Keyboard
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
So in other words; software need to work around this issue. :rolleyes:
Well, I saw this coming when I first heard of Alder Lake.

But if you argue that games should know how many cores are "fast" for spawning threads, then I would argue spawning threads dynamically for any synchronized workload is a bad idea anyways. (it doesn't matter for async workloads though…)

So how do you propose the software should know how many fast cores there are?
Using something like the CPUID instruction on each core? Is there an opcode which says something about the performance of each core?
Or are there new features in the Windows API to query for this?
(Keep in mind that there are already hybrid designs with three different core classes for ARM designs.)

I can count on one hand how many times I've heard or experienced anything like this since Alder Lake. As W1z wrote above, the developers needed to do nothing... they did something silly and we got this.
 

W1zzard

Administrator
Staff member
Joined
May 14, 2004
Messages
27,863 (3.71/day)
Processor Ryzen 7 5700X
Memory 48 GB
Video Card(s) RTX 4080
Storage 2x HDD RAID 1, 3x M.2 NVMe
Display(s) 30" 2560x1600 + 19" 1280x1024
Software Windows 10 64-bit
So in other words; software need to work around this issue. :rolleyes:
No, the system is designed to automatically do the right thing, like in 99% of other games out there. The only other case that I know and have researched is CP2077, which does something "smart", using its own scheduler, which makes it fail on X3D

So how do you propose the software should know how many fast cores there are?
There are Windows APIs for that, and also various CPU instructions.
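As one concrete illustration on the CPU-instruction side: on Intel hybrid parts, CPUID leaf 0x1A reports the type of the core the instruction executes on in bits 31:24 of EAX (0x20 for an Atom-class E-core, 0x40 for a Core-class P-core). On the Windows side, `GetLogicalProcessorInformationEx` reports an `EfficiencyClass` value per core. Python cannot execute CPUID directly, so the sketch below only decodes a register value; the function name `decode_hybrid_leaf` and the sample EAX values are made up for illustration.

```python
# Core type values from CPUID leaf 0x1A, EAX bits 31:24.
CORE_TYPES = {
    0x20: "E-core (Intel Atom)",
    0x40: "P-core (Intel Core)",
}


def decode_hybrid_leaf(eax):
    """Split the EAX value of CPUID leaf 0x1A into (core type, native model ID)."""
    core_type = (eax >> 24) & 0xFF     # bits 31:24
    native_model_id = eax & 0xFFFFFF   # bits 23:0
    return CORE_TYPES.get(core_type, "unknown"), native_model_id


# Made-up sample values; real ones come from running CPUID on each logical CPU.
print(decode_hybrid_leaf(0x40000001))  # ('P-core (Intel Core)', 1)
print(decode_hybrid_leaf(0x20000001))  # ('E-core (Intel Atom)', 1)
```

A real query would have to pin a thread to each logical CPU in turn and execute CPUID there, since the leaf describes only the core it runs on.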
 
Joined
Feb 18, 2005
Messages
5,847 (0.81/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) 3x AOC Q32E2N (32" 2560x1440 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G602
Keyboard Razer Pro Type Ultra
Software Windows 10 Professional x64
So how do you propose the software should know how many fast cores there are?
Using something like the CPUID instruction on each core? Is there an opcode which says something about the performance of each core?
Or are there new features in the Windows API to query for this?
(Keep in mind that there are already hybrid designs with three different core classes for ARM designs.)
All of these were added with ADL and Win11.
 
Joined
Aug 12, 2022
Messages
248 (0.29/day)
But if you argue that games should know how many cores are "fast" for spawning threads, then I would argue spawning threads dynamically for any synchronized workload is a bad idea anyways. (it doesn't matter for async workloads though…)
Yeah I don't know how games are supposed to use threads but ideally you never put work on another thread that needs to be synchronous. And you especially never want to do that on all cores; what if the OS needs to do something, or the user is streaming, or the user is running a rendering task and limited it to just a few cores so that it wouldn't interfere much with the game? Based on what I've heard here, this game would have abysmal performance in any of those situations even on a 7800X3D.
 
Joined
Aug 12, 2022
Messages
248 (0.29/day)
I suspect performance would also improve if just some of the cores were disabled, even the P-cores. And I suspect performance would also suffer on the 7950X. If the game is trying to use threads that are inter-dependent, then it's spending a lot of CPU resources on thread synchronization, and that will get worse as the core count increases.
 
Joined
Feb 11, 2009
Messages
5,561 (0.96/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Auros Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercoolingloop
Memory Corsair (4x2) 8gb 1600mhz -> Crucial (8x2) 16gb 3600mhz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250gb SSD + WD 1tb x 2 + WD 2tb -> 2tb MVMe SSD
Display(s) Philips 32inch LPF5605H (television) -> Dell S3220DGF
Case antec 600 -> Thermaltake Tenor HTCP case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
Come on devs, what u doin
 

bug

Joined
May 22, 2015
Messages
13,794 (3.96/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Games don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
Even so, the behavior is still strange. I mean, Cinebench also puts load on all cores, but still runs faster when also employing the E-cores. There's something fishy in that code, beyond the sloppy scheduling.
 

bug

Joined
May 22, 2015
Messages
13,794 (3.96/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
The main failure was to introduce P/E idea to desktop CPUs. Who need efficient cores anyway in desktop?
They're a win for people running multithreaded workloads. Not because they're "efficient", but because they can squeeze more perf per sq mm (i.e. you can fit 3-4 E-cores where only 2 P-cores would fit and get better performance as a result).

E cores are not a failure, but, like any heterogeneous design, results are not uniform anymore, they will vary with workload.
 
Joined
Jun 21, 2019
Messages
44 (0.02/day)
They're a win for people running mutlithreaded workloads. Not because they're "efficient", but because they can squeeze more perf per sq mm (i.e. you can fit 3-4 E-cores where only 2 P-cores would fit and get better performance as a result).

E cores are not a failure, but, like any heterogenous design, results are not uniform anymore, they will vary with workload.

They are, but only in efficiency-limited environments like laptops. If all that matters is perf per sq mm, then all cores should be "efficient" ones, but that is not the case. Efficient cores are nice to have in a laptop to handle background tasks. Having efficient cores in a desktop CPU is just a waste and leads to performance degradation like here, where you can improve performance by 50% by turning them off.
 
Joined
Aug 12, 2022
Messages
248 (0.29/day)
Even so, the behavior is still strange. I mean, Cinebench also puts load on all cores, but still runs faster when also employing the E-cores. There's something fishy in that code, beyond the sloppy scheduling.
Cinebench probably has a different kind of workload. You can give a thread one large task and tell it to communicate at the end, or you can have a task that needs to communicate with other tasks frequently. If each thread has to communicate too frequently, it'll use a lot of its cycles just talking to other threads. And if it's waiting for another thread to finish, then it won't get any work done. The more time each thread spends on its own task without talking to another thread, the better multithreading works.
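That trade-off can be put into a back-of-the-envelope formula. This is a simplified model, not a measurement of either program: `grain` is the time a thread works on its own between two synchronization points, `sync_cost` is the time one synchronization takes, and both the function name and the numbers are assumptions chosen for illustration.

```python
def useful_fraction(grain, sync_cost):
    """Fraction of a thread's time spent on real work rather than synchronization."""
    return grain / (grain + sync_cost)


# Coarse-grained work (renderer-like): long stretches between syncs.
coarse = useful_fraction(grain=10.0, sync_cost=0.1)  # about 0.99

# Fine-grained work: the thread syncs as often as it works.
fine = useful_fraction(grain=0.1, sync_cost=0.1)     # 0.5
```

In this model, the same synchronization cost that is negligible for coarse-grained tasks eats half of the thread's time once the tasks become as short as the synchronization itself.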
 

bug

Joined
May 22, 2015
Messages
13,794 (3.96/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Cinebench probably has different kind of workload. You can give a thread one large task and tell it to communicate at the end, or you can have a task that needs to communicate with other tasks frequently. If each thread has to communicate too frequently, it'll use a lot of its cycles just talking to other threads. And if it's waiting for another thread to finish, then it won't get any work done. The more time each thread spends on its own task without talking to another thread, the better multithreading works.
It definitely has a different kind of workload. But it still doesn't make sense to reduce the overall computing power available and see the performance go up.
 