Thursday, August 10th 2023

Atlas Fallen Optimization Fail: Gain 50% Additional Performance by Turning off the E-cores

Action RPG "Atlas Fallen" joins a long line of RPGs this Summer for you to grind into—Baldur's Gate 3, Diablo 4, and Starfield. We've been testing the game for our GPU performance article, and found something interesting—the game isn't optimized for Intel Hybrid processors, such as the Core i9-13900K "Raptor Lake" in our bench. The game scales across all CPU cores—which is normally a good thing—until we realize that not only does it saturate all of the 8 P-cores, but also the 16 E-cores. It ends up with under 80 FPS in busy gameplay at 1080p with a GeForce RTX 4090. Performance is "restored" only when the E-cores are disabled.

Normally, when a game saturates all of the E-cores, we don't interpret it as the game being "aware" of E-cores, but rather "unaware" of them. An ideal Hybrid-aware game should saturate the P-cores for its main workload, and use the E-cores for errands such as processing the audio stack (DSPs from the game), network stack (the game's unique multiplayer network component), physics, in-flight decompression of assets from the disk, etc., which show up in Task Manager as intermittent, irregular load. "Atlas Fallen" appears to be using the E-cores for its main worker threads, and this is found imposing a performance penalty as we found out by disabling the E-cores. This performance penalty is because the E-cores run slower than P-cores, at lower clock speeds, have much lower IPC, and are cache-starved. Frame data being processed by the P-cores end up having to wait for those from the E-cores, which causes the overall framerate to come down.
In the Task Manager screenshot above, the game is running in the foreground, we set Task Manager to be "always on top," so Thread Director won't interfere with the game. It prefers to allocate the P-cores to foreground tasks, which doesn't happen here, because the developers chose to specifically put work on the E-Cores.

For comparison we took four screenshots, with E-Cores enabled and disabled (through BIOS). We picked a "typical average" scene instead of a worst case, which is why the FPS are a bit higher. As you can see, with E-Cores enabled are pretty low (136 / 152 FPS), whereas turning off the E-Cores instantly increases performance right up to the engine's internal FPS cap (187 / 197 FPS).

With the E-cores disabled, the game is confined to what is essentially an 8-core/16-thread processor with just P-cores, which boost well above the 5.00 GHz mark, and have the full 36 MB slab of L3 cache to themselves. The framerate now shoots up to 200 FPS, which is a hard framerate limit set by the developer. Our RTX 4090 should be capable of higher framerates, and developers Deck13 Interactive should consider raising it, given that monitor refresh-rates are on the rise, and it's fairly easy to find a 240 Hz or 360 Hz monitor in the high-end segment. The game is based on the Fledge engine, and supports both DirectX 12 and Vulkan APIs. We used GeForce 536.99 WHQL in our testing. Be sure to check out our full performance review of Atlas Fallen later today.
Add your own comment

120 Comments on Atlas Fallen Optimization Fail: Gain 50% Additional Performance by Turning off the E-cores

#51
Solaris17
Super Dainty Moderator
W1zzardAtlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
Middle of last year I rotated with a team of nothing but SDEs. They were having issues with performance on one of the new services we were spinning up. We were recording remote sessions and encoding them into video to be retrieved later.

They were not understanding why we were burning 192 core AMD systems and still getting poor performance. All of these guys were pretty removed from HW in general. I explained to them that we need to switch to our GPU compute cluster instead of using CPU threads since the GPUs can do HW En/Decode they were legit shocked.

We switched. Saved hundreds of thousands in internal costs took like 2 weeks for them to recode for GPUs. I got a promotion out of it. I rotated off the team not understanding how they made it that far.

Sometimes these guys literally just sit infront of a game engine and check a box "use all available CPU cores" I swear to god. I was on another team about 8 months later. I had to explain to a TAM (thankfully not an engineer) why 10gb/s links on our storage offload system did NOT mean 10 gigaBYTES/s and that the time quotes they were giving were going to be drastically off.

They got paid more than me.

Always shoot for the stars in your careers people. Even if you dont think you can cut it. The sky is already full of some pretty dim ones.
Posted on Reply
#54
R0H1T
How are they dominating? They literally had to disable AVX512 in ADL, in fact RPL could also have it just permanently(?) disabled.

Also I'm yet to see how/what thread director does? Does anyone actually have any benchmarks for it :wtf:
Solaris17Middle of last year I rotated with a team of nothing but SDEs. They were having issues with performance on one of the new services we were spinning up. We were recording remote sessions and encoding them into video to be retrieved later.
Most of them I have worked with have no idea about the latest & greatest in hardware, not that they always need to, but you'd think they'd at least try to make themselves acquainted with something as fundamental to their work?
Posted on Reply
#55
zlobby
Who would have thought? /s
Posted on Reply
#56
unwind-protect
Would be interesting to run the same test on a 7950x, once with all 16 cores and once with just 8 cores enabled.

That would tell you whether this is a thread synchronization problem that ends up having so much locking overhead that adding cores makes it slower instead of faster.
Posted on Reply
#57
Od1sseas
bugThey're a win for people running mutlithreaded workloads. Not because they're "efficient", but because they can squeeze more perf per sq mm (i.e. you can fit 3-4 E-cores where only 2 P-cores would fit and get better performance as a result).

E cores are not a failure, but, like any heterogenous design, results are not uniform anymore, they will vary with workload.
4 E-Cores=1 P-Core
Posted on Reply
#58
Battler624
I skimmed through this but I don't see you mentioning if you disabled E-Cores via Bios or just via task-manager affinity
Posted on Reply
#59
THU31
bugThey're a win for people running mutlithreaded workloads. Not because they're "efficient", but because they can squeeze more perf per sq mm (i.e. you can fit 3-4 E-cores where only 2 P-cores would fit and get better performance as a result).
Using the 13600K as an example, 8 E-cores offer ~60% more performance compared to 2 P-cores, while using ~40% more power. That's roughly a ~15% gain in efficiency, which is completely irrelevant on desktop.
That's why it actually makes no sense to have a 6P+8E SKU on desktop, when you could have an 8P+0E SKU offering very similar performance and power consumption.

Desktops are always getting the same chips as laptops. They could easily make a 12-core die instead of 8P+16E, but they wouldn't do it just for desktops. Besides, it's great marketing when your top CPU has 24 cores while the competition only has 16.

I don't mind them putting E-cores into i7's and i9's to offer more than 8 cores total, but including E-cores with fewer than 8 P-cores is just IDIOTIC.

There's also a reason why Sapphire Rapids server CPUs don't have E-cores. There's no need whatsoever.
Posted on Reply
#60
phanbuey
THU31Using the 13600K as an example, 8 E-cores offer ~60% more performance compared to 2 P-cores, while using ~40% more power. That's roughly a ~15% gain in efficiency, which is completely irrelevant on desktop.
That's why it actually makes no sense to have a 6P+8E SKU on desktop, when you could have an 8P+0E SKU offering very similar performance and power consumption.

Desktops are always getting the same chips as laptops. They could easily make a 12-core die instead of 8P+16E, but they wouldn't do it just for desktops. Besides, it's great marketing when your top CPU has 24 cores while the competition only has 16.

I don't mind them putting E-cores into i7's and i9's to offer more than 8 cores total, but including E-cores with fewer than 8 P-cores is just IDIOTIC.

There's also a reason why Sapphire Rapids server CPUs don't have E-cores. There's no need
13600K got priced out ultimately, but e cores are not efficiency, they're space saving with a little efficiency sprinkled in. When the 13600 cost as much as the 7700x, and with early AM5 it was a much better value.

Nowadays not so much.
Posted on Reply
#61
KLMR
I think the real strategy behind is widen the gap between desktop pcs and workstation-server computers. As it is: reduce pcie lanes, drop AVX512, drop ECC, increase M2 slots, etc.
Camouflage? The last 100 or 200 Mhz you can squezee from 50-100W on 8 "performance" cores.

No desktop consumer wins with E cores. No when you can park your cores for specific package power usages.
Posted on Reply
#62
Crackong
For those who seems confusing about AMD's version of P&E core :

AMD is using a regular core (P) and cache-reduced core (E)
They are in the SAME architecture, supports the SAME instructions, and works the SAME way in computing tasks.
The only difference is the cache size, which affects the speed of data fetching, so the small cores is slower in certain tasks.
So the system can treat them as "Faster core & Slower core" .
Modern OS deal with that approach for years ever since core boosting is introduced.
Therefore should have no problem loading multi threads within a single programme into AMD's P&E cores simultaneously.

Intel's P & E cores are in completely DIFFERENT architectures, supports DIFFERENT instructions and works in DIFFERENT ways in computing tasks.
And this is the origin of all the scheduling problems we saw since ADL, and the reason behind the AVX512 drama.
The scheduler cannot just simply treat them as "Faster and slower cores" when they are inherently different architectures.
And programmes tend to have problems trying to load multi threads into them simultaneously (Except a few programmes worked very hard in optimizing like Cinebench)
So the approach in Intel side of things is usually treat it as "Faster CPU and Slower CPU".
When you need something fast, slap it into P cores and P cores only
When it is a "Minor Task", slap it into E cores and E cores only
However, no one wants to be a "Minor Task", so everyone (programme) requested working in the P cores when they are loaded.
Then the OS forcefully picked what it think is "Minor" and slap it into E-cores.
Thus creating the situation of "P core working, E cores watching" and vice versa.
And sometimes creating compatibility problems when the scheduler loaded multi threads within a single programme into Intel's P&E cores simultaneously.
And it is tediously horrible in virtualization and the root cause why Intel's own SR Xeon CPUs are either "P cores only" or "E cores only" but not both.


So think twice when someone asking "Strix or not Strix ?" when looking at problems introduced by Intel P&E approach.
Those problems are mainly caused by the DIFFERENT architectures, not different speed.
Posted on Reply
#63
lexluthermiester
AssimilatorDo games themselves have to be optimised/aware of P- versus E-cores?
Seems clear that they do. Doesn't take much in the coding department to do so either.
Posted on Reply
#64
usiname
THU31Using the 13600K as an example, 8 E-cores offer ~60% more performance compared to 2 P-cores, while using ~40% more power. That's roughly a ~15% gain in efficiency, which is completely irrelevant on desktop.
That's why it actually makes no sense to have a 6P+8E SKU on desktop, when you could have an 8P+0E SKU offering very similar performance and power consumption.
8e cores take more space than 2P, in the space of 2P you can fit 6.5 or 7 e cores. 7e cores would be still faster than 2p cores in cinebench, but the main problem is tha lack of instructions and there are task which will work faster on the 2P cores. That is the whole problem with the e cores, because the e cores are not always faster.
Posted on Reply
#65
Lokiwoodplan
So is it worth it or not to choose cpu with e-core? If you must choose 13400f or 5700x which is the best choice?

Does e-core can be disable only specifically for one game for example at this atlas game?

Does intel 15th gen will still use e-core?
Posted on Reply
#66
Scrizz
Solaris17Middle of last year I rotated with a team of nothing but SDEs. They were having issues with performance on one of the new services we were spinning up. We were recording remote sessions and encoding them into video to be retrieved later.

They were not understanding why we were burning 192 core AMD systems and still getting poor performance. All of these guys were pretty removed from HW in general. I explained to them that we need to switch to our GPU compute cluster instead of using CPU threads since the GPUs can do HW En/Decode they were legit shocked.

We switched. Saved hundreds of thousands in internal costs took like 2 weeks for them to recode for GPUs. I got a promotion out of it. I rotated off the team not understanding how they made it that far.

Sometimes these guys literally just sit infront of a game engine and check a box "use all available CPU cores" I swear to god. I was on another team about 8 months later. I had to explain to a TAM (thankfully not an engineer) why 10gb/s links on our storage offload system did NOT mean 10 gigaBYTES/s and that the time quotes they were giving were going to be drastically off.

They got paid more than me.

Always shoot for the stars in your careers people. Even if you dont think you can cut it. The sky is already full of some pretty dim ones.
Yeah, it really is shocking how many people in tech and even computer tech(semiconductor/etc) have no idea about HW. Many times at work I facepalm internally. :laugh:
Posted on Reply
#67
W1zzard
persondbThe article doesn't describe if the E-core are disabled or if they used something like Process Affinity to limit the process to only use P-cores. If it's the former, then it's very possibly a ring bus issue where if E-cores are active, the clocks of the ring bus are forced to be considerably lower, thus lowering the performance of the P-cores.
The E-Cores were disabled through BIOS, I'll mention that in the article

Also @Battler624 for the same question
Posted on Reply
#68
Outback Bronze
W1zzardThe E-Cores were disabled through BIOS, I'll mention that in the article
Did you try disabling lots of 4 E-Cores?

Like 8P + 4E, 8P + 8E & 8P + 12E to see what that does or no point?
Posted on Reply
#69
W1zzard
CrackongExcept a few programmes worked very hard in optimizing like Cinebench
Do you have a source for that? Afaik Cinebench just splits the load, without being aware of anything, which is easy to do, especially if you have "just fast and slower cores". P-Cores will create more pixels, E-Cores fewer, but still contribute as much as they can
LokiwoodplanSo is it worth it or not to choose cpu with e-core? If you must choose 13400f or 5700x which is the best choice?

Does e-core can be disable only specifically for one game for example at this atlas game?

Does intel 15th gen will still use e-core?
13400F is slightly faster than 5700X for gaming (when not GPU limited, www.techpowerup.com/review/intel-core-i5-13400f/17.html)

But what you actually want is RPL with the larger cache (13600K in the same chart), it's not about the cores or the mhz, but about the cache.

Which is why 7800X3D is so good: www.techpowerup.com/review/amd-ryzen-7-7800x3d/19.html
Posted on Reply
#70
bug
W1zzardDo you have a source for that? Afaik Cinebench just splits the load, without being aware of anything, which is easy to do, especially if you have "just fast and slower cores". P-Cores will create more pixels, E-Cores fewer, but still contribute as much as they can
There's still work involved into splitting the load into chunks (they're not spinning off one task for each pixel, nor are are they spinning off a task for a whole screen/scene). And the work to wait for all tasks to finish to put a scene back together (synchronization) still exists, even if it's probably simpler than what happens in a game engine.
Though a game engine shouldn't that much different: a faster core can compute the updated path for 12 characters while a slower core will only handle 6, or compute the geometry for 100 objects while the other only computes for 50...

Atlas Fallen has some internal weirdness, I mean, it requires a 6600k for FHD@30fps on low settings.
Posted on Reply
#71
chrcoluk
Not sure if cinebench did any specific optimisations for intel's hybrid CPU's I still agree with w1zzard. That itself behaved wrong until I adjusted the CPU scheduler settings in the power schemes to prefer p cores. (no issue on win 11 due to its out of the box intel thread director).

Just the dev's of this game tried to be over clever.
Posted on Reply
#72
THU31
W1zzardBut what you actually want is RPL with the larger cache (13600K in the same chart), it's not about the cores or the mhz, but about the cache.
MHz is just as important as cache, though. The 13600K has a 24% higher clock speed. You wouldn't get 22% more performance just from the slight cache increase. That's why the 13600K is slightly faster than the 12900K. Its clock speed is 200 MHz higher and it has more L2 cache, but the L3 cache is 20% smaller.
It's also why the 7600 is faster than the 5800X3D on average (there are exceptions in very cache-sensitive games).

But in relation to the original question, a newer 6C/12T CPU is always better than an older 8C/16T for gaming. The 13400 has better IPC than the 5700X, that's why it's faster, not because of the 4 E-cores.

Fewer cores allows you to push the frequency higher, and that has always been the most important thing for gaming.


Fun fact - Destiny 2 has small hitches when you load between different zones while traversing the world. They were usually very noticeable on my 9700K. On my 13600K @ 3.3 GHz they are still there, but much smaller and less frequent. But at 5.1 GHz they never happen at all.
I'm playing at 4K60, which means that even if a faster CPU doesn't increase your framerate, it can still help with other things like stutters and hitches.
I expect the CPU-attached NVMe drive helps as well to some degree. On the 9700K it was going through the chipset.
Posted on Reply
#73
W1zzard
bugThere's still work involved into splitting the load into chunks (they're not spinning off one task for each pixel, nor are are they spinning off a task for a whole screen/scene).
They split it into blocks of pixels, but same thing.. this is trivial, it's just a few lines of code
bugAnd the work to wait for all tasks to finish to put a scene back together (synchronization) still exists, even if it's probably simpler than what happens in a game engine.
There is just one synchronization per run, so one after a few minutes, this isn't even worth calling "synchronization". I doubt that it submits the last chunk onto a faster core, if it's waiting for a slower core to finish that last piece
Posted on Reply
#74
Crackong
W1zzardDo you have a source for that?
No I don't.
Maybe I should re-iterate my sentense.
Except a few programmes worked very hard in optimizing like Cinebench
Except a few programmes which Intel themselves optimized their thread director very heavily on like Cinebench.
Posted on Reply
#75
W1zzard
CrackongExcept a few programmes which Intel themselves optimized their thread scheduler very heavily on like Cinebench.
Source? There's nothing to optimize for, Cinebench simply spawns one worker thread per core and runs work on it non-stop, until done

Edit: Thread Director helps the OS with the decision onto which core to schedule a particular thread
www.intel.com/content/www/us/en/gaming/resources/how-hybrid-design-works.html
Posted on Reply
Add your own comment
Nov 21st, 2024 06:33 EST change timezone

New Forum Posts

Popular Reviews

Controversial News Posts