Thursday, August 10th 2023

Atlas Fallen Optimization Fail: Gain 50% Additional Performance by Turning off the E-cores

Action RPG "Atlas Fallen" joins a long line of RPGs this Summer for you to grind into—Baldur's Gate 3, Diablo 4, and Starfield. We've been testing the game for our GPU performance article, and found something interesting—the game isn't optimized for Intel Hybrid processors, such as the Core i9-13900K "Raptor Lake" in our bench. The game scales across all CPU cores—which is normally a good thing—until we realized that it saturates not only all eight P-cores, but also all 16 E-cores. It ends up with under 80 FPS in busy gameplay at 1080p with a GeForce RTX 4090. Performance is "restored" only when the E-cores are disabled.

Normally, when a game saturates all of the E-cores, we don't interpret it as the game being "aware" of E-cores, but rather "unaware" of them. An ideal Hybrid-aware game should saturate the P-cores with its main workload, and use the E-cores for errands such as processing the audio stack (DSPs from the game), the network stack (the game's own multiplayer network component), physics, in-flight decompression of assets from the disk, etc., which show up in Task Manager as intermittent, irregular load. "Atlas Fallen" appears to be using the E-cores for its main worker threads, and this imposes a performance penalty, as we found out by disabling the E-cores. The penalty arises because the E-cores run at lower clock speeds than the P-cores, have much lower IPC, and are cache-starved. Frame data being processed on the P-cores ends up waiting for results from the E-cores, which drags the overall framerate down.
In the Task Manager screenshot above, the game is running in the foreground; we set Task Manager to "always on top" so that it doesn't take focus away from the game and cause Thread Director to interfere. Thread Director prefers to allocate the P-cores to foreground tasks, but that isn't happening here, because the developers chose to specifically put work on the E-cores.
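For developers wondering what "Hybrid-aware" looks like in practice, here is a minimal sketch (our illustration, not anything from Atlas Fallen or the Fledge engine) that uses the Windows 10+ CPU-set API to find the highest-performance cores and restrict a worker thread to them. The WorkerMain function is hypothetical and error handling is omitted:

#include <windows.h>
#include <utility>
#include <vector>

// Enumerate CPU sets, keep the ones with the highest EfficiencyClass
// (the P-cores on a hybrid CPU), and return their CPU set IDs.
static std::vector<ULONG> PCoreCpuSetIds()
{
    ULONG len = 0;
    GetSystemCpuSetInformation(nullptr, 0, &len, GetCurrentProcess(), 0);
    std::vector<BYTE> buf(len);
    GetSystemCpuSetInformation(reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data()),
                               len, &len, GetCurrentProcess(), 0);

    BYTE maxClass = 0;
    std::vector<std::pair<ULONG, BYTE>> sets;   // (CPU set ID, efficiency class)
    for (BYTE *p = buf.data(); p < buf.data() + len; )
    {
        auto *e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(p);
        if (e->Type == CpuSetInformation)
        {
            sets.emplace_back(e->CpuSet.Id, e->CpuSet.EfficiencyClass);
            if (e->CpuSet.EfficiencyClass > maxClass)
                maxClass = e->CpuSet.EfficiencyClass;
        }
        p += e->Size;                           // entries are variable-size
    }

    std::vector<ULONG> ids;                     // higher class = higher performance
    for (const auto &s : sets)
        if (s.second == maxClass)
            ids.push_back(s.first);
    return ids;
}

// Hypothetical main-workload thread: restrict it to P-cores before heavy work.
DWORD WINAPI WorkerMain(LPVOID)
{
    const std::vector<ULONG> pCores = PCoreCpuSetIds();
    SetThreadSelectedCpuSets(GetCurrentThread(), pCores.data(),
                             static_cast<ULONG>(pCores.size()));
    // ... per-frame game work stays on the P-cores from here on ...
    return 0;
}

On a non-hybrid CPU every core reports the same efficiency class, so the same code simply selects all of them.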

For comparison we took four screenshots, with E-Cores enabled and disabled (through BIOS). We picked a "typical average" scene instead of a worst case, which is why the FPS numbers are a bit higher. As you can see, framerates with the E-Cores enabled are pretty low (136 / 152 FPS), whereas turning off the E-Cores instantly increases performance right up to the engine's internal FPS cap (187 / 197 FPS).

With the E-cores disabled, the game is confined to what is essentially an 8-core/16-thread processor with just P-cores, which boost well above the 5.00 GHz mark, and have the full 36 MB slab of L3 cache to themselves. The framerate now shoots up to 200 FPS, which is a hard framerate limit set by the developer. Our RTX 4090 should be capable of higher framerates, and developers Deck13 Interactive should consider raising it, given that monitor refresh-rates are on the rise, and it's fairly easy to find a 240 Hz or 360 Hz monitor in the high-end segment. The game is based on the Fledge engine, and supports both DirectX 12 and Vulkan APIs. We used GeForce 536.99 WHQL in our testing. Be sure to check out our full performance review of Atlas Fallen later today.

120 Comments on Atlas Fallen Optimization Fail: Gain 50% Additional Performance by Turning off the E-cores

#26
Squared
bugIt definitely has a different kind of workload. But it still doesn't make sense to reduce the overall computing power available and see the performance go up.
It does if each thread only does a minuscule amount of work before having to communicate with the other threads. If I tell eight threads to add 1+1, then send the results to one thread, it'll take longer than just having one thread add 1+1 eight times. And if I tell an E-core to calculate pi to a million digits and send the last digit to a P-core so that it can add that digit to 1, then it'll take way, way longer than just having the P-core do all the work. (And normally the programmer can't say which core will run a thread; Windows and Thread Director decide that.)

So for parallel programming, you only spawn new threads when you have large-ish chunks of work for each thread; otherwise you risk taking longer than just using one thread.
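To make that concrete, here is a toy C++ sketch (ours, not from the game) that times eight trivial additions done on one thread against the same work split across eight freshly spawned threads; the thread-creation and join overhead alone dominates:

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

volatile int sink;   // prevents the compiler from optimizing the "work" away

int main()
{
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    for (int i = 0; i < 8; ++i)          // one thread adds 1+1 eight times
        sink = 1 + 1;
    auto single = clk::now() - t0;

    auto t1 = clk::now();
    std::vector<std::thread> pool;
    for (int i = 0; i < 8; ++i)          // eight threads each add 1+1 once
        pool.emplace_back([] { sink = 1 + 1; });
    for (auto &t : pool)
        t.join();
    auto threaded = clk::now() - t1;

    std::printf("one thread: %lld ns, eight threads: %lld ns\n",
        static_cast<long long>(std::chrono::duration_cast<std::chrono::nanoseconds>(single).count()),
        static_cast<long long>(std::chrono::duration_cast<std::chrono::nanoseconds>(threaded).count()));
}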
Posted on Reply
#27
AnotherReader
atomekbig.LITTLE architecture in desktop CPUs is completely retarded idea by Intel. It came as a response to ARM efficiency which is (and will be) out of reach for Intel or x86 arch in general. It makes some sense in laptops, to designate E cores for background tasks and save battery, but it is not something that can help catch-up with ARM in terms of efficiency (this is impossible due to architecture limitations of CISC - and if you respond that "Intel is RISC internally" - it doesn't matter. The problem is with non-fixed length instruction which makes optimisation of branch predictor miserable). Funny part is that AMD without P/E is way more efficient than Intel (but 5-7 times less efficient than ARM, especially Apple implementation of this ISA)
Those architectural penalties only apply to simpler, in-order designs like the first Atom. With large out-of-order processors, the x86 penalty is irrelevant as the costs of implementing a large out-of-order core far outweigh the complexity of the x86 decoders.
Posted on Reply
#28
720p low
The question I have regarding this story is, did no one involved in the development, playtesting or quality assurance phase of this game use an Intel CPU with P&E cores? Really?
Posted on Reply
#29
Squared
For the record, Intel's E-cores are not power-efficient, they're area-efficient (cheap). I suspect this is at least partially true of most ARM E-cores as well. It just sounds better to tell the customer that some cores are "efficient" instead of "cheap." But the reality is that cheap cores mean more cores, which is also why AMD uses chiplets. (AMD's "cloud" cores are power-efficient, so certainly I think some E-cores are power-efficient. But I think the first goal was area-efficiency.)

Apple's ARM processors are efficient, but probably more because of TSMC's N5 node than because of ARM or E-cores.
Posted on Reply
#30
atomek
AnotherReaderThose architectural penalties only apply to simpler, in-order designs like the first Atom. With large out-of-order processors, the x86 penalty is irrelevant as the costs of implementing a large out-of-order core far outweigh the complexity of the x86 decoders.
There is a huge penalty. Apple Silicon is a large out-of-order processor, and this is where the ARM architecture shines - it has a 630+ entry ROB (AMD: 256), which allows incredibly high instruction-level parallelism that will never be in reach for x86. High ILP also means those instructions need to be executed in parallel, and the back-end execution engines are extremely wide as well. Intel was horrified after the launch of the M1 and had to "come up with something", so they forced themselves into big.LITTLE - which is pointless, but maybe prolongs the death of x86 a little in the laptop market, and is ridiculous in desktop CPUs.
For the record, Intel's E-core are not power-efficient, they're area-efficient (cheap).
Yes, I'd even say they are power-throttled cores whose aim is to take background tasks and save battery life on laptops. They are not more "efficient" in any way, they are just less performant.
Apple's ARM processors are efficient, but probably more because of TSMC's N5 node than because of ARM or E-cores.
Not really; if you try to estimate how many process nodes it would take to match the efficiency of Apple Silicon, it is at least a 5-node gap.
Posted on Reply
#31
Frick
Fishfaced Nincompoop
W1zzardpossibly also "lack of QA testing"
QA is done by consumers, so it's fine.
Posted on Reply
#32
mb194dc
QA Testing, that's for the users in the first few months after launch?

Pretty sure Microsoft started doing that and it's pretty much standard now.
Posted on Reply
#33
JoeTheDestroyer
bugEven so, the behavior is still strange. I mean, Cinebench also puts load on all cores, but still runs faster when also employing the E-cores. There's something fishy in that code, beyond the sloppy scheduling.
Notice how Cinebench divides its workload into hundreds of chunks, then parcels those out to each thread as needed. A P-core will finish several of these chunks in the time an E-core finishes one, but they all contribute as much as possible to getting the whole task done.

Now imagine what would happen if there were far fewer chunks, specifically exactly as many chunks as there are threads. Each P-core would rapidly finish its single chunk then sit waiting while the E-cores finish theirs.

This is very likely the problem with this game: the developers naively assumed each thread was equally capable and thus divided the work exactly equally between them. Thus, while they gain some performance from the parallelization, they lose much more by not utilizing the (far more performant) P-cores to their fullest.
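As an illustration of the two scheduling styles (our sketch with hypothetical names like renderChunk, not the game's actual code), a shared atomic counter lets the fast P-cores keep pulling small chunks, while a naive fixed split leaves them idle once their slice is done:

#include <atomic>
#include <thread>
#include <vector>

constexpr int kChunks = 512;                 // many small, independent chunks
std::atomic<int> nextChunk{0};

void renderChunk(int /*chunk*/) { /* ... do a small piece of work ... */ }

// Cinebench-style: every thread keeps grabbing the next free chunk, so a P-core
// naturally processes several chunks in the time an E-core processes one.
void dynamicWorker()
{
    for (int c = nextChunk.fetch_add(1); c < kChunks; c = nextChunk.fetch_add(1))
        renderChunk(c);
}

// Naive split: each thread gets an equal slice up front; P-cores finish early
// and then idle until the E-cores work through their identical slice.
void staticWorker(int threadIndex, int threadCount)
{
    const int perThread = kChunks / threadCount;
    for (int c = threadIndex * perThread; c < (threadIndex + 1) * perThread; ++c)
        renderChunk(c);
}

int main()
{
    const unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(dynamicWorker);    // swap in: pool.emplace_back(staticWorker, i, n);
    for (auto &t : pool)
        t.join();
}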
Posted on Reply
#34
AnotherReader
atomekThere is huge penalty, Apple Silicon is large ooo processor, and this is where ARM architecture shines - it has 630+ deep ROB (AMD 256), this allow to achieve incredible high instruction-level parallelism, which will be never in reach for X86. Having high ILP also means that these instructions need to be executed in parallel, and here we also back-end execution engines feature extremely wide capabilities. Intel was horrified after launch of M1, they had to "came up with something". So they forced themselves into little.BIG - which is pointless but maybe prolongs death of X86 a little bit in laptop market but is ridiculous in desktop CPUs.



Yes, I'd say even they are power throttled cores with aim to take background tasks and save battery live on laptops. They are not more "efficient" in any way, they are just less performant.




Not really, if you try to estimate how much processes it will take to match efficiency of Apple Silicon, it is at least 5 node gap.
You're comparing a CPU that clocks at 3.5 GHz to one that clocks close to 6 GHz. Obviously, the lower clocked CPU would be able to afford bigger structures due to relaxed timings. Zen 4c proves that there's nothing magical about ARM. With some changes in physical design, Zen 4c achieves the same IPC as Zen 4 while being half the size. Apple's designs are very impressive, but that is a testament to Apple's CPU design teams. Note that no other ARM designs come close. You're also mistaken about the sizes of the various out-of-order structures in recent x86 processors.



                  Zen 4   Zen 3   Golden Cove   Comments
Reorder Buffer    320     256     512           Each entry on Zen 4 can hold 4 NOPs. Actual capacity confirmed using a mix of instructions.


This table is from part 1 of the Chips and Cheese overview of Zen 4. Notice that Golden Cove, despite the handicaps of higher clock speed and an inferior process, has a ROB size that is much closer to Apple's M2 than Zen 4's.
Posted on Reply
#37
AnotherReader
SquaredFor the record, Intel's E-core are not power-efficient, they're area-efficient (cheap). I suspect this is at least partially true of most ARM E-cores as well. It just sounds better to the customer if you tell the customer that some cores are "efficient" instead of "cheap". But the reality is that cheap cores mean more cores, which is also why AMD uses chiplets. (AMD's "cloud" cores are power-efficient, so certainly I think some E-cores are power-efficient. But I think the first goal was area-efficiency.)

Apple's ARM processors are efficient, but probably more because of TSMC's N5 node than because of ARM or E-cores.
Apple's E cores are much better than ARM's little cores.

Image is edited from the one in the Anandtech article linked above. Notice that the A15's Blizzard E cores are 5 times faster than the A55 in the gcc subtest but consume only 58% more power, making them 3.25x more efficient in terms of performance per Watt. Even the A14's E cores, which consume almost the same power as the A55 in this subtest are 3.75 times faster.
Posted on Reply
#38
atomek
AnotherReaderYou're comparing a CPU that clocks at 3.5 GHz to one that clocks close to 6 GHz. Obviously, the lower clocked CPU would be able to afford bigger structures due to relaxed timings.
The higher clocks of Intel's CPUs only show that they are reaching a performance wall; they can't scale with architecture, only with clocks and node shrinks. In the last 7 years Intel's single-core performance improved 30%, while for Apple it was 200%. x86 is not scaling anymore, and clock speed alone doesn't matter. We now have a situation where, for the same computing power, Intel takes 3x more energy than ARM (arstechnica.com/gadgets/2022/03/mac-studio-review-a-nearly-perfect-workhorse-mac/3/). To put that in perspective, one node shrink gives about 15-20% power reduction, so even if we're optimistic, it will take 6 node shrinks for Intel to catch up with M1 efficiency.
Posted on Reply
#39
AnotherReader
atomekHigher clocks of Intel CPU only shows they are reaching performance wall, they can't scale with architecture, only with clocks and node shrinks. In last 7 years single core performance of Intel improved 30%, for Apple it was 200%. X86 is not scaling anymore, clock speed alone doesn't matter. Now we have situation where for the same computing power Intel takes 3x more energy than ARM (arstechnica.com/gadgets/2022/03/mac-studio-review-a-nearly-perfect-workhorse-mac/3/) . This is 3 times more - to put things in perspective, one node shrink gives about 15-20% of power reduction - so if we would be optimistic, it will take 6 node shrinks for Intel to catch-up with M1 efficiency.
A node shrink gives more than that: it's typically a 30 to 40% power reduction at the same performance. Intel's designs are pushed to stupid clocks for single-threaded bragging rights and would be much more efficient if clocked at more reasonable levels. We also have the example of AMD's laptop silicon, and that's much closer to the M2 or M1 than you might realize.
Posted on Reply
#40
cvaldes
Another PC gaming title that didn't go through QA.

Pity.
Posted on Reply
#41
R0H1T
So Strix or not to Strix that is the question o_O
AnotherReader
Image is edited from the one in the Anandtech article linked above. Notice that the A15's Blizzard E cores are 5 times faster than the A55 in the gcc subtest but consume only 58% more power, making them 3.25x more efficient in terms of performance per Watt. Even the A14's E cores, which consume almost the same power as the A55 in this subtest are 3.75 times faster.
The results are not validated, so it could be even better.
Posted on Reply
#42
AnotherReader
R0H1TSo Strix or not to Strix that is the question o_O


The results are not validated, so it could be even better.
I think Anandtech's testing makes sense. Rather than spending time tweaking the compiler to get the highest score for each CPU, they choose reasonable common options, and see how the CPUs fare.
Posted on Reply
#43
R0H1T
Generally speaking, even if AMD/Intel match or blow past Apple's efficiency in a few years, Apple will still have a massive advantage from controlling the entire ecosystem, software and hardware, and to a smaller extent from the major upside of using such a wide LPDDR5 bus. Which is to say that, when you take the whole picture into account, I don't believe x86 can win on the consumer front, at least in the short to medium term.
Posted on Reply
#44
AnotherReader
R0H1TGenerally speaking even if AMD/Intel match or blow past Apple's efficiency in a few years Apple will still have a massive advantage with them controlling the entire ecosystem from Software, hardware & to a smaller extent having the major upside of using such a wide LPDDR5 bus. Which is to say that I don't believe when you take the whole picture into account x86 can win on the consumer front, at least short to medium term.
Yes, Apple gets to design the entire system, and they are focused on efficiency which is the right metric for mobile use cases. I expect AMD or Intel to come close, but Apple will continue to lead. It also helps that they move to new nodes before AMD, and Intel is still well behind TSMC.
Posted on Reply
#45
W1zzard
bugCinebench
Cinebench calculates something that's extremely easy to parallelize, because the pixels rendered don't depend on each other. So you can just run a piece of code on one CPU and it's guaranteed that you never have to wait for a result from another core. Cinebench also has a tiny working set that fits into the cache of all modern CPUs, so you're even guaranteed that you don't have to wait on data from DRAM.

That's exactly why certain companies like to use it to show how awesome their product is, because it basically scales infinitely and doesn't rely on the memory subsystem or anything else.

Gaming is pretty much the opposite. To calculate a single frame you need geometry, rendering, physics, sound, AI, world properties and more, and they are all (usually) synchronized and have to wait on each other, for every single frame. Put the slowest of these workloads on the E-Cores and everything else has to wait. This waiting will not show up in Task Manager on the waiting cores, because the game is doing a busy wait to reduce latency, at the cost of not freeing up the CPU core to do something else.
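A minimal sketch of that busy-wait pattern (our own illustration, not Fledge engine code): the render thread spins on an atomic flag until a physics job signals completion, so the waiting core reads as 100% load in Task Manager even though it is doing no useful work:

#include <atomic>
#include <thread>

std::atomic<bool> physicsDone{false};

void physicsJob()                      // if this lands on an E-core, it finishes late
{
    // ... simulate the world for this frame ...
    physicsDone.store(true, std::memory_order_release);
}

void renderThread()
{
    while (!physicsDone.load(std::memory_order_acquire))
        ;                              // busy wait: low latency, but the core looks "busy"
    // ... build and submit the frame ...
}

int main()
{
    std::thread p(physicsJob), r(renderThread);
    p.join();
    r.join();
}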
Posted on Reply
#46
Garrus
phanbueyI can count on one hand how many times i've heard or experienced anything like this since Alder lake. As W1z wrote above, the developers needed to do nothing... they did something silly and we got this.
Actually Alder Lake was a disaster for months after launch. I had lots of problems. It was all fixed up, but E cores are still causing some issues.

I switched to the Ryzen 7800X3D and couldn't be happier.
Posted on Reply
#47
persondb
AssimilatorDo games themselves have to be optimised/aware of P- versus E-cores? I was under the impression that Intel Thread Director + the Win11 scheduling was sufficient for this, but I guess if there's a bug in either of those components it would also manifest in this regard.
The article doesn't say whether the E-cores were disabled or whether something like process affinity was used to limit the process to the P-cores. If it's the former, then it could very well be a ring bus issue: when the E-cores are active, the ring bus is forced to clock considerably lower, which also lowers the performance of the P-cores.
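For reference, the affinity route would look something like this minimal sketch (ours; it assumes the first 16 logical processors are the eight hyper-threaded P-cores, which is typically how they are enumerated on a Core i9-13900K):

#include <windows.h>
#include <cstdio>

int main()
{
    // Assumption: logical processors 0-15 are the P-cores (8 cores x 2 threads),
    // so the E-cores would be processors 16-31 on a 13900K-like layout.
    const DWORD_PTR pCoreMask = 0xFFFF;   // bits 0..15 set
    if (!SetProcessAffinityMask(GetCurrentProcess(), pCoreMask))
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
    // Unlike the BIOS route, the E-cores stay powered and the ring bus still has
    // to accommodate them, which is exactly the distinction raised above.
}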
Posted on Reply
#48
phanbuey
GarrusActually Alder Lake was a disaster for months after launch. I had lots of problems. It was all fixed up, but E cores are still causing some issues.

I switched to the Ryzen 7800X3D and couldn't be happier.
I got the 12600K as soon as it was available -- what problems did you have?

I heard of some Denuvo issues but didn't experience them myself. I was honestly expecting problems, but didn't have any.

I was thinking of going 7950X3D (I need the cores for VM data schlepping) but it honestly didn't feel worth it; the 7800X3D is awesome but would be a downgrade for work and only a small upgrade for gaming at 4K :/

I do have the itch to build another personal AMD rig at some point.
R0H1TGenerally speaking even if AMD/Intel match or blow past Apple's efficiency in a few years Apple will still have a massive advantage with them controlling the entire ecosystem from Software, hardware & to a smaller extent having the major upside of using such a wide LPDDR5 bus. Which is to say that I don't believe when you take the whole picture into account x86 can win on the consumer front, at least short to medium term.
Apple's problem is the software. It's too expensive for cloud stuff, not enterprise-friendly enough for corporate stuff, and doesn't really do any gaming. Even if they have amazing hardware (which they've had for years), there are only so many hipsters at Starbucks who code frontend / do marketing and gfx work, and I think they all already use Apple.

If they opened up to run games and were better for enterprise (better AD/MDM/MDS integration, and better compatibility with microsoft apps) they would crush it on the consumer side.
Posted on Reply
#49
chrcoluk
W1zzardGames don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
That was my thought too, thanks for confirming. Since I adjusted my Windows 10 scheduler to prefer P-cores I haven't seen a single game use my E-cores, so I suspected what you have confirmed: these devs did something different.
Posted on Reply
#50
Squared
phanbueyIf they opened up to run games and were better for enterprise (better AD/MDM/MDS integration, and better compatibility with microsoft apps) they would crush it on the consumer side.
Or if Apple made Windows and Linux drivers.
Posted on Reply