Monday, October 8th 2018

AMD Introduces Dynamic Local Mode for Threadripper: up to 47% Performance Gain

AMD has published a blog post describing an upcoming feature for their Threadripper processors called "Dynamic Local Mode", which should significantly improve gaming performance on AMD's latest flagship CPUs.
The Threadripper WX-series processors use four dies in a multi-chip package, of which only two have a direct access path to the memory modules. The other two dies have to rely on the Infinity Fabric for all their memory accesses, which comes with a significant latency penalty. Many compute-heavy applications run their workloads largely out of the CPU cache, or need only very little memory access; these are unaffected. Other applications, especially games, spread their workload over multiple cores, some of which end up with higher memory latency than expected, which results in suboptimal performance.

The concept of multiple processors having different memory access paths is called NUMA (non-uniform memory access). While it is technically possible for software to detect the NUMA configuration and pin each thread to the ideal processor core (see the sketch below), most applications are not NUMA-aware, and adoption is very slow, probably due to the small number of systems using such a design.
On Threadripper, using Ryzen Master, users are free to switch between "Local Memory Access" and "Distributed Memory Access" modes, the latter being the default for Threadripper, as it yields the highest performance in compute applications. Local Mode, on the other hand, is better suited to games, but switching between the modes requires a reboot, which is very inconvenient for users.
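To make the idea concrete, here is a minimal sketch of NUMA-aware placement, written in Python for Linux (the sysfs path is the kernel's standard interface, the choice of node 0 is purely illustrative, and a Windows program would use Win32 APIs such as GetNumaNodeProcessorMaskEx and SetProcessAffinityMask instead):

```python
# Minimal sketch (Linux-only) of NUMA-aware placement: read the node -> CPU
# mapping from sysfs and pin the current process to the cores of one node,
# so that its memory traffic stays local. Node 0 is chosen purely as an example.
import os

def cpus_of_node(node: int) -> set:
    """Parse a cpulist such as '0-7,16-23' from sysfs into a set of CPU IDs."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

if __name__ == "__main__":
    local_cpus = cpus_of_node(0)         # a node with a populated memory controller
    os.sched_setaffinity(0, local_cpus)  # pid 0 = the calling process
    print("Pinned to CPUs:", sorted(os.sched_getaffinity(0)))
```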

AMD's new "Dynamic Local Mode" seeks to remove that requirement by introducing a background process that continually monitors the CPU usage of all running applications and pushes the busiest ones onto the cores that have direct memory access, by adjusting their process affinity mask, which determines which processors an application is allowed to be scheduled on. Applications that need very little CPU time are in turn pushed onto the cores without direct memory access, because fast execution matters less for them (a rough sketch of the idea follows below).
This update will be available in Ryzen Master starting October 29 and will be enabled automatically unless the user manually chooses to disable it. AMD also plans to open the feature up to even more users by including Dynamic Local Mode as a default package in the AMD Chipset Drivers.
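The mechanism described above can be pictured with a rough sketch, assuming a hypothetical split of logical-processor IDs into "direct" and "indirect" groups and using the third-party psutil library; this illustrates the affinity-steering idea only and is not AMD's actual implementation:

```python
# Rough illustration of the affinity-steering idea (NOT AMD's implementation):
# sample per-process CPU usage with psutil, then pin busy processes to the
# cores that have direct memory access and idle ones to the rest. The core ID
# lists below are hypothetical; the real mapping depends on the CPU model.
import time
import psutil

DIRECT_CORES = list(range(0, 32))     # hypothetical: logical CPUs on dies with local memory
INDIRECT_CORES = list(range(32, 64))  # hypothetical: logical CPUs that reach memory over Infinity Fabric
BUSY_THRESHOLD = 25.0                 # % CPU above which a process counts as "demanding"

def steer_once(sample_seconds=2.0):
    procs = []
    for p in psutil.process_iter():
        try:
            p.cpu_percent(None)           # prime the counter; the first call returns 0.0
            procs.append(p)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

    time.sleep(sample_seconds)            # sampling window

    for p in procs:
        try:
            busy = p.cpu_percent(None) >= BUSY_THRESHOLD
            target = DIRECT_CORES if busy else INDIRECT_CORES
            if p.cpu_affinity() != target:
                p.cpu_affinity(target)    # rewrite the process affinity mask
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

if __name__ == "__main__":
    while True:                           # run as a simple background loop
        steer_once()
```

According to AMD's description, the real service works at thread rather than process granularity and derives the core ordering from the actual die topology, but the principle is the same: sample load, then rewrite affinity so the hottest work lands on the memory-attached dies.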
Source: AMD Blog Post

86 Comments on AMD Introduces Dynamic Local Mode for Threadripper: up to 47% Performance Gain

#51
Salty_sandwich
Just a few questions:
Why would anyone buy a CPU with more than 8 cores if they're just going to use the system for gaming?
I thought an 8-core CPU was the sweet spot in CPUs right now for gaming? (probs due to consoles?)
Do any games use or support more than an 8-core CPU?

I don't know, so I'm asking.
Posted on Reply
#52
R-T-B
eidairaman1You have been here since 2006 and never filled out your specs and hardly post, yet you would on a Topic relating to AMD. Hmmm
I mean, he's also correct. There is no way that'd be permissible on WHQL.

Rate the point, not the person.
Posted on Reply
#53
R0H1T
Salty_sandwichJust a few questions:
Why would anyone buy a CPU with more than 8 cores if they're just going to use the system for gaming?
I thought an 8-core CPU was the sweet spot in CPUs right now? (probs due to consoles?)
Do any games use or support more than an 8-core CPU?

I don't know, so I'm asking.
No one's gonna buy these "just" for gaming, but it can be a good pastime. That's why an easy fix is nice, though it could've been achieved just as easily using other (popular) tools.
Posted on Reply
#55
R-T-B
qubitOk great, so when you talk about crippled dies, do you mean disabled dies to make a lower end processor? If so, why is that a bad thing? It just means that they can still sell lower end products through binning.
He means the fact that two dies do not have a direct connection to the memory controller and must talk via another CCX, by proxy.

Honestly, it IS an odd design. But it's not useless.
Posted on Reply
#56
qubit
Overclocked quantum bit
efikkan(facepalm)
No, not at all. Where do you get this from? Two dies on 2970WX/2990WX have to go through the Infinity Fabric to access memory, which causes significant latency. Many workloads are latency sensitive, and this only gets worse when using multiple applications at once. AMD could have made Threadripper without these limitations, but perhaps not on this socket.
I thought this was a tech forum…
Do you really know enough about the Threadripper and EPYC designs to know why they designed them the way they did? You'd have to know them pretty well to make a criticism like that and I suspect you don't. I certainly don't know enough about it.

If I really wanted to answer that question, I'd read up on all the articles and official AMD docs that I could get my hands on to really understand it. I'm sure the answer is there, but it would take some time and effort. I'm not willing to, though, because the issue isn't important enough for me. Again, do you really know the ins and outs of these designs well enough to criticise AMD for their design choices?

If you say yes, then I'll expect you to back that up with hard evidence before I consider your argument credible.
R-T-BHe means the fact that two dies do not have a direct connection to the memory controller and must talk via another CCX, by proxy.

Honestly, it IS an odd design. But it's not useless.
Thanks for the clarification rtb. Only saw your post after I'd posted. :)
Posted on Reply
#57
R-T-B
Yes, he's correct. It has to do with some limitations of Threadripper's socket. They are already hitting its CCX limit. The only answer is to bump the per-CCX core count.
Posted on Reply
#58
W1zzard
qubitDo you really know enough about the Threadripper and EPYC designs to know why they designed them the way they did?
TR 2nd gen is designed like that for socket compatibility and to avoid user complexity of 8-channel memory
Posted on Reply
#59
qubit
Overclocked quantum bit
W1zzardTR 2nd gen is designed like that for socket compatibility and to avoid user complexity of 8-channel memory
There, I knew there was a good reason. I figured that AMD somehow aren't so stupid. Thanks W1z.
Posted on Reply
#60
efikkan
qubitDo you really know enough about the Threadripper and EPYC designs to know why they designed them the way they did? You'd have to know them pretty well to make a criticism like that and I suspect you don't. I certainly don't know enough about it.
You can have all the opinions you want, I only care about the facts.

Those familiar with the core layout should know how these CPUs work, and the sacrifices AMD have made by leaving two dies without direct access to memory.


Due to these limitations, the 2990WX (32-core) goes from performing very well in some tasks to performing badly, sometimes even worse than the 2950X (16-core), if the task is not ideal. One example: the 2950X (16-core) performs as expected and scales fairly well, while the 2990WX (32-core) is really a mixed bag.

The sad thing is that the 2990WX (32-core) would have been a much better product if the dies had either been balanced with one memory channel each, or given the full 8 channels.
Posted on Reply
#61
qubit
Overclocked quantum bit
efikkanYou can have all the opinions you want, I only care about the facts.

Those familiar with the core layout should know how these CPUs work, and the sacrifices AMD have made by leaving two dies without direct access to memory.


Due to these limitations, the 2990WX (32-core) goes from performing very well in some tasks to performing badly, sometimes even worse than the 2950X (16-core), if the task is not ideal. One example: the 2950X (16-core) performs as expected and scales fairly well, while the 2990WX (32-core) is really a mixed bag.

The sad thing is that the 2990WX (32-core) would have been a much better product if the dies had either been balanced with one memory channel each, or given the full 8 channels.
Ok, it's a limitation, but have you seen W1zzard's reply to me a couple of posts up? That should clarify for you why these limitations exist.
Can you finally see that it's not a flawed design as you put it, but one that's working within cost and compatibility constraints?
Posted on Reply
#62
efikkan
qubitOk, it's a limitation, but have you seen W1zzard's reply to me a couple of posts up? That should clarify for you why these limitations exist.

Can you finally see that it's not a flawed design as you put it, but one that's working within cost and compatibility constraints?
So you're down to semantics?
When you have a 2990WX (32-core), which is basically a double 2950X (16-core), and it performs as expected in a wide range of controlled benchmarks but suddenly performs worse than the 16-core due to severe limitations in the core configuration, any engineer would call that a design flaw. It's not a bug, but a fundamental mistake in the design, an oversight because they didn't foresee this type of configuration early enough. It ruins what would otherwise have been a much better product.
Posted on Reply
#63
Captain_Tom
efikkanSo you're down to semantics?
When you have a 2990WX (32-core), which is basically a double 2950X (16-core), and it performs as expected in a wide range of controlled benchmarks but suddenly performs worse than the 16-core due to severe limitations in the core configuration, any engineer would call that a design flaw. It's not a bug, but a fundamental mistake in the design, an oversight because they didn't foresee this type of configuration early enough. It ruins what would otherwise have been a much better product.
That's like calling Vega's gaming performance a "design flaw" because it doesn't game as well as it renders. But it's not a flaw, it's a premeditated sacrifice they knew ahead of release. Get over yourself lol.

Oh, and I am an engineer - that's not a design flaw. It costs $1800, not $5,000+.
Posted on Reply
#64
Octavean
eidairaman1Send AMD an email to the developers of Ryzen Master; they would know
Sure,.....

No argument there. However, this report here at TechPowerUp could have made this a bit clearer. I'm still not 100% sure, but this other reference seemed to make it a bit clearer:
The more surprising announcement comes in the form of a new software feature for the Threadripper WX-series processors called "Dynamic Local Mode" which aims to address some of the performance issues caused by the non-traditional memory structure of these processors, where not all CPU cores have direct access to a memory controller.

According to the blog post on AMD's website, Dynamic Local Mode will run as a Windows 10 service and measure how much CPU time each thread is utilizing.

This service will then begin to reallocate these demanding threads to the CPU cores which have direct memory access until it runs out of available cores. In that case, the service will start to assign threads to the remaining cores.

This dynamic operation ensures, for applications that aren't consuming all 48/64 threads on the WX-series processors, that direct memory access will be available when needed. In particular, this should provide an advantage to gaming, which typically takes up fewer than eight cores but is dependent on fast memory access.
www.pcper.com/news/Processors/AMD-Announces-Threadripper-2970WX-2920X-Availability-New-Dynamic-Local-Mode-Feature
Posted on Reply
#65
qubit
Overclocked quantum bit
efikkanSo you're down to semantics?
When you have a 2990WX (32-core), which is basically a double 2950X (16-core), and it performs as expected in a wide range of controlled benchmarks but suddenly performs worse than the 16-core due to severe limitations in the core configuration, any engineer would call that a design flaw. It's not a bug, but a fundamental mistake in the design, an oversight because they didn't foresee this type of configuration early enough. It ruins what would otherwise have been a much better product.
I see my clear and reasonable explanation didn't help. Oh well, I tried.
Captain_TomThat's like calling Vega's gaming performance a "design flaw" because it doesn't game as well as it renders. But it's not a flaw, it's a premeditated sacrifice they knew ahead of release. Get over yourself lol.

Oh, and I am an engineer - that's not a design flaw. It costs $1800, not $5,000+.
^^This.

Funny how AMD has a much more expensive solution without these limitations, innit? ;)
Posted on Reply
#66
HTC
Does this mean Windows performance is now more in line with Linux performance in applications?

For example, 7-Zip.
Posted on Reply
#67
cucker tarlson
That +47% really means a huge improvement in multi-threaded games. The other results, around 10%, are in games that prioritize single-core performance, like FC5; 10% is still neat.
Posted on Reply
#68
R0H1T
HTCDoes this mean Windows performance is now more in line with Linux performance in applications?

For example, 7-Zip.
No; inherently, the Windows scheduler is pretty bad compared to Linux's, especially for high-core-count CPUs.
Posted on Reply
#69
HTC
R0H1TNo; inherently, the Windows scheduler is pretty bad compared to Linux's, especially for high-core-count CPUs.
And doesn't this software help mitigate / eliminate that?
Posted on Reply
#70
R0H1T
HTCAnd doesn't this software help mitigate / eliminate that?
Somewhat, but by the looks of it this seems to be an inferior version of Process Lasso. I can't be 100% certain, but that's what I gauged from this ~
Dynamic Local Mode will run as a Windows 10 service and measure how much CPU time each thread is utilizing.

This service will then begin to reallocate these demanding threads to the CPU cores which have direct memory access until it runs out of available cores. In that case, the service will start to assign threads to the remaining cores.

This dynamic operation ensures, for applications that aren't consuming all 48/64 threads on the WX-series processors, that direct memory access will be available when needed. In particular, this should provide an advantage to gaming, which typically takes up fewer than eight cores but is dependent on fast memory access.
According to the blog post on AMD's website

I think in hindsight AMD might've preferred to wait for Zen 2/3 for their 32-core TR monster, but I'm sure they had their reasons & it's not like TR2 is bad.
Posted on Reply
#71
John Naylor
cucker tarlsonThat +47% really means a huge improvement in multi-threaded games. The other results, around 10%, are in games that prioritize single-core performance, like FC5; 10% is still neat.
I'm anxious to see what "+47% improvement" actually means when averaged over TPU's gaming test suite. I expect single digits. Not that single digits are nothing... all improvements welcome. OTOH, I'm still in the "it ain't real till we see those results" camp.
Posted on Reply
#72
R0H1T
John NaylorI'm anxious to see what "+47% improvement" actually means when averaged over TPU's gaming test suite. I expect single digits. Not that single digits are nothing... all improvements welcome. OTOH, I'm still in the "it ain't real till we see those results" camp.
It means +47% over regular 2990WX numbers, without dynamic local mode enabled, tested internally at AMD.
If you're looking at different websites, like TPU, then they'll give different numbers depending on the combination of hardware & software used, including the OS (e.g. Linux).
Posted on Reply
#73
HTC
R0H1TIt means up to 47% over regular 2990WX numbers, without dynamic local mode enabled, tested internally at AMD.
Slight correction there: for all we know, only one application sees this kind of boost, but it wouldn't surprise me if the uplift were ... say ... 15% to 20% in general applications that are hit by the latency problems associated with cores not having direct memory access.
Posted on Reply
#74
R0H1T
HTCSlight correction there: for all we know, only one application sees this kind of boost, but it wouldn't surprise me if the uplift were ... say ... 15% to 20% in general applications.
Yeah, the regular qualifier "up to" applies here too, although 47% may not be the upper limit on, say, Linux. Generally speaking, there aren't too many applications that can make (good) use of all 32 cores & not be hampered by 4-channel memory.
Posted on Reply
#75
HTC
R0H1TYeah, the regular qualifier "up to" applies here too, although 47% may not be the upper limit on, say, Linux.
Unknown @ this point, since this software seems to be Windows-specific. Were a version of it made for Linux, perhaps ... but I wouldn't bet on it, because of how much better optimized the Linux scheduler is vs. Windows': not many gains to be had, I'm guessing.
Posted on Reply