Tuesday, August 13th 2019
110°C Hotspot Temps "Expected and Within Spec", AMD on RX 5700-Series Thermals
AMD this Monday in a blog post demystified the boosting algorithm and thermal management of its new Radeon RX 5700 series "Navi" graphics cards. Custom-design cards from AMD's board partners are only now becoming available; for over a month after the 7th July launch, the cards were sold only in reference design. The thermal management of these cards spooked many early adopters accustomed to seeing temperatures below 85 °C on competing NVIDIA graphics cards, with the Radeon RX 5700 XT posting GPU "hotspot" temperatures well above 100 °C, regularly hitting 110 °C, and sometimes even touching 113 °C with stress-testing applications such as Furmark. In its blog post, AMD stated that 110 °C hotspot temperatures under "typical gaming usage" are "expected and within spec."
AMD also elaborated on what constitutes "GPU Hotspot" aka "junction temperature." Apparently, the "Navi 10" GPU is peppered with an array of temperature sensors spread across the die at different physical locations. The maximum temperature reported by any of those sensors becomes the Hotspot. In that sense, Hotspot isn't a fixed location in the GPU. Legacy "GPU temperature" measurements on past generations of AMD GPUs relied on a thermal diode at a fixed location on the GPU die which AMD predicted would become the hottest under load. Over the generations, and starting with "Polaris" and "Vega," AMD leaned toward an approach of picking the hottest temperature value from a network of diodes spread across the GPU, and reporting it as the Hotspot.
On Hotspot, AMD writes: "Paired with this array of sensors is the ability to identify the 'hotspot' across the GPU die. Instead of setting a conservative, 'worst case' throttling temperature for the entire die, the Radeon RX 5700 series GPUs will continue to opportunistically and aggressively ramp clocks until any one of the many available sensors hits the 'hotspot' or 'Junction' temperature of 110 degrees Celsius. Operating at up to 110C Junction Temperature during typical gaming usage is expected and within spec. This enables the Radeon RX 5700 series GPUs to offer much higher performance and clocks out of the box, while maintaining acoustic and reliability targets."
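To make the hotspot definition concrete, here is a minimal Python sketch of the idea as the blog post describes it: the reported value is simply the maximum over all die sensors, and clocks may keep ramping until that maximum reaches the junction limit. The function names and the sample readings are hypothetical and chosen for illustration only; this is not AMD's firmware logic.

```python
from typing import Sequence

# Junction limit quoted in AMD's blog post (degrees Celsius).
JUNCTION_LIMIT_C = 110.0


def hotspot_temp(sensor_temps_c: Sequence[float]) -> float:
    """Return the hotspot: the highest reading across all die sensors."""
    if not sensor_temps_c:
        raise ValueError("need at least one sensor reading")
    return max(sensor_temps_c)


def may_raise_clocks(sensor_temps_c: Sequence[float],
                     limit_c: float = JUNCTION_LIMIT_C) -> bool:
    """True while every sensor, and therefore the hotspot, is below the limit."""
    return hotspot_temp(sensor_temps_c) < limit_c


# Hypothetical readings from four sensors at different die locations.
readings = [78.0, 84.5, 109.0, 91.0]
print(hotspot_temp(readings))      # hottest sensor wins, wherever it sits
print(may_raise_clocks(readings))  # still below the 110 C junction limit
```

Because the maximum can come from a different sensor at each sample, the hotspot has no fixed location, which is why it cannot be compared directly with a legacy single-diode "GPU temperature" reading.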
AMD also commented on the significantly increased granularity of clock-speeds that improves the GPU's power-management. The company transitioned from fixed DPM states to a highly fine-grained clock-speed management system that takes into account load, temperatures, and power to push out the highest possible clock-speeds for each component. "Starting with the AMD Radeon VII, and further optimized and refined with the Radeon RX 5700 series GPUs, AMD has implemented a much more granular 'fine grain DPM' mechanism vs. the fixed, discrete DPM states on previous Radeon RX GPUs. Instead of the small number of fixed DPM states, the Radeon RX 5700 series GPU have hundreds of Vf 'states' between the bookends of the idle clock and the theoretical 'Fmax' frequency defined for each GPU SKU. This more granular and responsive approach to managing GPU Vf states is further paired with a more sophisticated Adaptive Voltage Frequency Scaling (AVFS) architecture on the Radeon RX 5700 series GPUs," the blog post reads.
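The difference between a handful of fixed DPM states and hundreds of Vf states can be sketched in Python as follows. All frequencies, the number of steps, and the even spacing between states are assumptions made for illustration; AMD has not published the actual state table, and real Vf states pair a voltage with each frequency.

```python
from typing import Sequence


def fixed_dpm_clock(target_mhz: float, states_mhz: Sequence[float]) -> float:
    """Pick the highest discrete DPM state that does not exceed the target."""
    eligible = [s for s in states_mhz if s <= target_mhz]
    return max(eligible) if eligible else min(states_mhz)


def fine_grain_clock(target_mhz: float, idle_mhz: float,
                     fmax_mhz: float, steps: int) -> float:
    """Snap the target down to one of `steps` evenly spaced states
    between the idle clock and Fmax (even spacing is an assumption)."""
    target = min(max(target_mhz, idle_mhz), fmax_mhz)
    step = (fmax_mhz - idle_mhz) / (steps - 1)
    index = int((target - idle_mhz) // step)
    return idle_mhz + index * step


# Hypothetical numbers: the power/thermal budget allows about 1837 MHz.
legacy_states = [300, 600, 900, 1200, 1500, 1800, 2100]
print(fixed_dpm_clock(1837, legacy_states))      # falls back to 1800
print(fine_grain_clock(1837, 300, 2100, 361))    # lands on 1835.0
```

With coarse states, the clock drops to the next fixed level and leaves part of the available budget unused; with finer steps, the chosen clock sits much closer to what load, temperature, and power allow at that moment.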
Source:
AMD
159 Comments on 110°C Hotspot Temps "Expected and Within Spec", AMD on RX 5700-Series Thermals
Grow up
And yes, AIB cards are not horrible, if you care to read back I just about repeated that every other post. That is the whole god damn point. :roll:
Point is AMD uses a different set of sensors and ways to measure temperatures, this can't directly translate into "Nvidia cards run cooler" nor does it mean that this must make them better products. That's the memo.
Simple case of connected dots here... if you feel confident this 110C is a guarantee for longevity, power to you. I don't.
I might be a stubborn idiot but this is clear as day, sorry.
A card with a more than decent cooler that still reports these "hella scary" temperatures. It's not a guarantee for anything because I don't have a bloody clue what that 110C figure is supposed to tell me. I am trying really hard to understand how is it that you people are so convinced that these numbers have some negative implication when in reality you have absolutely no reference point. You simply insist to believe AMD is doing something wrong with no proof.
The Sapphire Pulse model is an astonishing 2% faster than reference; all this talk about how crappy AMD's cooler and temperatures are would have led me to believe things would be a lot different.
At the same time this only confirms my idea that AMD pushed Navi out of the box right up into the danger zone and slapped a blower on top for good measure. It's OC'd out of the box, practically, without a cooler to match. Ah my shining beacon of wisdom and clarity. Thank you.
I'm waiting for HDMI 2.1 cards that come out and don't run at 100C :) I guess I don't play games very often and only recently upgraded from an i7 3930K from 8 years ago. We all choose to spend our money in different ways. I'm not a big eat out / fast food kinda guy, I'd rather buy the T-bone for $12 and cook it myself than pay $120 for it cooked already.
The fact is all the recent reviews show that the Sapphire Pulse barely outperforms the Reference Card.
As for the overclock results, the Reference Card's GPU actually overclocked better than the Sapphire Pulse on W1zzard's sample.
Let me remind you the official "game clock" is 1755 MHz, so the claim that a card running below 1900 MHz is throttling is just not true.
How do you explain this?
www.techpowerup.com/review/sapphire-radeon-rx-5700-xt-pulse/34.html
It is not just TPU reviews, even GN's reviews show that the non-reference card performs almost the same as the reference design.
So it takes more than just "cooler card must be better, hotter card must be running out of spec and losing performance" to prove it.
It is all speculation and GN's own opinion on what is too hot, while even his own data cannot prove the Reference card is losing significant clock speed or performance.
A device has a max rated limit. This is the max it can take before IMMEDIATE damage occurs. Long term damage does not play by the same rule. Whenever you are dealing with a physical product, you NEVER push it to 100% limit constantly and expect it to last. This applies to air conditioners, jacks, trucks, computers, tables, fans, anything you use on a daily basis. Like I said, my car can do 155 MPH. But if I were to push it that fast constantly, every day, the car wouldnt last very long before experiencing mechanical issues, because it isnt designed to SUSTAIN that speed.
Every time the GPU heats up and cools down, the solder connectors experience expansion and contraction. Over time, this can result in the solder connections cracking internally, resulting in a card that does not work properly. The greater the temperature variance, the faster this occurs. This is why many GPUs now shut the fans off under 50C, because cooling it all the way down to 30C increases the variance the GPU experiences.
What AMD is doing here is allowing the GPU to run at max tjunct temp for extended periods of time and calling this acceptable. Given the GPU also THROTTLES at this temp, AMD is admitting it designed a GPU that can't run at full speed during typical gaming workloads. Given AMD also releases GPUs that can be tweaked to both run faster and consume less voltage rather reliably, it would seem a LOT of us know better than RTG engineers.
Would you care to explain how AMD's silicon is magically no longer affected by physical expansion and contraction from temperatures? I'd love to hear about this new technology.
We are going circles because you are trying really, really hard to dismiss evidence that you don't like. As I said above the Sapphire Pulse model is a mere 2% faster than reference, this argument is stupid. The reference model runs fine during typical gaming workloads, speed wise.
Navi shows one of the smallest gaps between reference and AIB models in the last few generations that we've seen. How the hell does that work if AMD made a shitty GPU that can't run at full speed due to thermal throttling, when the AIB models eliminate this possibility?
It's times like these that common sense gets you places. Try it someday. Calling the argument stupid because you cannot quantify things is not usually a good idea.
Fact is older GPUs do not have this feature at all and all of them ran fine and did not die prematurely because of it.
Also starting and stopping the fans more often than otherwise is actually slightly detrimental to the life span of the fans.
For motors the ideal condition is actually to run them at a steady state.
This is the same reason why you don't want to start and stop your HDD motor too often.
And even if that were the case, it's not just the temperature delta that matters; the frequency of these deltas is what may really have an effect on the material. And thankfully, GPUs usually run at high constant temperatures for extended periods of time and idle at a low constant temperature for the rest of the time.
Given that these temps are reached on the ref card, and that today we see AIBs drop card temperatures by a good 25+ degrees, could we find another reason to get offended? Like lack of CrossFire or something? He literally chewed it for you, let me repeat the relevant part, perhaps you'd get it on the second go: had thermals been a problem, the gap between AIB and ref cards would be much bigger than the 3-5% that we see now (especially taking into account the much lower temps on AIBs).
If it is a problem, then these cards will start failing and people will complain about it. If we subscribe to the bathtub model of component failure, there should be a large percentage of the total failures for a product early on, due to defective cards or if this heat is really a problem, so it shouldn't take too long to tell if the GPU is immolating itself. It's not like every 5700 will last for 3 years 1 month and then burn up after the warranty is through. If the heat is a problem, we'll hear about it soon and people will still be under warranty.
Also, this line is a bit of a head-scratcher:
"the relevant part, perhaps you'd get it in second go: had thermals been a problem, gap between AIB and ref cards would be much bigger than 3-5%"
Actually... not having headroom while still having lower temps is a clear sign the card is clocked straight to the limit out of the box, and this also echoes in the GN review. @TheinsanegamerN worded it nicely, ref design is like a car running at top speed full in the red zone all the time, and considering that normal is a rather weird approach. The GN review also handily points out memory ICs are also a hair below running out of spec. Now, imagine what happens with a bit of dust, wear and tear over time - or in fact, in most use cases outside the review bench. The throttling will get worse, and that peak temp won't be lower for it.
Check the clock speeds page and compare the two: the frequency on the reference card is all over the place once it starts to reach 91C, and as I said above there are cases where some of them, in warm environments, even shut down.
AMD cheaped out on their cooler, that is a fact, even knowing about the thermal density issue... and now they come with the "oh it's fine".
They did the same in the CPU department
It's all about profits with these corporations.
We are living in a time when truth has been so diminished in value that even those at the top are quite comfortable with truth being whatever they can convince people to believe.
I need to complete a woodworking project, for there to even be a place for a PC with monitor (my current something is hooked to a TV and that's not the way I'd like to play games).
Besides, AIBs are not really available yet. Clearly nothing, but who cares about ref cards anyway. Actually, talk was about thermal design and horrors that nvidia GPU owners feel, for some reason, for 5700 XT ref GPU owners.
Now that we've covered that, NV's 2070s (I didn't check others) AIBs aren't great OCers either, diff between Ref and AIB performance is also similar between brands.
But these GPUs aren't mainstream. To be mainstream, they have to offer more than just performance/price ratio. There's so much to improve in thermals, efficiency and stability. In marketing and support as well.
Nvidia's cards are so much more attractive, because Nvidia sells a polished, complete product. AMD sells a DIY project.
This becomes obvious when you look at what some of AMD's custom GPU clients can achieve. Apple, Sony, Microsoft and soon Samsung - they're offering AMD's chips in a much easier to digest form.
Of course AMD could make more robust products. They could do better pre-launch testing, improve compatibility and drivers. And work on relations with partners to deliver AIB cards and OEM systems on day of launch (like Nvidia and Intel do). But that would raise costs and - at least for now - AMD wants to remain the cheaper alternative. It's a conscious decision.
First of all: is this your intuition or are there some publications to support this hypothesis? :-)
Second: you seem a bit confused. The passive cooling does not increase the number of times the fan starts. The fan is not switching on and off during gaming.
If the game applies a lot of load, the fan will be on during the whole session. Otherwise the fan is off.
So the number of starts and stops is roughly the same. It's just that your fan starts during boot and mine during game launch. So I don't have to listen to it when I'm not gaming (90% of the time).
In fact it actually decreases the number of starts for those of us who don't play games every day.
Tbf all cards use crap thermal compound/pads, why? Cheap in bulk.
getting from 0 to 100mph is where you’re going to be doing the most ‘damage’ - if you do it in a quarter mile, you’re really stressing the car, but if you take 20 miles to get to that speed, your wear and tear is much less, due to less torque. Once you get to that speed, it doesn’t much matter if you’re driving a muscle car or a Prius, as long as the overdrive gear is set up to sip fuel (or pull juice from the battery) just enough to overcome 100mph drag.