# @Devs What is "GPU Temperature (Hot Spot)" on RX Vega?



## kiffmet (Sep 7, 2017)

I guess the title already says it. I would like to know exactly which sensor that data is being pulled from.


----------



## kiffmet (Sep 7, 2017)

I addressed my question directly to the devs in order to avoid getting distracting answers. I know you meant nothing but good, but to be honest, pointing me to Guru3D didn't help at all, as this is a question only the programmer of GPU-Z can answer. He is the one who picked the sensor in his own code, after all.


----------



## W1zzard (Sep 7, 2017)

It's a sensor inside the GPU silicon. Probably (going by the name) at the location where it gets hottest. That's all I know.

You also asked about the HBM temperature sensor's location in a post that's now deleted due to cleanup. No idea there either. The card gives me a sensor called "HBM temperature"; that's all I know.


----------



## MrGenius (Sep 7, 2017)

He's not the only one who knows what it means. It was implemented by AMD. All he did was make his program show the data it reports. Frankly, as he's not a GPU designer/engineer, I'd be pretty surprised if he knew anything at all about it. It was rather stupid, in my view, for the "who-the-hell-ever-he-is" guy over at AMD to suggest you ask the dev of GPU-Z to explain how/what/where/why AMD designs the temperature sensors on their GPUs. And rather smart of you to reply "But isn't this temperature value something that is provided by the card's firmware?", since it's pretty obvious that's the case.

He just replied as I was typing this. Good to know I wasn't wrong about that.


----------



## kiffmet (Sep 7, 2017)

Thanks for your replies, then. I was asking that particular person on the AMD forums as he's the only staff member I know of who at least answers when mentioned in a post. I guess I'll just nag them until they provide an answer, lol.


----------



## DRDNA (Sep 7, 2017)

MrGenius said:


> He's not the only one who knows what it means. It was implemented by AMD. All he did was make his program show the data it reports. Frankly, as he's not a GPU designer/engineer, I'd be pretty surprised if he knew anything at all about it. It was rather stupid, in my view, for the "who-the-hell-ever-he-is" guy over at AMD to suggest you ask the dev of GPU-Z to explain how/what/where/why AMD designs the temperature sensors on their GPUs. And rather smart of you to reply "But isn't this temperature value something that is provided by the card's firmware?", since it's pretty obvious that's the case.
> 
> He just replied as I was typing this. Good to know I wasn't wrong about that.


I believe W1zzard knows more than you think; I believe he used to work for ATI back in the day.


----------



## MrGenius (Sep 7, 2017)

DRDNA said:


> I believe W1zzard knows more than you think; I believe he used to work for ATI back in the day.


Oh, I know he knows his stuff. And most certainly a lot more than I know. I was just guessing he wouldn't know about this particular feature on Vega. Which, to the best of my knowledge, is entirely new and only found on Vega. And hasn't been mentioned in any documentation (that I've seen). So unless he actually worked on it (which I figured I'd have heard about if he did)... it just didn't seem likely he'd know any more than the rest of us. Which is nothing... yet.


----------



## kiffmet (Sep 7, 2017)

After a bit of digging in AMD's open-source Linux DRM drivers, I found this: https://cgit.freedesktop.org/~agd5f...hwmgr/vega10_thermal.h?h=amd-staging-drm-next . Apparently the VRM temperature can also be read out. There is no information on sensor placement, however, beyond what's already in the sensor's name.
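For anyone who wants to poke at these sensors themselves, here's a minimal sketch that reads whatever temperature channels the amdgpu driver exposes through the standard Linux hwmon sysfs interface. The channel labels ("edge", "junction" for the hot spot, "mem" for HBM) are assumptions based on later kernels; drivers from this era may expose fewer channels, or unlabelled ones.

```python
# Sketch: read the temperature channels that the amdgpu driver exposes
# through the standard Linux hwmon sysfs interface. Channel labels such
# as "edge", "junction" (the hot spot) and "mem" (HBM) are assumptions
# based on later kernels; older drivers may expose unlabelled channels
# or none at all.
from pathlib import Path

def millidegrees_to_c(raw: str) -> float:
    """hwmon reports temperatures in millidegrees Celsius."""
    return int(raw.strip()) / 1000.0

def read_amdgpu_temps() -> dict:
    """Return {label_or_channel: degrees_C} for every amdgpu hwmon device."""
    temps = {}
    for hwmon in Path("/sys/class/hwmon").glob("hwmon*"):
        name_file = hwmon / "name"
        if not name_file.exists() or name_file.read_text().strip() != "amdgpu":
            continue
        for input_file in sorted(hwmon.glob("temp*_input")):
            label_file = hwmon / input_file.name.replace("_input", "_label")
            # Fall back to the raw channel filename if the driver has no label.
            key = (label_file.read_text().strip()
                   if label_file.exists() else input_file.name)
            temps[key] = millidegrees_to_c(input_file.read_text())
    return temps

if __name__ == "__main__":
    for label, celsius in read_amdgpu_temps().items():
        print(f"{label}: {celsius:.1f} C")
```

On a machine without an AMD GPU (or with an older kernel) this simply prints nothing.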


----------



## W1zzard (Sep 8, 2017)

kiffmet said:


> After a bit of digging in AMD's open-source Linux DRM drivers, I found this: https://cgit.freedesktop.org/~agd5f...hwmgr/vega10_thermal.h?h=amd-staging-drm-next . Apparently the VRM temperature can also be read out. There is no information on sensor placement, however, beyond what's already in the sensor's name.


Yup, those sources are a great reference


----------



## W1zzard (Sep 8, 2017)

MrGenius said:


> he wouldn't know about this particular feature on Vega. Which, to the best of my knowledge, is entirely new and only found on Vega.


which feature?


----------



## Filip Georgievski (Sep 8, 2017)

AMD GCN silicon gets hot very quickly, and in my experience with most generations of AMD GPUs, from the 5xxx series all the way up to the RX 4xx and 5xx, max temps go all the way up to 90C without a properly ventilated case and a reasonable room temperature.
Most well-cooled AMD cards rely on ambient temperature, a good case, and added fans to stay at good temps.
Mine doesn't go above 70C with 4 extra case fans and air conditioning keeping the room at a cool 23C in summer; in winter there's no need for air conditioning, just an open window.
TJ max would be 100C in my experience, but optimal temps would be around 65-75C.


----------



## TheoneandonlyMrK (Sep 8, 2017)

W1zzard said:


> It's a sensor inside the GPU silicon. Probably (going by the name) at the location where it gets hottest. That's all I know.
> 
> You also asked about the HBM temperature sensor's location in a post that's now deleted due to cleanup. No idea there either. The card gives me a sensor called "HBM temperature"; that's all I know.


Sounds a lot like something I said, Bossman. Ty.


----------



## MrGenius (Sep 8, 2017)

W1zzard said:


> which feature?


The GPU hot spot temp sensor...I guess? Which may or may not qualify as a "feature". I might have worded that poorly. As well as everything else I've said in this thread. I probably should have just kept my mouth shut. 

I am sort of curious about it though. My question at this point is, is it only found on Vega? I noticed yesterday while using Polaris Bios Editor that there's a "Hotspot Temp (C)" value under POWERTUNE for Polaris 20, Ellesmere, Baffin, and Lexa. Which makes me think there's got to be a sensor for it on those too.


----------



## W1zzard (Sep 8, 2017)

It has been there for a while; it's just being exposed now.


----------



## kiffmet (Oct 20, 2017)

Would it theoretically be possible to add support for reading VR_SOC and VR_MEM temps on Vega with GPU-Z?


----------



## delshay (Mar 27, 2018)

HBM has a built-in thermal sensor, from what I understand from the datasheet. So shouldn't there be two thermal readings, one for each HBM die?


----------



## Sasqui (Mar 27, 2018)

delshay said:


> HBM has a built-in thermal sensor, from what I understand from the datasheet. So shouldn't there be two thermal readings, one for each HBM die?



Sounds like that would be the same as the memory temperature, no?


----------



## Assimilator (Mar 27, 2018)

... how do you expect a random software developer to know how a hardware manufacturer, that they have no relationship with, exposes its hardware's sensor data? We can't smell these things y'know, we're just as dependent as anyone on the hardware company providing documentation on where that sensor data lives in memory, how to access it, and how to interpret it into a number that actually makes sense to an end-user.

I mean, yeah, you _could_ spend hours peeking and poking through various memory locations to guess at this stuff... or you could save yourself a ton of time and effort and just use what the manufacturer provides... I know which one I go with.


----------



## TheoneandonlyMrK (Mar 27, 2018)

Sorry, it's off topic, but I replied to this thread early on, around post two, and my reply is gone. No insult, just info: please find a better path, because editing out my help will stop me helping.
It's out-and-out offensive; I told the OP exactly what it was, even before W1zzard did. I have a Vega and I know that stuff, yet apparently it required a dev to answer.

Seeing the thread here today, I thought it was new, since it's been cut and edited.


----------



## Sasqui (Mar 27, 2018)

Assimilator said:


> or you could save yourself a ton of time and effort and just use what the manufacturer provides



Unless you've seen something different, the only thing I've seen them (AMD) provide is the GPU core temp.  It's up to 3rd parties (e.g. GPU-Z) to read and display other sensor info that's been exposed.

The OP's question was about the meaning of the "hot spot" temp sensor... and it sounds like AMD hasn't given much info on the significance of that value, or where the sensor is located.


----------



## TheoneandonlyMrK (Mar 27, 2018)

Sasqui said:


> Unless you've seen something different, the only thing I've seen them (AMD) provide is the GPU core temp.  It's up to 3rd parties (e.g. GPU-Z) to read and display other sensor info that's been exposed.
> 
> The OP's question was about the meaning of the "hot spot" temp sensor... and it sounds like AMD hasn't given much info on the significance of that value, or where the sensor is located.


The sensor? It's all of the temp sensors: it's the hottest spot. Vega is built on Infinity Fabric, which is a bus and control network that includes sensors, and each chip has its own additional sensors, but the "hot spot", in chip terms, is the hottest spot.
And it's king of the thermal-throttle hill, so to speak, as I previously said, ish.


----------



## Sasqui (Mar 27, 2018)

theoneandonlymrk said:


> And it's king of the thermal-throttle hill, so to speak



Does the hot spot factor into thermal throttling?  It doesn't seem to, if I'm hitting 97C with mine... the GPU throttle is set at 85C, and I think the mem throttle is at 85C also.  Overclocked, my hot spot was 97C, core was 75C, and mem was at 85C... that was at a core speed topping out at 1733 and mem at 1050 (according to GPU-Z). The core was undervolted to 1050 mV for both P6 and P7.


----------



## delshay (Mar 27, 2018)

Sasqui said:


> Sounds like that would be the same as the memory temperature, no?



Have you read the HBM memory PDF, or have I misunderstood something? It doesn't look right, given that each HBM die has its own built-in thermal sensor.

I would expect something like HBM thermal 0 & HBM thermal 1 (for example), but there is just one HBM thermal temperature reading to cover both dies.

What if one of the HBM memory dies was making poor contact? How would you know which one?

HBM1 also has a built-in thermal sensor, if my memory serves me well.

If I am missing or have misunderstood something, can someone please post more technical details?
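One way to check this empirically on a Linux box is to count how many memory-temperature channels the amdgpu driver actually surfaces. This is only a sketch: the "mem" label and the sysfs layout are assumptions based on the kernel's hwmon conventions, and a driver can expose fewer channels than the hardware physically has.

```python
# Sketch: count the memory-temperature channels that the amdgpu driver
# exposes via hwmon. If each HBM stack's sensor were surfaced
# individually, more than one "mem"-labelled channel would show up.
# The "mem" label and sysfs layout are assumptions based on the
# kernel's hwmon conventions; a driver can expose fewer channels than
# the hardware physically has.
from pathlib import Path

def mem_channel_labels(hwmon_root: str = "/sys/class/hwmon") -> list:
    """Return the labels of all memory-related amdgpu temperature channels."""
    labels = []
    for hwmon in Path(hwmon_root).glob("hwmon*"):
        name_file = hwmon / "name"
        if not name_file.exists() or name_file.read_text().strip() != "amdgpu":
            continue
        for label_file in sorted(hwmon.glob("temp*_label")):
            text = label_file.read_text().strip()
            if "mem" in text.lower():
                labels.append(text)
    return labels

if __name__ == "__main__":
    found = mem_channel_labels()
    print(f"memory temperature channels exposed: {len(found)} {found}")
```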


----------



## Sasqui (Mar 27, 2018)

delshay said:


> Have you read the HBM memory PDF, or have I misunderstood something? It doesn't look right, given that each HBM die has its own built-in thermal sensor.



No, I haven't read the data sheets.  Yes, as far as I know, there are two HBM stacks on the GPU package; I assume they have a sensor on only one, or only expose information for one.

On another note, my Vega 64 started throwing out some weird readings:


----------



## delshay (Mar 28, 2018)

Sasqui said:


> No, I haven't read the data sheets.  Yes, as far as I know, there are two HBM stacks on the GPU package; I assume they have a sensor on only one, or only expose information for one.
> 
> On another note, my Vega 64 started throwing out some weird readings:
> 
> View attachment 98874



You can't have a sensor on just one HBM die; both should be connected. You have separate dies, and for safety/monitoring each HBM die has its own thermal features. Take a glance over at the JEDEC PDF docs; that's what I did.

If one HBM die is overheating, how would you know it's happening if the reading is taken from the other? Your CPU has a thermal reading for each core; HBM is no different. The ability to monitor each die is important. The Fiji chip has four HBM stacks, so you should be seeing Thermal 0 to Thermal 3.

You can't just have one thermal read-out for all the HBM dies connected to the main Vega/Fiji die; that's not how things are done.


----------



## Sasqui (Mar 28, 2018)

delshay said:


> You can't have a sensor on just one HBM die; both should be connected. You have separate dies, and for safety/monitoring each HBM die has its own thermal features. Take a glance over at the JEDEC PDF docs; that's what I did.
> 
> If one HBM die is overheating, how would you know it's happening if the reading is taken from the other? Your CPU has a thermal reading for each core; HBM is no different. The ability to monitor each die is important. The Fiji chip has four HBM stacks, so you should be seeing Thermal 0 to Thermal 3.
> 
> You can't just have one thermal read-out for all the HBM dies connected to the main Vega/Fiji die; that's not how things are done.



I don't think you can say for sure if AMD is following JEDEC, and it's possible that the HBM has only one sensor, with the assumption that they are all going to heat up about the same.  Who knows?  Without documentation from AMD, it's all speculation.


----------



## delshay (Mar 28, 2018)

Sasqui said:


> I don't think you can say for sure if AMD is following JEDEC, and it's possible that the HBM has only one sensor, with the assumption that they are all going to heat up about the same.  Who knows?  Without documentation from AMD, it's all speculation.



I find it hard to believe AMD would design something & not fully implement it in Fiji/Vega. Another way is to look at other products that use HBM & check whether the thermal readout for each die is active. Volta is a good example, as it has four stacks of HBM.

Who's to say Fiji/Vega owners' throttling issues aren't related to something you can't see?


----------



## Sasqui (Mar 28, 2018)

delshay said:


> Who's to say Fiji/Vega owners' throttling issues aren't related to something you can't see?



From playing with Wattman for hours, it's clear to me that power limits are the number one throttling factor if you have good cooling.  If you don't have good cooling, then the 85C core OR 85C mem temp limits will hold it back.  I got my core _stable_ up to 1733 with voltage at 1050 mV (default is 1200 mV).  That was with the power limit at its max (50%), and the card was pulling 330W according to GPU-Z.  I could have fried an egg on the back of the card, but it kept going.  Fans were at full blast.  That's when I started to wonder about the mysterious "hot spot" temp sensor.

What's interesting is the GTX 10 series has a hard-wired power limit.  For instance, most 1070 cards can't get over 2100 core, no matter what you do.


----------



## EarthDog (Mar 28, 2018)

Sasqui said:


> What's interesting is the GTX 10 series has a hard-wired power limit. For instance, most 1070 cards can't get over 2100 core, no matter what you do.


But it isn't really due to power limits. Many can't hit 2100 MHz regardless of whether there is power-limit headroom.


----------



## Sasqui (Mar 28, 2018)

EarthDog said:


> But it isn't really due to power limits. Many can't hit 2100 MHz regardless of whether there is power-limit headroom.



It isn't an exact science, obviously.  Some silicon just won't do it, for many reasons, including cooling.  And I should qualify my statement: that's specifically with a 1070 Ti.  There are pencil-mod guides out there to up the power limit too.  But out of the box, 2100 was the ceiling for just about every 1070 Ti review I saw, and for the card I played with.


----------



## EarthDog (Mar 28, 2018)

Correct. I was just saying it isn't the power limit only that is doing it (that's how I understood the post; if I was mistaken, apologies). It seems like it's a silicon thing, plus the lack of ability to add significant voltage, power limits, and temps. Anyway... that Vega......


----------



## Sasqui (Mar 28, 2018)

EarthDog said:


> Correct. I was just saying it isn't the power limit only that is doing it



Yea, assuming good silicon, good voltage/current regulation, and good cooling, the power limit is the final wall on the 1070 Ti PCB (and a lot of other GTX 10 cards, I suspect). They (NVidia) did that to the 1070 Ti so it wouldn't cannibalize 1080 sales, or so I understand.

Back to the mysterious hot spot on Vega... and power limits.  All things equal, when I set the power limit to 25%, the core tops out at 1650+; set it to 50% and it peaks at 1730+... and pulls one hell of a load for a GPU.  Again, that's undervolted to 1050 mV on the core.  If I bump that up to 1100 mV, both of the peak core speeds go down (indicating a power-limit hit).


----------



## delshay (Mar 30, 2018)

Sasqui said:


> From playing with Wattman for hours, it's clear to me that power limits are the number one throttling factor if you have good cooling.  If you don't have good cooling, then the 85C core OR 85C mem temp limits will hold it back.  I got my core _stable_ up to 1733 with voltage at 1050 mV (default is 1200 mV).  That was with the power limit at its max (50%), and the card was pulling 330W according to GPU-Z.  I could have fried an egg on the back of the card, but it kept going.  Fans were at full blast.  That's when I started to wonder about the mysterious "hot spot" temp sensor.
> 
> What's interesting is the GTX 10 series has a hard-wired power limit.  For instance, most 1070 cards can't get over 2100 core, no matter what you do.



Putting power limits to one side, you need to track thermal throttling and understand what's causing it. I own an R9 Nano, which I will claim is the fastest R9 Nano card. It does not have the highest clock speed, but it throttles less than any other R9 Nano.

Keeping the R9 Nano's main VRMs cooler reduces its thermal throttling, but the main VRMs are not the real problem.
At the other end of the card are two minor VRMs, which are rated for 85C. The main VRMs heat the internal baseplate so much that it either trips the VRMs at the other end of the card or trips the two ICs on the other side of the card, directly behind the inductors/main VRMs. The VRMs at the other end of the card are rated at 85C, as are those two ICs. I believe this is what is tripping the card when it overheats.

The short story for Vega cards is that you need to look at the location of the other ICs, especially ICs mounted on the other side of the card directly behind or near the inductors/VRMs, and check their thermal limits in their documentation. I do not own a Vega card, so I can't check the location of any ICs on the back of the card (if any).


----------



## delshay (Apr 4, 2018)

Sasqui said:


> I don't think you can say for sure if AMD is following JEDEC, and it's possible that the HBM has only one sensor, with the assumption that they are all going to heat up about the same.  Who knows?  Without documentation from AMD, it's all speculation.



I believe the hardware integration of the HBM thermal sensors on Fiji & Vega is not at fault. My guess is, & I'm only guessing here, that it's a firmware problem.
I just can't see the hardware group getting this wrong.

This is why I want to see a firmware updater built into AMD's Adrenalin software, so that we get firmware updates direct from the manufacturer.


----------



## Sasqui (Apr 4, 2018)

delshay said:


> I believe the hardware integration of the HBM thermal sensors on Fiji & Vega is not at fault. My guess is, & I'm only guessing here, that it's a firmware problem.
> I just can't see the hardware group getting this wrong.
> 
> This is why I want to see a firmware updater built into AMD's Adrenalin software, so that we get firmware updates direct from the manufacturer.



Well, here's one for you (I don't have a screenshot ATM)... when I overclock past a certain point, GPU-Z shows the HBM temp at *2100 degrees*, but the card is humming along just fine.  That tells me that the reported/exposed temperature sensor is not the same as the one responsible for throttling.  Or there's something else entirely going on.


----------

