Trying to sort out instabilities after my GPU upgrade.

Todestrieb · Dec 31, 2023

I can see the NEW PSU!! argument, and maybe 4080 will be kinder to my PSUs, but it is what it is...

Update: Flipped the PCIe link state setting to no power savings to put more stress on the PSU. Situations that would otherwise cause an instant reboot are now fine.
It is ridiculously hot here for December (I guess ambient is 20C~22C), and in specific scenes VRAM has reached 100C even after I bumped the GPU fan up from ~30% to ~55%. Probably still a fine temperature, but take this into account for the bad part of the following.
Also, due to messy cable management, I fear that I bent the 8-pin cables too hard. (They are not "folded", just mashed quite a bit, to be clear.)

CPU+GPU folding and CPU folding + Port Royal stress test are fine now. Previously these could cause an instant reboot.
FM8 hit two environment disappearing glitches in quick succession after ~30 mins. It is a known issue, so I will let it go. Then there was a lighting glitch after ~30 mins of idling on the race preparation menu (the one with the track already loaded and rendered). This is probably not acknowledged yet, but I have seen rare occurrences reported by NVIDIA users (IIRC on a 4060), so I will also let it go.
In general, glitches and crashes appear much less frequently.

Here is the hard part: FH5 is the only game where I have encountered any driver timeout. Apparently, after all these years, FH5 still doesn't like a lot of things, among them Afterburner + RTSS (removed before starting this thread), potentially Steam's FPS counter, and more. I have some browser games running in the background, the known issue page says the game likes a fresh boot, there are numerous places where it is going to crash, and the Steam version may have a memory leak. So while those issues were nearly nonexistent when I used the 3070, the following bad results can be excused as "AMD drivers LOL" or "Come on devs".
*These tests were done earlier, so the link state setting was on "Moderate Power Savings" for this session. I will retest FH5 in a few days, and everything bad that happened in FH5 can be excused.
Not a fresh boot: crashed after ~45 mins, then a bigger crash in <10 mins (one with a "memory cannot be read" error, which also broke explorer.exe). At this point I noticed one 8-pin was not fully plugged in on the GPU side.
Fresh boot: idled in free roam for 1 hr, monitor turned off (so I will let this one go), environment disappearing glitch. Restarted the game, which survived ~30 mins before a driver timeout.
* Again, for now, I'm going to excuse these bad results as "AMD drivers LOL" or "Come on devs". Whatever the cause, crashes are already much less frequent than before.
It was at this moment that I realized both FM8 and FH5, the two games I play the most, are not the best examples for testing GPU stability. Especially FH5.

If there are other signs that the card is actually bad (most likely another otherwise stable game that I can nearly consistently crash with no excuses) I will try to RMA the card.

I had a feeling that this part is slightly off topic and ruins readability of this thread, so I stuck this part in a spoilerbox.

TimatPSUTest.com said:
Here are my thoughts on this:

For reference the ATX spec uses PS_On, the Nuvoton chip PSON#. I will use PSON# for this post.

This is somewhat of a longshot but this *could* be a PSON# incompatibility between the power supply and motherboard. The symptom is the computer re-boots without a blue screen. If there is an error log it will be kernel power event ID 41.

There were lots of crashes that didn't cause a reboot, but the new PSU largely (if not completely) fixes both the crashes and the reboots. There is enough evidence that my old PSU is not good enough.
I can see the explanations are about instant reboots. There are Kernel-Power event ID 41 entries here for the reboots I had.

I had to look up what PS_ON actually means. Here is what Wikipedia says: the PS-ON signal is a pin on a 20-pin or 24-pin ATX-specified power connector used to turn a personal computer power supply unit on and off. It turns on the power supply when it is switched from high to low, and turns off the power supply when switched from low to high or open-circuited.
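To make that description concrete for myself, here is a minimal Python sketch of the active-low behaviour as Wikipedia describes it. The `Pin` enum, the function names and the sample trace are all hypothetical, made up for illustration. It only models the logic (low = on, high or open = off), not real voltages or timing.

```python
from enum import Enum


class Pin(Enum):
    HIGH = "high"
    LOW = "low"
    OPEN = "open"  # open circuit: behaves like high, per the description above


def psu_is_on(pin: Pin) -> bool:
    """Active-low signal: the supply runs only while the pin is held low."""
    return pin is Pin.LOW


def transitions(samples):
    """Return (index, "on"/"off") for every change in supply state.

    The supply is assumed to start in the off state.
    """
    events = []
    running = False
    for i, pin in enumerate(samples):
        now = psu_is_on(pin)
        if now != running:
            events.append((i, "on" if now else "off"))
            running = now
    return events


# Hypothetical trace: the pin is pulled low, glitches open for one sample,
# then goes low again.
trace = [Pin.HIGH, Pin.LOW, Pin.LOW, Pin.OPEN, Pin.LOW]
print(transitions(trace))
```

In this toy model, a single bad sample on the pin produces an "off" immediately followed by an "on", which is what an instant reboot without a blue screen would look like from the outside.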

To my untrained eyes, the 24-pin ATX cable on the old PSU looks completely fine (no harsh bends, no exposed metal on the cable side), but I once loosened it by accident, which caused a very steep voltage drop and all sorts of BSODs.

TimatPSUTest.com said:
However, there could be some other issues:

The NCT6797D could have a resistor on the motherboard to protect it from the outside world.

The power supply may not meet the ATX Power Supply spec. I have tested several that don’t.

There also is a “ground loop” involved. The NCT6797D chip is pulling to DC common on the motherboard: The power supply supervisor chip is connected to DC common in the power supply and there is a voltage drop in the wires between the two.

I would try to minimize the resistance in the DC common leads between the power supply and the motherboard. You have already checked the connectors to ensure they are plugged in all the way. You may want to look closely at the contacts and what you can see of the crimps. Try to keep this to a minimum as some contacts have a durability rating of less than 100 cycles. (The true Molex Mini Fit Jr contacts will last way more than specified). There is a picture online of damage to the contacts on the 24-pin connector due to testing with a paper clip. (It was fixed by the user by bending the contacts).

If you happen to have a voltmeter and/or an oscilloscope you might want to measure the voltage drop in the DC common between your power supply and your motherboard (close to the NCT6797D chip), and also the voltage on the PSON# pin on the 24-pin connector. I would recommend putting the negative lead of your meter on an unloaded connector from your power supply, such as an unused peripheral connector, and connecting it while the PSU is off. Keep in mind the 5Vsb can take several minutes to go to zero after the power supply is de-energized.

Here are some more tests you may want to try with your power supply disconnected from your system:

If you happen to have a 249-ohm resistor (1% standard value) you could test the PSON# of your power supply. For example, the spec calls for <= 1.6mA at 0.4 volts. This comes out to a resistance of 250 ohms between DC common and PSON# at the end of the 24-pin connector (0.40/250 = 1.6mA), so the voltage should be <= 0.40 volts under this condition. If you have a resistor that is close you can scale the voltage and current.
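The arithmetic in that paragraph is plain Ohm's law, so it can be sketched in a few lines of Python. The function names are hypothetical. The linear scaling for a "close" resistor is an assumption about what scaling means here (it treats the PSU's pull-up as sourcing roughly the same current over this small range). It is not something taken from the spec.

```python
def resistance_for_spec_point(volts: float = 0.40, amps: float = 1.6e-3) -> float:
    """Resistance that draws `amps` at `volts` (Ohm's law: R = V / I)."""
    return volts / amps


def current_through(volts: float, ohms: float) -> float:
    """Current flowing through a resistor from PSON# to DC common (I = V / R)."""
    return volts / ohms


def scaled_voltage_limit(ohms: float, ref_volts: float = 0.40,
                         ref_ohms: float = 250.0) -> float:
    """ASSUMPTION: scale the 0.40 V limit linearly for a nearby resistor value,
    treating the PSU's pull-up as a roughly constant current source."""
    return ref_volts * ohms / ref_ohms


if __name__ == "__main__":
    print(f"spec-point resistance: {resistance_for_spec_point():.0f} ohm")
    print(f"limit with a 249 ohm resistor: {scaled_voltage_limit(249.0):.3f} V")
    print(f"limit with a 270 ohm resistor: {scaled_voltage_limit(270.0):.3f} V")
    # Hypothetical reading: 0.35 V measured across the 249 ohm resistor.
    print(f"implied current: {current_through(0.35, 249.0) * 1e3:.2f} mA")
```

The 270 ohm value and the 0.35 V reading are made-up inputs, only there to show how a nearby resistor and a measurement would be plugged in.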

If you have access to a 1K pot: Connect it in place of the resistor while set to maximum and adjust down until the power supply turns on. The voltage should be >= 0.80 volts. Then adjust the pot until the voltage is 0.40 volts. Then disconnect the pot and measure its resistance. It should be 250 ohms or more. You could also test for a hysteresis of 0.3 volts (see spec).
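A rough sketch of how the two pot readings could be checked against the numbers quoted above (turn-on at >= 0.80 volts, and 250 ohms or more at 0.40 volts). The function name, the dictionary keys and the example readings are hypothetical. The hysteresis check is left out because its details are only referenced, not given, in the post.

```python
def check_pot_test(turn_on_volts: float, pot_ohms_at_0v4: float) -> dict:
    """Compare pot-test readings with the thresholds quoted in the post.

    turn_on_volts:   PSON# voltage at the moment the PSU switched on
    pot_ohms_at_0v4: pot resistance measured after setting PSON# to 0.40 V
    """
    return {
        # The PSU should already recognise 0.80 V (or higher) as "on".
        "turn_on_threshold_ok": turn_on_volts >= 0.80,
        # 250 ohm or more at 0.40 V means the PSU sources 1.6 mA or less.
        "source_current_ok": pot_ohms_at_0v4 >= 250.0,
        # Current the PSU was sourcing at the 0.40 V point, in mA.
        "current_ma": 0.40 / pot_ohms_at_0v4 * 1e3,
    }


if __name__ == "__main__":
    # Made-up readings for illustration only.
    print(check_pot_test(turn_on_volts=1.10, pot_ohms_at_0v4=310.0))
    print(check_pot_test(turn_on_volts=0.60, pot_ohms_at_0v4=180.0))
```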

Now my English reading comprehension / high-school level physics and electronics knowledge failed me. I'm trying to understand what you say here. Also guessing from what you suggested to test, my summary is: there are a few possibilities of what actually happened (or should I say what you think actually happened) in an instant reboot event: (I'm also guessing these are supposed to be in point form.)

- The NCT6797D could have a resistor... -> completely not sure what's happening here, guessing weirdness in voltages.

- The power supply may not meet the ATX Power Supply spec -> Voltages on the PSON# pin from the PSU side are not correct.

- Ground loop -> In a transient event, the voltage drop across the ground wires makes PSON#, as seen by the PSU, sit at a higher voltage than it should, which leads to a power-off event. PSON# then goes back to normal, which leads to a power-on.
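To check my reading of the ground loop point, here is a minimal sketch of the idea. Every number in it is made up (the chip's low output level, the ground wire resistance, the currents), and the 0.80 V limit is just the figure from the quoted post reused as an assumed trip point. It is a toy model, not a measurement.

```python
def pson_seen_by_psu(chip_low_volts: float, ground_ohms: float,
                     return_amps: float) -> float:
    """PSON# voltage relative to the PSU's own DC common.

    The chip pulls PSON# down to the motherboard's ground, but that ground
    sits above the PSU's ground by the drop across the return wires (V = I * R).
    """
    return chip_low_volts + return_amps * ground_ohms


def psu_stays_on(pson_volts: float, low_limit_volts: float = 0.80) -> bool:
    """ASSUMPTION: the PSU treats anything at or below `low_limit_volts` as "on"."""
    return pson_volts <= low_limit_volts


if __name__ == "__main__":
    # Hypothetical values: 0.2 V low level, 20 milliohm of ground resistance.
    for label, amps in (("steady load", 10.0), ("transient spike", 40.0)):
        v = pson_seen_by_psu(chip_low_volts=0.2, ground_ohms=0.02, return_amps=amps)
        print(f"{label}: PSON# = {v:.2f} V, stays on: {psu_stays_on(v)}")
```

With these invented numbers the steady load stays under the limit and the spike goes over it, which matches the off-then-on sequence I described in the last bullet.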

I don't have the equipment to test, my stupid hands would mess up and potentially damage stuff if I tried, and my brother plans to sell the PSU with my 3070 as a bundle offer to close friends, so the truth will be forever hidden. Whatever the expression should be.
But these are very plausible and interesting explanations, and good exercises for learning more about PSUs. Thank you very, very much for taking the time to look into my case.

Assimilator · Dec 31, 2023

At the risk of being called an NVIDIA shill, I would just stick with the GPU that doesn't crash the games that I play most often. It's not an upgrade if you have a worse experience after it.

Todestrieb · Dec 31, 2023

Assimilator said:
At the risk of being called an NVIDIA shill, I would just stick with the GPU that doesn't crash the games that I play most often. It's not an upgrade if you have a worse experience after it.

I know, if there is no easy workaround I will sit down and at least have another long hard thought process.

Toothless · Dec 31, 2023

Todestrieb said:
I know, if there is no easy workaround I will sit down and at least have another long hard thought process.

Download the AMD driver cleanup tool and the latest drivers. Unplug from the internet, run the cleanup from Safe Mode, reboot, install the drivers, reboot. This fixed my latest driver-related issues.

Todestrieb · Sep 30, 2024

Very very late update:
The GPU was defective. Sent to repair. Took a month to come back.
Temps are much worse: GPU temp is roughly the same at ~73C, but some games may push GPU hotspot or VRAM above 110C (very occasionally, but I think I have seen VRAM at 114C). That didn't happen before the repair (both GPU hotspot and VRAM topped out at 102C).
The card arrived with one or two screws a few turns loose, but tightening them didn't change the temps noticeably.

But at least that means whoever handled this has taken my GPU apart and fixed something. And the GPU works correctly and completely fine now.
Whatever performance issues the temps are causing are not noticeable as long as I don't peek at GPU-Z / HWiNFO and the fans don't spin up like crazy. (They didn't spin up hard enough to be annoying during the tests where I pushed hotspot above 110C.)

I have considered redoing the paste and pads, but with my slightly high tendency to screw up, I will leave the card as is for now.

Moral of the story: If a GPU crashes or throws corrupted frames in many games, it's pretty damn obvious that the GPU is faulty.
Excuses: I was on a fatal dose of copium, considering the main reason I bought this GPU was Forza Motorsport 8, which hated the 7900XTX on launch day. This clouded my judgment, as I kept thinking MS hadn't fixed every issue on the 7900XTX. That lasted until I opened another sufficiently heavy game, unrelated to the Forza series and supposedly old enough not to have issues, and the GPU was still crashing frequently. Yes, it took 8 months for me to be in the mood to open such a game. Really.
And there was a time when the old PSU and GPU combination would cause reboots during gaming sessions. (It's a separate issue.)
*buries my head in sand*

Onasi · Sep 30, 2024

Good to have an update and know that at least the GPU… somewhat works? After… 9 months? That’s some necromancy.

Toothless · Sep 30, 2024

Sounds like whoever worked on it knew how to "fix" but not open or close a card. Time to get pads and a 7950 pad to get the poor thing fully working right.

System Name	The New, Improved, Vicious, Stable, Silent Gaming Space Heater
Processor	Ryzen 7 5800X3D
Motherboard	MSI B450 Tomahawk Max
Cooling	be quiet! DRP4 (w/ added SilentWings3 120mm), 4x Noctua A14x25G2 (3 @ front, 1 @ back)
Memory	Mixed DDR4-3600 16GBx2+8GBx2 @18-20-20-20-40
Video Card(s)	PowerColor RX7900XTX HellHound
Storage	ADATA SX8200Pro 1TB, Crucial P3+ 4TB (w/riser, @Gen2x4), Seagate 3+1TB HDD, Micron 5300 7.68TB SATA
Display(s)	ASUS ROG PG27UCDM (4K240Hz QDOLED), Gigabyte M27U (4K160HzIPS)
Case	Phanteks P600S
Audio Device(s)	Creative Katana V2X gaming soundbar
Power Supply	Seasonic Vertex GX-1200 (ATX3.0 compliant)
Mouse	Razer Deathadder V3 wired
Keyboard	Keychron Q6Max w/ Gateron G Pro 3.0 Black linear switches

System Name	Firelance.
Processor	Threadripper 3960X
Motherboard	ROG Strix TRX40-E Gaming
Cooling	IceGem 360 + 6x Arctic Cooling P12
Memory	8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s)	MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage	2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s)	Dell S3221QS(A) (32" 38x21 60Hz) + 2x AOC Q32E2N (32" 25x14 75Hz)
Case	Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply	Fractal Design Ion+ 2 Platinum 760W
Mouse	Logitech G604
Keyboard	Razer Pro Type Ultra
Software	Windows 10 Professional x64

System Name	Veral
Processor	7800x3D
Motherboard	x670e Asus Crosshair Hero
Cooling	Thermalright Phantom Spirit 120 EVO
Memory	2x24 Klevv Cras V RGB
Video Card(s)	Powercolor 7900XTX Red Devil
Storage	Crucial P5 Plus 1TB, Samsung 980 1TB, Teamgroup MP34 4TB
Display(s)	Acer Nitro XZ342CK Pbmiiphx, 2x AOC 2425W, AOC I1601FWUX
Case	Fractal Design Meshify Lite 2
Audio Device(s)	Blue Yeti + SteelSeries Arctis 5 / Samsung HW-T550
Power Supply	Corsair HX850
Mouse	Corsair Harpoon
Keyboard	Corsair K55
VR HMD	HP Reverb G2
Software	Windows 11 Professional
Benchmark Scores	PEBCAK

System Name	The Workhorse
Processor	AMD Ryzen R9 5900X
Motherboard	Gigabyte Aorus B550 Pro
Cooling	CPU - Noctua NH-D15S Case - 3 Noctua NF-A14 PWM at the bottom, 2 Fractal Design 180mm at the front
Memory	GSkill Trident Z 3200CL14
Video Card(s)	NVidia GTX 1070 MSI QuickSilver
Storage	Adata SX8200Pro 1 TB
Display(s)	LG 32GK850G
Case	Fractal Design Torrent (Solid)
Audio Device(s)	Sennheiser HD598, FiiO E-10K DAC/AMP, Samson Meteorite USB Microphone
Power Supply	Corsair RMx850 (2018)
Mouse	Zaopin Z1 Pro on a X-Raypad Equate Plus V2
Keyboard	Cooler Master QuickFire Rapid TKL (Cherry MX Black)
Software	Windows 11 Pro (24H2)