Hello! I went to register to seek feedback for an issue and troubleshooting I'm doing surrounding it only to find I had an account from a decade ago already. Funny. Anyway...
As the title says, I'm having issues. I've been "doing this" for close to two decades, but I'm never afraid to admit when I need help or might be wrong. And this one is twisting me up. It has what seems to be a likely cause, but I would like to get second opinions because there are things making me second guess myself. This will be a long one and I apologize! I have a summary at the top and I'll try to break it up and format it the best I can.
Summary of Symptoms:
The display will go black, and the PC will restart. The time between the display going black and the restart varies, but there's some consistency depending on what situation led to it. There are no signs of the video drivers crashing, neither in Event Viewer nor in AMD's Adrenalin software itself. The drivers are not uninstalling themselves, nor is the device being lost on the next Windows restart like some other users seem to report when this occurs. In fact the drivers seem solid, but they do tell me "default tuning performance settings have been restored due to an unexpected system failure" after such an issue occurs. There are no BSODs, and no minidumps or memory dumps being created either (yes, my page file is enabled and set to system managed on my system drive, and I have automatic restart on BSOD disabled). Event Viewer does show Event ID 18 however, which seems to be an AMD specific event logged in case of a machine check exception which reads as "a fatal hardware error has occurred". WHEA and WatchDog logs are also being created sometimes. More on the logs and details below. This started when changing my video card.
Hardware:
CPU-Z Link: https://valid.x86.fr/7s64nw
Case: Fractal Arc Midi R2
PSU: EVGA SuperNova 750 G5
CPU: AMD Ryzen 7 5800X3D
CPU Cooling: Be Quiet Dark Rock Pro 4
Motherboard: MSI Mag X570S Tomahawk Max WiFi (BIOS V1.8)
RAM: 64 GB (4x 16 GB) G.Skill Ripjaws V 3,600 MHz 1.35V
GPU: Sapphire Nitro+ Radeon RX 7800 XT
Storage: 2x Western Digital Black SN850X 2TB, 1x Western Digital Black 5 TB HDD, 2x Western Digital Blue 8 TB HDD
OS: Windows 10 Home 22H2 (19045.3693)
Detailed description of issues:
As stated above, the PC display will sometimes go black and then restart to the BIOS. This has occurred under the following conditions.
1. Playing Minecraft Java with shaders. Sometimes it just happens during play, but routinely it's when I press F11, which initiates a change from full screen to windowed, or within seconds of doing that. One time it successfully switched to windowed mode only to fail when attempting to render a Windows Explorer window.
2. Playing League of Legends.
3. Other light games (Aura Kingdom is one).
4. At the immediate start of attempting to do OCCT's "GPU variable" test. Unfortunately, this happened only once and isn't reproducible. I thought it might be the first time it happened, but it wasn't. No other OCCT tests, and no other stress tests period (including Furmark) have failed on me yet.
Most of the restarts happen rather quickly after the screen goes black. I notice the first one, Minecraft in particular, tends to take longer for the restart to occur. Sometimes, it doesn't restart and I have to force the PC off... but I notice an odd thing even about this. My case has a fan speed control with a selection for 5V, 7V, and 12V. I often run these at 7V for noise reasons. The first time I went to force power it off, I accidentally switched the voltage from 7V to 12V due to it being near the power button, and when this was accidentally switched it triggered the restart. I thought it was coincidental... until this "black screen without automatic restart" happened again... so I let it sit to see what would happen, and it never restarted, so I switched the voltage intentionally... and it restarted? Hm.
I'm not sure if that's important to mention or "fluff" but I want to be thorough.
My first step of troubleshooting is "if a new symptom arrived, what change coincided with said symptom" and that change was the graphics card. So that's it, right? That's my suspicion too, but I wanted to rule things out regardless. And I can't help but notice a few things.
In my troubleshooting (summarized below for formatting reasons), I found the issue seems to occur much more often when my system is using stock BIOS settings (read as, JEDEC RAM speeds and voltages) as opposed to my RAM profile speeds. Huh? That's backwards from what I would expect, because the "heavier" RAM settings are the more stable ones. I first noticed the issue in Minecraft around a week after getting the video card. But I ignored it at first, as it coincided with an undervolt attempt on the CPU. So I figured I just didn't win the lottery and couldn't undervolt at all. But it happens at stock. In other words, XMP RAM speeds are unstable, XMP RAM speeds with a CPU undervolt are very unstable, and JEDEC RAM speeds with no CPU undervolt are equally unstable. I hope that makes sense, but the point is... despite the issue occurring with the video card change, I'm noticing a correlation based on platform settings as well. And it's calling my sanity into question on whether it was ever stable, despite never having issues with my previous GTX 1060. I am running what I believe might be a heavy memory configuration (four dual rank DIMMs)... but then why is it less stable at seemingly more tame RAM settings!?
Before I move on to the list of things I've tried, here's a summary of some of the WHEA logs and Watch Dog logs. If the logs themselves would be helpful, please ask.
The Event Viewer always shows this under "Event ID 18".
"A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 1
The details view of this entry contains further information."
The APIC ID, which correlates to the logical CPU that threw the MCE, always differs.
WHEA logs always look like this.
"WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
nt!_WHEA_ERROR_RECORD structure that describes the error condition. Try !errrec Address of the nt!_WHEA_ERROR_RECORD structure to get more details.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: ffff800474797900, Address of the nt!_WHEA_ERROR_RECORD structure.
Arg3: 00000000bea00000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000108, Low order 32-bits of the MCi_STATUS value."
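For anyone who wants to pick apart Arg3/Arg4 themselves, here is a rough Python sketch that joins the two halves into the 64-bit MCi_STATUS value and reads the top status flags plus the low 16-bit error code. The bit positions and code patterns are my assumption based on the generic x86 machine check layout, and the helper names (`combine`, `decode_flags`, `classify_error_code`) are made up for illustration. Vendor specific fields are not decoded, so treat the result as a hint and not a diagnosis.

```python
# Sketch: decode the MCi_STATUS value split across Arg3/Arg4 of bugcheck 0x124.
# Assumption: generic x86 machine-check layout for bits 63..57 and the low
# 16-bit MCA error code. Vendor-specific fields are deliberately ignored.

ARG3_HIGH = 0xBEA00000  # high-order 32 bits, copied from the log above
ARG4_LOW = 0x00000108   # low-order 32 bits, copied from the log above

FLAG_BITS = {
    "VAL": 63,    # the record is valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds valid data
    "PCC": 57,    # processor context may be corrupt
}


def combine(high, low):
    """Join the two 32-bit halves into one 64-bit status value."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def decode_flags(status):
    """Return each named flag bit as True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


def classify_error_code(status):
    """Roughly classify the low 16-bit MCA error code by its bit pattern."""
    code = status & 0xFFFF
    if code & 0xFF00 == 0x0100:   # 0000 0001 RRRR TTLL
        return "memory/cache hierarchy"
    if code & 0xF800 == 0x0800:   # 0000 1PPT RRRR IILL
        return "bus/interconnect"
    if code & 0xFFF0 == 0x0010:   # 0000 0000 0001 TTLL
        return "TLB"
    return "other"


status = combine(ARG3_HIGH, ARG4_LOW)
print(f"MCi_STATUS = {status:#018x}")
for name, is_set in decode_flags(status).items():
    print(f"{name:5s} = {int(is_set)}")
print("error class:", classify_error_code(status))
```

Under those assumptions, the value from my log comes out as an uncorrected error in the memory/cache hierarchy class, which at least agrees with the "Cache Hierarchy Error" text Event Viewer shows.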
And the Watch Dog logs are giving me these.
"VIDEO_TDR_TIMEOUT_DETECTED (117)
The display driver failed to respond in timely fashion.
(This code can never be used for a real BugCheck; it is used to identify live dumps.)
Arguments:
Arg1: ffffaf8baadd7460, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff800540e8670, The pointer into responsible device driver module (e.g. owner tag).
Arg3: 0000000000000000, The secondary driver specific bucketing key.
Arg4: 00000000000005a8, Optional internal context dependent data."
"VIDEO_ENGINE_TIMEOUT_DETECTED (141)
One of the display engines failed to respond in timely fashion.
(This code can never be used for a real BugCheck; it is used to identify live dumps.)
Arguments:
Arg1: ffffda880ec2e010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff806a19b8790, The pointer into responsible device driver module (e.g. owner tag).
Arg3: 0000000000000000, The secondary driver specific bucketing key.
Arg4: 000000000000111c, Optional internal context dependent data."
"VIDEO_MINIPORT_BLACK_SCREEN_LIVEDUMP (1b8)
User initiated miniport black screen live dump.
(This code can never be used for a real BugCheck; it is used to identify live dumps.)
User initiated miniport live dump for black screen scenarios.
Arguments:
Arg1: 0000000000000001, Blackscreen hotkey generated miniport black screen live dump
Arg2: 0000000000000000, Reserved.
Arg3: 0000000000000000, Reserved.
Arg4: 0000000000000000, Reserved."
"VIDEO_DXGKRNL_BLACK_SCREEN_LIVEDUMP (1a8)
User initiated DXGKRNL black screen live dump.
User initiated DXGKRNL live dump for black screen scenarios.
(This code can never be used for a real BugCheck; it is used to identify live dumps.)
Arguments:
Arg1: 0000000000000001, Blackscreen hotkey generated DXGKRNL black screen live dump
Arg2: 0000000000000000, Reserved.
Arg3: 0000000000000000, Reserved.
Arg4: 0000000000000000, Reserved."
These seem very suggestive of the video card and/or drivers? And there's a lot of "black screen to restart" issues with the 7800 XT going on it seems, but the fact they may be happening doesn't necessarily tell me what the cause might be. The issue did show up with the video card change, and the issue only seems to show up when the video card is under use. I can use my PC all day on the internet (even with hardware accelerated browsers or watching video) or load Photoshop and put the CPU and RAM under load. It never crashes... until the video card is used more than modestly.
What I've tried through troubleshooting:
I posted this on the Steam forums originally, so I'll copy that part from there.
1. I've updated the motherboard BIOS. Originally it was V1.5, then V1.7, and now V1.8.
2. Windows 10 is up to date.
3. AMD chipset drivers are up to date. Audio drivers are up to date. Ethernet drivers are up to date. Bluetooth and WiFi drivers are up to date. Etc.
4. I've updated video card drivers as new ones have become available. The issue has persisted on all drivers I've tried, including 23.9.1, 23.9.3, 23.10.1, 23.10.2, and 23.11.1.
5. I've used DDU to uninstall and reinstall the video drivers. Yes, I used safe mode. Yes, I disconnected the internet.
6. I've reset the BIOS who knows how many times.
7. I've disabled XMP, and I've set XMP but scaled back RAM frequency/IF clocks a bit to 3,200 MHz/1,600 MHz respectively. So it doesn't matter whether RAM/IF is set to 2,133 MHz (JEDEC default)/1,066 MHz, 3,200 MHz/1,600 MHz, or 3,600 MHz/1,800 MHz; they all have the issue. This seems to rule out RAM or Infinity Fabric instability?
8. I've run stress tests galore. Windows memory diagnostic (might not be very conclusive on its own but I did it), MemTest86+, Prime 95, BurnInTest, and the majority of the OCCT suite. All passed, with the exception of the "GPU variable" test in OCCT, which immediately caused the crash the first time I attempted it, but then succeeded on subsequent attempts.
9. I've tried connecting the DP cable to both output ports on the video card (mine has two DP and two HDMI instead of three DP and one HDMI).
10. I've tried HDMI.
11. I've adjusted the ASPM setting (PCI Express > Link State Power Management > Off).
12. I've completely reinstalled Windows 10!
13. I've completely, and I mean completely, taken my PC apart down to the individual parts, cleaned it (though it was already rather clean), and reassembled it. This was to rule out a bad connection anywhere. I even swapped RAM around, and the CPU was also reseated.
14. The video card is a Sapphire Nitro+ RX 7800 XT which has a BIOS switch with three positions (one performance BIOS, one silent BIOS, and the other is just a mode that lets you change it on the fly with the Sapphire TriXX software). I've tried both BIOS/all three positions.
15. I've used "Driver Verifier" which is something Windows includes and followed the instructions here to stress test the drivers. This was inconclusive, but not entirely useless. Since the issue doesn't yet have a known reproducible, on demand cause, I have to wait, but this tends to cause it to occur sooner. Unfortunately, Driver Verifier does not catch anything or give me a notice of any violations it detected. Maybe that's because the drivers are fine and the issue isn't drivers but the hardware itself. I'm reading that machine check exceptions are, as a rule, almost always hardware and not software.
16. I've found some people saying they suspect the issues may be the card boosting above where it should at fringe moments. I've tried limiting the boost to 2,429 MHz. Nonetheless, it made no difference.
17. I've tried disabling ULPS.
18. I've tried disabling MPO.
19. I've tried my old 3700X in place of the 5800X3D. It happens on both. I think this can rule out the CPU(s) on a hardware level.
20. I've tried manually disabling PBO in the BIOS instead of leaving it on Auto.
21. Temperatures have been monitored, and nothing seems to get to critical levels (CPU can spike high but it's always below 90C, often below 80C or even 70C, and the GPU is in the 50C or 60C range with hot spot often under 80C and never over 90C). On the contrary, it happens even in mild games with very low temperatures, and I even tested with the side panel off.
None of these troubleshooting steps have resolved the issue.
The sole troubleshooting step that has resolved this is removing my RX 7800 XT and putting the GTX 1060 back in. After confirming that resolved the issue, I then put the RX 7800 XT back in and tried a few more of my steps above (DDU, fresh drivers, limiting boost speeds in Adrenalin) and it's still happening.
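To spell out the RAM/IF pairs from step 7: what I've been calling "MHz" for the RAM is really the DDR transfer rate, and the actual memory clock is half of that. The small Python sketch below assumes the usual 1:1 setup where the Infinity Fabric clock matches the memory clock; the function name `one_to_one_fclk` is just something I made up for illustration.

```python
# Sketch: DDR moves data twice per memory clock, so the memory clock is half
# the advertised transfer rate. Assumption: Infinity Fabric runs 1:1 with the
# memory clock, so FCLK equals that halved value.

def one_to_one_fclk(ddr_rate_mts):
    """Return the memory clock (and 1:1 FCLK) in MHz for a DDR rate in MT/s."""
    return ddr_rate_mts / 2


# The three settings tried in step 7.
for rate in (2133, 3200, 3600):
    print(f"DDR4-{rate}: memory clock / FCLK = {one_to_one_fclk(rate)} MHz")
```

So all three settings I tested keep the RAM and fabric in sync; only the absolute speed changes.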
Conclusion:
I apologize for showing up out of the blue after a decade and dropping such a long post! I wanted to get second opinions to make sure I'm on the right track, or see if I'm oblivious to something someone who is much smarter than me suspects or knows about this.
At this step, I believe I almost have to try an RMA on the GPU? I have this worry that it might not resolve it, but that's merely a feeling. Maybe I need to set this aside and cross that bridge when I get there, so my plan now is to reach out to Sapphire for support.
My CPU (though I think I ruled this out), motherboard, and of course video card are under warranty.
I think my RAM might be, as it has a "lifetime" warranty but it is three and a half years old so maybe it's not.
The PSU is... complicated. It's technically under warranty until early next year, but EVGA gave me a brand new G5 when I RMA'd near the end of the ten year warranty term to replace a G2 I had with a faulty fan (and I found the G5 is a slight downgrade but I don't know, and the G2 isn't made any more), so I wouldn't want to RMA this unless I first tried to at least RMA the part that caused the issue to show up. Honestly I'd probably just buy a new one but I'd only do this if someone was like... convinced it stood a good chance of fixing this. As the issue doesn't seem to be correlated with high draw, I don't think I'm tripping OPP or OCP, but PSUs aren't my specialty.
And as a wild card, I have my prior motherboard, an Asus ROG Strix B550-F Gaming. I want to avoid having to swap that in if at all possible though, due to the level of effort it entails, and because I also had separate issues with it during the time I used it (ask if you want details, but this thread is already so long I'd rather stick to this issue). I RMA'd it when a formal fault was discovered after buying my two M2 SSDs and finding one was not functional at all. Asus was slow to deal with and it cost me $50 to RMA a motherboard being sent one state away... not fun. I didn't even wait on the return and just bought the MSI to replace it (partly to reduce downtime and partly to use PCI Express speeds in the second M2 port). Funny enough, the initial MSI I tried to buy never even succeeded in POSTing. I had to return that to Micro Center and the second one worked. Starting to wonder if I have a deeper issue here? But it was stable until the GPU was swapped. This one is spinning me in circles...
Any help would be greatly appreciated! I'm so desperate I'll gift you a (sanely priced) game on Steam if you can figure this one out for me. I just want it working. Imagine spending $600 to play Minecraft with shaders better and it leads to a nightmare and has you second guessing if the system was ever stable or if it's just a bad new part. It's sooo depressing. Anyway, I'm going to try to RMA the GPU, but if that never even gets that far, or if Sapphire says it's fine, or they return a new one and it also has the issue... then I won't know what to do. I feel like I've exhausted what I can and would be guessing at buying new parts at that point. But it's not worth buying new AM4/DDR4 stuff when I wanted to move to AM5 when the Zen 5 X3D launches, so this would mess that up.