Freezing in games with new Radeon GPU, MCE-WHEA CPU Bus Error

Caring1 · Aug 5, 2024

Bytorane said:
So after one more try with Fortnite, and a WHEA crash. No crashes in Apex Legends (30 min max). I was able to borrow a PSU, a Seasonic Vertex ...something.. I think 850W, that was bought new 6 months ago.

And it's a major difference in reported voltages. Before I switched the PSU, I made a short log with HWiNFO so I'll compare some of those other voltages for CPU and Mobo which I didn't pay attention that much yet, as I still have no deep familiarity what is nominal, auto OC, or extreme, etc. That's the next learning curve.

View attachment 357553

View attachment 357554

The first picture shows that the PSU is failing to supply the required voltages and needs to be replaced.

Bytorane · Aug 5, 2024

tabascosauz said:
I will say, though, that if your WHEAs are

consistently happening, and
always WHEA 19 Bus/Interconnect, never with WHEA 18 Cache Hierarchy in the mix,

then the problem is coming from uncore components on your CPU, and there's not much else to say, i.e. either some part of the Infinity Fabric interconnect or the memory controller. I guess it could be the fault of the board, but that's a remote chance: it's not WHEA 18, which can often be due to the board running a bad voltage curve on the CPU.

WHEA 19 is pretty much always an issue of mismatched voltages (VSOC, VDDGs, VDDP(?), PLL(?)) for what you are asking of the memory controller/IF, or your RAM config is just asking too much of your specific CPU sample's memory controller/IF. Proven bad CPUs (that just have weak Fabric and throw WHEA 19s at relatively low mem/Fabric speeds) stay bad when slotted into different boards.

I only had one auto-reboot where a detailed WHEA event was generated by "WHEA-Logger". Otherwise it was mostly freeze-BSODs with dumps, for which only the standard Kernel-Power unexpected-shutdown and bugcheck events are created, and those don't mention WHEA. There are also more WHEA-Logger entries in the Operational log, of the "info" type, but they carry vague information that doesn't help. I haven't figured out what those are about, or whether more WHEA stuff is going on in the background.

I've analyzed every dump and I'm fairly sure I never saw anything other than BUS/Interconnect being mentioned. But WinDBG doesn't report "WHEA 19". Where did you get that from, the Event presumably?

Zach_01 · Aug 5, 2024

Bytorane said:
But WinDBG doesn't report "WHEA 19". Where did you get that from, the Event presumably?

Exactly. For your convenience, you can create a custom view specific to WHEA, which makes the events easier to find.
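For anyone who prefers the command line over a custom view, here is a rough sketch of the same filter. It only builds a `wevtutil` query for the System log; the helper name `whea_query_command` is made up for illustration, and it assumes the events are logged under the usual provider name `Microsoft-Windows-WHEA-Logger`. It is not necessarily the same filter as the custom view shown in the attachment.

```python
# Sketch: build (but do not run) a wevtutil command that lists the newest
# WHEA-Logger events from the System log. The helper name is hypothetical.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"  # assumed provider name


def whea_query_command(max_events=20):
    """Return an argument list suitable for subprocess.run() on Windows."""
    xpath = f"*[System[Provider[@Name='{WHEA_PROVIDER}']]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # XPath filter on the provider name
        f"/c:{max_events}",          # limit the number of events returned
        "/rd:true",                  # newest events first
        "/f:text",                   # human-readable output
    ]


print(whea_query_command(5))
```

On Windows the returned list can be passed to `subprocess.run(...)` from an elevated prompt; the Error Type still has to be read from each event's details.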

tabascosauz · Aug 5, 2024

Bytorane said:
I only had one auto-reboot where a detailed WHEA event was generated by "WHEA-Logger". Otherwise it was mostly freeze-BSODs with dumps, for which only the standard Kernel-Power unexpected-shutdown and bugcheck events are created, and those don't mention WHEA. There are also more WHEA-Logger entries in the Operational log, of the "info" type, but they carry vague information that doesn't help. I haven't figured out what those are about, or whether more WHEA stuff is going on in the background.

I've analyzed every dump and I'm fairly sure I never saw anything other than BUS/Interconnect being mentioned. But WinDBG doesn't report "WHEA 19". Where did you get that from, the Event presumably?

Sorry, memory getting rusty. Seems like Event ID refers to something else, can't figure out what. Gotta go into event details and read Error Type.

Point about error types still stands - if you see a wild mix of different error types (Bus/Interconnect, cache hierarchy, parity check, or even PCIE related) then the board and other components might be more suspect. If not, focus on memory/Fabric/memory controller.
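The rule of thumb above can be sketched as a tiny triage helper. This is only an illustration: the function name and the returned wording are made up, and the "single other type" branch is my own assumption, since the post only covers the mixed case and the Bus/Interconnect-only case.

```python
from collections import Counter

BUS = "bus/interconnect"


def triage_whea(error_types):
    """Suggest where to look, given the Error Type string of each WHEA record."""
    seen = Counter(t.strip().lower() for t in error_types)
    if not seen:
        return "no records"
    if len(seen) > 1:
        # A wild mix of types points away from the CPU uncore alone.
        return "mixed types: board and other components more suspect"
    if BUS in seen:
        return "bus/interconnect only: focus on memory/Fabric/memory controller"
    # Assumption: a single non-bus type is outside this rule of thumb.
    return "single other type: not covered by this rule of thumb"


print(triage_whea(["Bus/Interconnect", "Bus/Interconnect"]))
```

Feed it the Error Type read from each event's details (or from each dump), one string per crash.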

Bytorane · Aug 5, 2024

Zach_01 said:
Exactly. For your convenience, you can create a custom view specific to WHEA, which makes the events easier to find.

View attachment 357626

I guess Event Viewer works a bit awkwardly: under Applications and Services Logs -> Microsoft -> Windows, the log sources that show up under Administrative Events or Windows Logs don't seem to be present. There are usually Admin and Operational logs for some similarly named sources, but they don't hold the same kind of events.

Or am I missing something? For example, if I go to Microsoft->Windows->Kernel-WHEA->Errors, there is one entry, but it holds basically no information, and it's not the same entry as the one in Administrative Events, which has a lot more info about the WHEA event (it's the one I pasted earlier into this thread). The neighbouring Operational log has more WHEA informational events, without much info.

So I guess the intended(?) usage is custom-made filters to get at what I'm after, but they could have just provided a per-source view listing all events from that source.

tabascosauz said:
Sorry, memory getting rusty. Seems like Event ID refers to something else, can't figure out what. Gotta go into event details and read Error Type.

Point about error types still stands - if you see a wild mix of different error types (Bus/Interconnect, cache hierarchy, parity check, or even PCIE related) then the board and other components might be more suspect. If not, focus on memory/Fabric/memory controller.

Right, I've always typed "!errrec ffff9c87103a2028" (ffff9c87103a2028 being bugcheck P2, and always the same) into WinDBG to get more info, and always got "Error Type : BUS error" and "Error : BUSL1_SRC_IRD_I_NOTIMEOUT_ERR (Proc 10 Bank 1)". So I think it's consistently a BUS error, no matter what game/program.

Right, the Event ID is Event Viewer's classification of the type of event, but the reported data says Error Type = 10.

tabascosauz · Aug 5, 2024

Download HCI Memtest, or Testmem5, or Karhu and memtest so we know that's not part of the crashing
Go into BIOS and up your VSOC a bit to see if it alleviates the WHEAs. Up to 1.2V max.
If not combine it with higher VDDGs. VDDG_CCD should be okay at 1.0V max, VDDG_IOD should be 0.05V below VSOC (ie. 1.10V for 1.15V VSOC)
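To make the arithmetic in those limits concrete, here is a minimal sketch. The constants are just the numbers from the list above (1.2V VSOC ceiling, 1.0V VDDG_CCD ceiling, VDDG_IOD 0.05V under VSOC); the function name is hypothetical and this is not a tuning tool.

```python
VSOC_MAX = 1.20        # upper limit suggested above
VDDG_CCD_MAX = 1.00    # upper limit suggested above
IOD_BELOW_VSOC = 0.05  # VDDG_IOD should sit this far under VSOC


def vddg_iod_for(vsoc):
    """Return the VDDG_IOD value that pairs with a given VSOC, in volts."""
    if vsoc > VSOC_MAX:
        raise ValueError(f"VSOC {vsoc} V is above the {VSOC_MAX} V limit")
    return round(vsoc - IOD_BELOW_VSOC, 3)


for vsoc in (1.10, 1.15, 1.20):
    print(f"VSOC {vsoc:.2f} V -> VDDG_IOD {vddg_iod_for(vsoc):.2f} V, "
          f"VDDG_CCD up to {VDDG_CCD_MAX:.2f} V")
```

The 1.15V row reproduces the 1.10V example given above.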

Zach_01 · Aug 6, 2024

@Bytorane
Not sure how familiar you are with Ryzen's chiplet configuration, but here it is:

Vsoc: entire SoC/IO die voltage
VDDP: UCLK/UMC voltage (Unified Memory Controller voltage) (derived from Vsoc)
VDDG CCD: FCLK (Infinity Fabric) voltage on the core chiplet(s) side (derived from Vsoc)
VDDG IOD: FCLK (Infinity Fabric) voltage on the SoC/IO die side (derived from Vsoc)

Bus/Interconnect errors are directly related to those CPU components, and sometimes to DRAM as well (the quality/noise of the board's DRAM-CPU traces may play a role too).

Bytorane · Aug 6, 2024

A few clarifications:

I have yet to do a test in Fortnite by the way, I had to spend a few hours searching for a replacement PSU and other chores. I think I've made a decision and finished ordering. Now back to testing.

I'm currently still on the borrowed PSU with stable voltages. I have to move slowly, step by step, which I think is good troubleshooting practice, so I have not changed any other BIOS settings yet. For now I'm only testing with the replaced PSU, and I'm still running 2 DIMMs (2x32GB) with BIOS DRAM Frequency set at 1866MT. That is the only performance-related BIOS setting that has been changed.

I can't test all day long, as it involves 30-60 minute play sessions and I can't do 10-hour sessions. I'm only doing perhaps 2-4 hours per day due to other work/chores (no teenage times anymore when we could rock all day long), and watching sensors, analyzing data and documenting this here also takes a significant chunk of time. I could leave DCS unattended, but online games need someone sitting at them, and spectating won't work either, as it will throw you out if AFK.

If I'm lucky, we'll see soon if it's only the PSU's fault, but I'll get back to it later in the day. In the meantime, yesterday I was wondering about the HWiNFO Power Reporting Deviation. I read something about it but don't understand yet why it's so high at idle, around 200-300%. It gets lower under stress, according to the logging I've done for some of the previous tests with Apex Legends (under the same conditions described above, but Apex never crashed in the 20-30 min sessions I've had so far).

I haven't had time to analyze the HWiNFO logs more deeply yet; maybe someone can give it a try while I'm busy.

Zach_01 · Aug 6, 2024

PRD only has a meaning at 100% CPU load. Not 98%, not 99%... Only at full 100% load.
You strictly need an all-core load like Cinebench, Prime95, or any kind of rendering, compression and so on.
Not gaming.
Otherwise it's meaningless and doesn't show anything.

Bytorane · Aug 6, 2024

HCI Memtest's weird: it can't allocate more than 4000 MB, it's a bit barebones, and with 128GB I'm supposed to open ... ~30 instances (no time for that right now). The TestMem5 website is broken and won't download, and Karhu is paid only?

I'll just run a standard, traditional bootable memtest from PassMark or whatever it is, later overnight; I already have that on a USB from 5 months ago.

tabascosauz · Aug 6, 2024

Fetch TM5 from overclock.net, i've never used the github spinoff version. The OCN version is hosted on mega and still works just fine.
HCI Memtest free version has the capacity limitation. You are supposed to open multiple instances and run them simultaneously until you almost max out your memory.
Karhu is the paid option of the 3, yes. TM5 is the most straightforward.
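As a rough sketch of the multi-instance arithmetic for the free HCI version: the per-instance cap of about 4000 MB and the amount of RAM left for Windows are assumptions on my part (the cap is consistent with the ~30 instances mentioned for 128GB), and the function name is made up.

```python
def hci_instance_count(total_gb, reserve_gb=4, per_instance_mb=4000):
    """How many capped instances fit while leaving reserve_gb for the OS."""
    testable_mb = (total_gb - reserve_gb) * 1024
    # Round down so the instances never ask for more than is available.
    return testable_mb // per_instance_mb


for total in (64, 128):
    n = hci_instance_count(total)
    print(f"{total} GB: {n} instances of 4000 MB")
```

Rounding down keeps the total just under what is free, which matches "until you almost max out your memory".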

The reason we use these 3 and not anything else is because anything else branded as a "memory test" is pretty much garbage. To produce relevant results you have to hammer as much of your memory as possible, as hard as possible, in addition to your memory controller and other CPU uncore components. MT86 and other memtests don't understand what "intensive" means, so they are out of the picture entirely. Same reason why, in addition to TM5/HCI, it can be important to run additional, potentially even tougher stress tests like certain y-cruncher configurations or Linpack Extreme (which do more or less a similar job) to essentially get a meaningful second opinion.

You don't need to be around for memtesting. Leave your PC on and leave it running. HCI goes on infinitely. TM5 anta777 or 1usmusv3 configs should only take 1-2hr max.

RJARRRPCGP · Aug 7, 2024

Caring1 said:
The first picture shows that the PSU is failing to supply the required voltages and needs to be replaced.

Suspect that PSU has bad caps!

tabascosauz said:
Fetch TM5 from overclock.net, i've never used the github spinoff version. TM5 anta777 or 1usmusv3 configs should only take 1-2hr max.

The Github-spinoff version is broken! It stopped, but the GUI still kept counting the elapsed time, IIRC. It doesn't terminate correctly, IIRC.

Even so, "Bus/Interconnect Error" is not usually an IMC-related issue. IIRC, that's IF instability, from what I gathered. I suggest that the OP go down to 32 GB and see if that fatal machine check exception is still being thrown. If that's still the case, then it looks like a faulty IF on the CPU.

tabascosauz · Aug 7, 2024

Yup bus/interconnect is usually just IF, but the UMC load here is quite high so I wasn't gonna say for sure. There's not a whole lot of reason outside of extreme OC to talk about the UMC separately from Fabric on AM4 as they are matched in speed (and at which speed the UMC is just breezing along), but it can still take a big hit from quad rank.

Since the only way to get 128GB on AM4 is with quad rank 16Gb ICs, I don't think 32GB (literally 1 stick) would be necessary or very illuminating. A dual rank 2x32GB setup like the one OP is running right now should already eliminate the UMC load concerns.

RJARRRPCGP · Aug 7, 2024

tabascosauz said:
Yup bus/interconnect is usually just IF, but the UMC load here is quite high so I wasn't gonna say for sure.

OTOH, I suspect my Ryzen 7 5800X (non-"3D" version) has a dud IMC or less tolerant to heat! Because I suddenly got "MEMORY_MANAGEMENT" BSOD from Windows 11 on May 18, 2023. (22H2, IIRC)

But, with the same RAM, TM5+anta777, passed on my Ryzen 7 5800X3D! Even at 3600, in 2024! I remember after getting that BSOD, I shut down the PC, took the G.Skill TridentZ Neo and left it sitting somewhere for months and months!

Then I continued working with the PC as the daily-driver with 2x8 GB sticks, with AData Spectrix D41. Never a BSOD again. It was with the Windows 11 22H2 installation from September 20, 2022, which was my longest-lasting Windows installation of recent, until I ran into an unexpected malvertising trap in very-early February, 2024. Then I had to wipe the SSD and change passwords!
I simply couldn't trust that Windows 11 installation anymore!

I still was using the September 20, 2022 Windows 11 22H2 installation for the Ryzen 9 5900X upgrade during Christmas, 2023.

Bytorane · Aug 7, 2024

tabascosauz said:
The reason we use these 3 and not anything else is because anything else branded as a "memory test" is pretty much garbage. To produce relevant results you have to hammer as much of your memory as possible, as hard as possible, in addition to your memory controller and other CPU uncore components. MT86 and other memtests don't understand what "intensive" means, so they are out of the picture entirely. Same reason why, in addition to TM5/HCI, it can be important to run additional, potentially even tougher stress tests like certain y-cruncher configurations or Linpack Extreme (which do more or less a similar job) to essentially get a meaningful second opinion.

Indeed, wow ... thinking about all of this, it did in fact cross my mind yesterday when I started the MT86 test that this kind of memtest doesn't really strain the system much overall, because the fans barely spin up, unlike in games. And when I did a CPU test with Prime95 it again strained mostly the CPU or RAM, but never the GPU.

So I ended the MT86 run after 13 hours, with 2x32GB under DOCP 3600MT, and it finished without errors. That was kind of expected and, as we can see, might not mean much.

I meant to say, I can't be around that much for game testing, which is where the crashing and errors happen the most.

Okay, I've regrouped and spammed the HCI Memtest instances; I found a quick way to get through the prompts fast. I wanted to buy the Pro version, but after seeing the payment options I didn't want to fuss with that right now.

No errors after a 10-hour HCI Memtest run, with 64GB RAM at 3600MT DOCP.

tabascosauz · Aug 8, 2024

Good stuff.

Is the 1.075V VSOC just natural droop from 1.1V set in BIOS?

1.1 VSOC should be more than adequate with just 2x32GB at 3600. Not too much else to say, settings look okay.

Bytorane · Aug 8, 2024

I played a 40-minute session of Fortnite before and then a 20-30 minute session after enabling DOCP ... but with 2 DIMMs. And no WHEA crashes or anything.

I never got to the step of manually adjusting voltages in BIOS; if it does end up crashing, I will of course get to that point.

I don't want to jump to conclusions, but it does start to look like it was a bad PSU issue.

Now I don't want to stop troubleshooting and testing. I'd appreciate it if people stuck around and pushed me to further optimize this setup, even if very preliminary results show WHEAs don't seem to happen (as quickly as they did before*).
I don't necessarily have to run with it most of the time, only when I'm rendering/processing or playing something where I'd want more kick for a few hours.
I'm just not that familiar with these modern OC options yet. I'll probably go watch some basics on uncore components and memory, as I'm a bit lost on the details of how these relationships work; besides, I haven't had an AMD CPU/motherboard since 2008.

Glancing at some of the information I found, I see that VSOC adjustment could help with stability when pushing memory frequencies ... but I'll get back into it in a day or two when time allows.

BTW: Excuse some of my attitude; I didn't mean to charge out like that and say bad things about those memtests earlier. I've got a lot of things going on.
However, about the multiple-instance approach: while the author says it's even better because it saturates CPU threads, some of the HCI Memtest instances were disproportionately disadvantaged, barely managing to hit 100% coverage while others (which started earlier) were at 300-400%.

tabascosauz said:
Is the 1.075V VSOC just natural droop from 1.1V set in BIOS?

1.1 VSOC should be more than adequate with just 2x32GB at 3600. Not too much else to say, settings look okay.

There's a slight difference in reported power and temperatures between BIOS and Windows, but I haven't focused on comparing most of them yet, as that's to be expected, although some of them probably shouldn't drop.

I was focusing on drop in the PSU voltages in the meantime, and they were rock stable whether under max load or not, same as reported in BIOS.

Now I've checked, after you reminded me:

Configuration:
BIOS VDDCR CPU = 1.408 Auto
BIOS VDDCR SOC = 1.100 Auto
BIOS DRAM = 1.350 DOCP
1.0V SB = 1.0 Auto
1.2V SB = 1.2 Auto
CPU 1.8V = 1.8 Auto
BIOS VTTDR = 0.675 Auto
VPP_MEM = 2.500 Auto
VDDP Standby = 0.900 Auto

I forgot to check the sensor monitoring in BIOS for the above configuration before I started a traditional Memtest86+ v7 run. Correction: I have the open-source one, not the PassMark one.
I'll post the readings later, once the 128GB DOCP 3600MT Memtest has soaked for a few hours, before I get back to other tests.

tabascosauz · Aug 8, 2024

No problem, it's easy to understand where the memtesting misconceptions come from. If ever you want to/have to do anything with memory beyond XMP, you'll personally very quickly find out that those classic bootable memtesting "tools" aren't good for much aside from verifying very obvious hardware DIMM failures.

I bought a HCI licence a long time ago, but also haven't used HCI in a long time. It's demanding enough for most people and good enough for a second opinion, but it's slow, more suited for overnight testing. TM5 is just better and more practical, frankly.

I would hesitate to say that VSOC helps with stability, because memory stability is more complicated than that. The statement isn't wrong, but VSOC is of limited usefulness in itself. It is of primary importance when dealing with Bus/Interconnect, but even then you only really get a very limited range to work with (up to 1.2V, and that's being generous as going past 1.15V might not actually help).

For you, VSOC has added relevance because on 4x32GB you will run into UMC load concerns that the vast majority of people do not. But the 1.2V limit still stands, so the point stands.

On AM4, memory stability has 3 facets:

1. DRAM stability: the actual stability of the memory profile you're running, in relation to certain settings and the VDIMM (DRAM voltage) you have set.
2. UMC stability: whether the memory controller is happy with the memory running the way it is. Usually just concerned with VSOC.
3. Fabric stability: whether the Infinity Fabric is happy with the speed it's running at, with VSOC and the VDDGs it has been given.

For most AM4 users, #2 is really not much concern and #1 and #2 go together.
Testing #1 (and #2 technically) is mainly TM5/HCI/Karhu's domain.
#3 doesn't really have a very clear and dedicated stress test. The ultimate test for #3 is just time and usage. But you can try things like Prime95 Large FFT (some others I've forgotten) and other memory-related tests that are very uncore-heavy (ie. high SOC Power draw in HWInfo, high memory usage/stress, high indicated CPU read/write bandwidth in HWInfo).

Although #2 becomes a standalone concern for AM5, the dynamics stay more or less the same. #2 gains in importance (although TM5 still handles #1 and #2), while #3 decreases in importance (due to how the decoupled FCLK works) but is still a pain to stabilize at the higher end (ie. 2200+).

Good to hear things are better now. Give it some time before calling it solved. WHEAs on AM4 have a habit of appearing only when you least expect them to - I spent quite some time with a rather diseased 3700X to have that experience myself.

Bytorane · Aug 9, 2024

Just two quick clarifications. I was referring to DOCP when I said earlier "I don't need to keep running with it all the time". And to be more precise, I got the new system in late 2020, but then the COVID thing happened and GPUs got so expensive that everything got stretched out with this build. I swapped two CPUs in between before the 5900X, one RAM kit and one GPU, so I've pretty much had a high-end AMD AM4 machine for a year, and for most of that time I've been too busy with other chores/life stuff. I think the PSU is from 2016 or 2018 or so ... I haven't found the receipts yet. We optimally hope these things last 10 years, and this would be 6-8 years now, so it should be fair, but I think the warranty was 5 years on this one, if I have to guess, idk.

HCI Memtest will keep running through the night and into the afternoon tomorrow, before I get back to this. How long is reasonable, though? I can and will rerun these tests once I put the machine back together with the new PSU anyway. Right now it's a bare motherboard sitting on a desk at another location with the borrowed PSU, and the last thing I need is someone tripping on some wires or spilling something on the board, so it might be better to pause in a few days and just wait for the new PSU with the machine offline.

Update:

HCI Memtest keeps running fine with 128GB at DOCP 3600MT, around 24 hours now. Is that enough? How long is it usually worthwhile to keep going?

tabascosauz · Aug 9, 2024

Usually 1000%+ coverage is a decent indicator of long term stability. I think I used to run up to 1600% or so.

But tbh TM5 anta777 or 1usmusv3 gives me more confidence just within its default 30min-2hr runtime. And I change the config so it runs overnight.
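Since multiple HCI instances progress at different rates, one way to apply the 1000% guideline is to judge the run by its slowest instance. That reading is my assumption, not something stated above, and the function name and the sample numbers below are purely illustrative.

```python
def coverage_reached(instance_coverages_pct, target_pct=1000):
    """True once every HCI Memtest instance has reached the target coverage."""
    if not instance_coverages_pct:
        raise ValueError("no instances to check")
    # The slowest instance decides whether the whole run counts as done.
    return min(instance_coverages_pct) >= target_pct


# Made-up coverage readings for three instances, in percent.
print(coverage_reached([1200, 1050, 400]))
print(coverage_reached([1600, 1100, 1000]))
```

The first call reports an unfinished run because one instance is still at 400%, even though the others are past the target.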

RJARRRPCGP · Aug 10, 2024

tabascosauz said:
Good to hear things are better now. Give it some time before calling it solved. WHEAs on AM4 have a habit of appearing only when you least expect them to - I spent quite some time with a rather diseased 3700X to have that experience myself.

What happened to your Ryzen 7 3700X?! My 3700X would just throw "Cache Hierarchy Error" sometimes. But it was more likely to do it in a warm room (especially >71F).

Bytorane · Aug 10, 2024

Right after finishing HCI Memtest, I ran TestMem5 1usmusv3 for 9+ hours, with 128GB at DOCP 3600MT. No problem: it maxed out the RAM and CPU, no errors.

But ... oh well, right after that I ran a 30-minute Fortnite match: WHEA, same thing, AuthenticAMD.sys BUS Error.

I got all I needed from the borrowed PSU/GPU and have disassembled the setup. Now I need to clean up and redo the primary PC case and reshuffle a few things around the place in preparation for the new PSU. Technically I could keep testing in the meantime with the Seasonic FOCUS PX-750W from my secondary PC, but the new PSU is scheduled to arrive in 3 days, so I might take a break.

The next obvious big test is to try the RX480 GPU again and see if that makes a difference.

A Computer Guy · Aug 10, 2024

With the 128GB of RAM installed, did you try downclocking it to DDR4-3200 yet?

Bytorane · Aug 11, 2024

So I had the time and decided to swap motherboards in my secondary PC, and went for the RX480 along the way.

Current setup is the same Win10 installation, now with a Seasonic Focus PX-750W PSU, the Radeon RX480 GPU and I kept 128GB@3600MT DOCP.

I tried Apex Legends for a ~60 minute session, but of course one match doesn't last 60 minutes, so there were pauses in system load during matchmaking/respawning/lobby.
Then I ran the first mission of Crysis Remastered for 20 minutes, and after that the built-in GPU and CPU benchmarks, launched from the provided BAT files. Each of the benchmarks took around 10 minutes. I'll see if I can configure a higher number of repetitions (default 4) of what is a 2-3 minute run.
I also gave DOOM 2016 a quick try for 5 minutes.

No issues at all here.

Later today, if I have time, I'll try Fortnite again. By the way, Fortnite always froze during active gameplay, and so did all the other games for that matter; nothing ever froze in menus or while paused.
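
If the benchmark itself has no repetition setting, one workaround for the repeat-count idea above is to loop the launcher from outside. This is only a sketch: `run_repeated` is a made-up helper, and the BAT file name in the comment is a placeholder, not the real file name shipped with the game.

```python
import subprocess
import sys
import time


def run_repeated(command, repetitions, pause_s=0.0):
    """Run `command` `repetitions` times; return a list of (exit_code, seconds)."""
    results = []
    for _ in range(repetitions):
        start = time.monotonic()
        completed = subprocess.run(command)  # blocks until the run finishes
        results.append((completed.returncode, time.monotonic() - start))
        if pause_s:
            time.sleep(pause_s)  # optional gap between runs
    return results


# Example (Windows; "Benchmark_GPU.bat" is a hypothetical name):
#   for code, secs in run_repeated(["cmd", "/c", "Benchmark_GPU.bat"], 12):
#       print(f"exit code {code}, {secs:.0f} s")
```

Twelve back-to-back runs of a 2-3 minute benchmark would give roughly the same sustained load time as a long match, without the lobby and matchmaking pauses.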

Update:

It gets worse.
Unexpected and unexplained shutdown while updating DCS World to the latest patch, the one with the CH-47 release. What the hell ...

... !!!!
....

... after checking Event Viewer for the third time:

The process C:\DCS World\bin\DCS_updater.exe (REDACTED) has initiated the restart of computer REDACTED on behalf of user REDACTED\REDACTED for the following reason: Application: Installation (Planned)
Reason Code: 0x80040002
Shut-down Type: restart
Comment: OS restart is required before running the application.
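
A quick way to tell this kind of planned restart apart from a real crash is to pull the fields out of the event text. The sketch below parses the message exactly as pasted above; `parse_restart_event` and its field names are illustrative, and the patterns only assume the wording shown in this entry.

```python
import re


def parse_restart_event(message):
    """Extract the main fields from a 'has initiated the restart' event message."""
    fields = {}
    m = re.search(
        r"The process (.+?) \(.*?\) has initiated the (\w+) of computer", message
    )
    if m:
        fields["process"] = m.group(1)
        fields["action"] = m.group(2)
    patterns = (
        ("reason", r"for the following reason:\s*(.+)"),
        ("reason_code", r"Reason Code:\s*(0x[0-9A-Fa-f]+)"),
        ("shutdown_type", r"Shut-down Type:\s*(\w+)"),
        ("comment", r"Comment:\s*(.+)"),
    )
    for key, pattern in patterns:
        m = re.search(pattern, message)
        if m:
            fields[key] = m.group(1).strip()
    # A reason ending in "(Planned)" means software asked for the restart.
    fields["planned"] = "(Planned)" in fields.get("reason", "")
    return fields
```

If `planned` comes back true and `process` names an updater or installer, the reboot was requested by software and has nothing to do with the WHEA freezes.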

Seriously, I don't think I've seen anything like this in a very long time: no warning, no notification, nothing. I'm sure the devs would agree this is not the best UX and probably a mistake. I launched the update via the brand new launcher, which might have something to do with it.
Apparently it was due to a VC++ Redist component update, but the restart didn't come from the VC++ Redist installer; it was initiated intentionally by the updater. I'm still only half convinced that a reboot was really necessary. Software often asks for one without needing it, and with games, Steam does first-time setups with various component installs and preparations and never* (I've never seen it) needs a restart. Many times the VC or Framework installers also have a "noreboot" option.

Luckily I wasn't doing anything too important at the time (I generally don't while troubleshooting).

False alarm I guess, but it was a demoralizing few moments. I literally thought: if this is so broken, I might just sell the whole PC and go to AM5.

Bytorane · Aug 12, 2024

Update 2:

So with the Sapphire AMD Radeon RX480 Nitro+ OC 8GB (Polaris) and the latest driver from this year (seems to be quite recent):
Fortnite 10 hour session, 10+ matches, no WHEA, no crash yet.

... but it also means that, worst case, I might be better off replacing the PC than chugging along with AM4, or what?

Surprising and not surprising ... it's looking more like a CPU or motherboard problem, doesn't it?

Looking around some more, other people have reported a defective CPU being at fault:

5950x WHEA Errors (New PC Build)

Hi Everyone! I just built a new PC and I'm getting random WHEA UNCORRECTABLE ERROR(s). Sometimes it happens immediately on startup. Sometimes it will happen after an hour of gaming. This error has happened to me at least 20 times in 2 days. The performance is great, but my system is not stable...

community.amd.com

Ryzen 5900x: System constantly crashing/restarting WHEA-Logger ID 18 and critical error Kernel-Power

Mainboard: MSI x570 Unify Mainboard-BIOS: 7C35vA82 (Beta version) CPU: Ryzen 5900x RAM: Crucial Ballistix BL2K32G36C16U4B 3600 MHz, 64GB (32GB x2) Drive: M.2 Samsung 970 Evo+ 1TB SSD Graphics: SAPPHIRE Nitro+ Radeon RX 5700 XT PSU: be quiet straight power 11 750w Platinum OS: Win 10 Pro (64bit)...

community.amd.com

WHEA Logger 18 reboot while playing a specific game - Ryzen 3800x

Hello everyone, I'm really frustrated with my PC, it will reboot at random when playing EVE online, after and hour or two. It never happens while playing any other game, I've: updated all drivers updated BIOS Reinstalled the game Removed XMP and set BIOS to defaults Ran memtest86 off a usb stick...

community.amd.com

Next test: Nvidia 1070. I was hoping for a more modern Nvidia GPU, since the 1070 might only strain the CPU/mobo about as much as the RX480 does, though I haven't looked up how that series compares to the RX480.
After that: switch the CPU.

System Name	Lenovo ThinkCentre
Processor	AMD 5650GE
Motherboard	Lenovo
Memory	32 GB DDR4
Display(s)	AOC 24" Freesync 1m.s. 75Hz
Mouse	Lenovo
Keyboard	Lenovo
Software	W11 Pro 64 bit

System Name	PC on since Aug 2019, 1st CPU R5 3600 + ASUS ROG RX580 8GB >> MSI Gaming X RX5700XT (Jan 2020)
Processor	Ryzen 9 5900X (July 2022), 163W PPT limit, 80C temp limit, CO -8~12
Motherboard	Gigabyte X570 Aorus Pro (Rev1.0), BIOS F37h, AGESA V2 1.2.0.B
Cooling	Arctic Liquid Freezer II 420mm Rev7 (Jan 2024) with off center mount for Ryzen, TIM: Kryonaut
Memory	2x16GB G.Skill Trident Z Neo GTZN (July 2022) 3600MT/s 1.38V CL16-16-16-16-32-48 1T, tRFC:280, B-die
Video Card(s)	Sapphire Nitro+ RX 7900XTX (Dec 2023) 314~466W (366W current) PowerLimit, 1060mV, Adrenalin v24.7.1
Storage	Samsung NVMe: 980Pro 1TB(OS 2022), 970Pro 512GB(2019) / SATA-III: 850Pro 1TB(2015) 860Evo 1TB(2020)
Display(s)	Dell Alienware AW3423DW 34" QD-OLED curved (1800R), 3440x1440 144Hz (max 175Hz) HDR400/1000, VRR on
Case	None... naked on desk
Audio Device(s)	Astro A50 headset
Power Supply	Corsair HX750i, 80+ Platinum, 93% (250~700W), modular, single/dual rail (switch)
Mouse	Logitech MX Master (Gen1)
Keyboard	Logitech G15 (Gen2) w/ LCDSirReal applet
Software	Windows 11 Home 64bit (v23H2, OSBuild 22631.4037), upgraded from Win10 to Win11 on Feb 2024

System Name	ab┃ob
Processor	7800X3D┃5800X3D
Motherboard	B650E PG-ITX┃X570 Impact
Cooling	NH-U12A + T30┃AXP120-x67
Memory	64GB 6400CL32┃32GB 3600CL14
Video Card(s)	RTX 4070 Ti Eagle┃RTX A2000
Storage	8TB of SSDs┃1TB SN550
Case	Caselabs S3┃Lazer3D HT5


System Name	KHR-1
Processor	Ryzen 9 5900X
Motherboard	ASRock B550 PG Velocita (UEFI-BIOS P3.40)
Memory	32 GB G.Skill RipJawsV F4-3200C16D-32GVR
Video Card(s)	Sapphire Nitro+ Radeon RX 6750 XT
Storage	Western Digital Black SN850 1 TB NVMe SSD
Display(s)	Alienware AW3423DWF OLED-ASRock PG27Q15R2A (backup)
Case	Corsair 275R
Audio Device(s)	Technics SA-EX140 receiver with Polk VT60 speakers
Power Supply	eVGA Supernova G3 750W
Mouse	Logitech G Pro (Hero)
Software	Windows 11 Pro x64 23H2

System Name	Not a thread ripper but pretty good.
Processor	Ryzen 9 5950x
Motherboard	ASRock X570 Taichi (revision 1.06, BIOS/UEFI version P5.50)
Cooling	EK-Quantum Velocity, EK-Quantum Reflection PC-O11, EK-CoolStream PE 360, XSPC TX360
Memory	Micron DDR4-3200 ECC Unbuffered Memory (4 sticks, 128GB, 18ASF4G72AZ-3G2F1)
Video Card(s)	XFX Radeon RX 5700 & EK-Quantum Vector Radeon RX 5700 +XT & Backplate
Storage	Samsung 2TB & 4TB 980 PRO, 2TB 970 EVO Plus, 2 x Optane 905p 1.5TB (striped), AMD Radeon RAMDisk
Display(s)	2 x 4K LG 27UL600-W (and HUANUO Dual Monitor Mount)
Case	Lian Li PC-O11 Dynamic Black (original model)
Power Supply	Corsair RM750x
Mouse	Logitech M575
Keyboard	Corsair Strafe RGB MK.2
Software	Windows 10 Professional (64bit)
Benchmark Scores	Typical for non-overclocked CPU.