bingobangobongo
Hello everyone.
Recently I acquired a broken (yet cheap) NVIDIA Quadro K4000 3GB with what looks like memory problems. My use case is CUDA C++ development for fun and profit (in knowledge).
Symptoms:
Tested on Ubuntu 20.04 Server: the moment the OS enters userspace (login screen), the GPU starts showing repetitive blocks of green and purple lines in boxes, identical to this image but on a larger scale. The machine otherwise works fine: I can log in remotely over SSH and list PCI devices. "nvidia-smi" returns "No cards found", but I believe that is a driver, kernel, or machine problem (dmesg shows "NVRM: RmInitAdapter failed!"). I will test it on a fresh OS install tomorrow. The artifacts also show up on two other machines running Windows 7 and Windows 10 (both without drivers).
My ideas:
- BIOS modification to disable faulty memory bank
From my research, it should be possible to disable a faulty memory bank in the BIOS ROM, as shown in this video (RTX 2070 Super Repair (Faulty memory and error 43)). There are also threads (like this one (Where to edit fake GTX1050Ti BIOS to fix memory size?)) about fixing fake GPUs that report more memory than they have. Would lowering this value trick the GPU itself into not using banks in the higher address space, and maybe fix the problem?
- Replacing broken memory ICs with a twist
My Quadro uses Samsung K4G20325FD-FC03, but the only compatible ICs I could find locally are Samsung K4G20325FD-FC04, which this datasheet lists with a 0.07 ns slower "speed". I guess I would have to clock the memory down in the BIOS to make them work, if they would work at all?
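To put a rough number on the downclock, here is a minimal Python sketch. It assumes the "speed" figure in the datasheet is a clock period in nanoseconds and that GDDR5 transfers two bits per pin per period of that clock. The 0.33 ns and 0.40 ns values are my assumption for the -FC03 and -FC04 grades (they match the 0.07 ns difference, but should be checked against the datasheet), and the function names are made up for illustration.

```python
def data_rate_gbps(clock_period_ns: float) -> float:
    """Per-pin data rate in Gbit/s, assuming two bits per clock period."""
    return 2.0 / clock_period_ns


def downclock_ratio(original_period_ns: float, replacement_period_ns: float) -> float:
    """Fraction of the original rated speed that the slower part is rated for."""
    return original_period_ns / replacement_period_ns


if __name__ == "__main__":
    fc03_ns = 0.33  # assumed period for the -FC03 grade
    fc04_ns = 0.40  # assumed period for the -FC04 grade
    print(f"-FC03 rated for about {data_rate_gbps(fc03_ns):.2f} Gbit/s per pin")
    print(f"-FC04 rated for about {data_rate_gbps(fc04_ns):.2f} Gbit/s per pin")
    print(f"-FC04 is rated for {downclock_ratio(fc03_ns, fc04_ns):.1%} of the -FC03 speed")
```

This only compares the rated speed grades. Whether the card actually runs its memory at the full -FC03 rating, and whether the timings in the BIOS would also need changing, is not something this sketch can answer.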
Conclusion and questions:
I see two ways to make this card work again; with the latter I could damage the card irreparably. I'm moderately confident in my hot-air station skills, but I have only used it to solder SOP packages (mainly flash/EEPROMs).
Is it possible to modify the BIOS for this card so that it does not use a specified memory bank? If so, how? (more on the second question later)
Is it possible to modify the BIOS to lower the reported memory size and trick the GPU into not using banks in the higher address space?
Would -FC04 ICs work with this card given a lower memory clock?
Additional information:
I've spent quite some time trying to understand NVIDIA's BIT structure from this document, but the pointers do not make any sense. The places they point to look random, and I could not find information about the structures they point to. I also tried to understand the changes made to the BIOS ROMs in the threads about fake GPUs mentioned above, with no luck. I saw that most of the memory-related data is around address 0x7dXX in these examples. I would love to learn more about NVIDIA's BIOS structure, but I could not find any information that shows exactly how things are laid out. I already know I'll have to check exactly which memory bank is broken, using MATS.
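For anyone who wants to look at the BIT pointers in a dump, here is a minimal Python sketch that lists the token table. It assumes the header layout as I understand it from the BIT document: a 16-bit ID 0xB8FF, the signature "BIT\0", a 16-bit BCD version, then one byte each for header size, token size, token count and checksum, followed by 6-byte tokens (ID, data version, 16-bit data size, 16-bit data pointer). The function names and the synthetic test ROM are made up for illustration; nothing here was verified against the K4000 ROM.

```python
import struct

# 0xB8FF stored little-endian, followed by the ASCII signature "BIT\0"
BIT_MAGIC = b"\xff\xb8BIT\x00"


def parse_bit_tokens(rom: bytes):
    """Return (token_id, data_version, data_size, data_pointer) per BIT token."""
    start = rom.find(BIT_MAGIC)
    if start < 0:
        raise ValueError("BIT header not found")
    # Assumed header layout: id u16, signature 4 bytes, BCD version u16,
    # header size u8, token size u8, token count u8, checksum u8.
    _id, _sig, _ver, header_size, token_size, token_count, _chk = struct.unpack_from(
        "<H4sHBBBB", rom, start
    )
    tokens = []
    for i in range(token_count):
        offset = start + header_size + i * token_size
        # Assumed token layout: id u8, data version u8, data size u16, pointer u16.
        tok_id, data_ver, data_size, data_ptr = struct.unpack_from("<BBHH", rom, offset)
        tokens.append((chr(tok_id), data_ver, data_size, data_ptr))
    return tokens


def make_synthetic_rom() -> bytes:
    """Build a tiny fake ROM (made-up values) just to exercise the parser."""
    header = struct.pack("<H4sHBBBB", 0xB8FF, b"BIT\x00", 0x0100, 12, 6, 2, 0)
    tokens = struct.pack("<BBHH", ord("M"), 2, 0x10, 0x7D00)
    tokens += struct.pack("<BBHH", ord("P"), 2, 0x20, 0x1234)
    return b"\x55\xaa" + b"\x00" * 30 + header + tokens


if __name__ == "__main__":
    for tok_id, ver, size, ptr in parse_bit_tokens(make_synthetic_rom()):
        print(f"token {tok_id!r} v{ver}: {size} bytes at 0x{ptr:04x}")
```

To try it on a real dump, read the file with `open(path, "rb").read()` and pass the bytes to `parse_bit_tokens`. If the printed pointers still look random, one thing to check (a guess on my part) is whether they are relative to the start of the ROM image rather than the start of the dump file.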
I have a CH341A programmer on hand, so experimenting with different ROMs is not a problem. I have also attached the BIOS ROM I dumped from this card.
If you need more information, please ask. I'll provide what I can.