Wednesday, March 6th 2024
Dr. Lisa Su Responds to TinyBox's Radeon RX 7900 XTX GPU Firmware Problems
The TinyBox AI server system attracted plenty of media attention last week—its creator, George Hotz, decided to build with AMD RDNA 3.0 GPU hardware rather than the expected/traditional choice of CDNA 3.0. Tiny Corp. is a startup firm dealing in neural network frameworks—they currently "write and maintain tinygrad." Hotz & Co. are in the process of assembling rack-mounted 12U TinyBox systems for customers—an individual server houses an AMD EPYC 7532 processor and six XFX Speedster MERC310 Radeon RX 7900 XTX graphics cards. The Tiny Corp. social media account has engaged in numerous NVIDIA vs. AMD AI hardware debates/tirades—Hotz appears to favor the latter, as evidenced in his latest choice of components. ROCm support on Team Red AI Instinct accelerators is fairly mature at this point in time, but a much newer prospect on gaming-oriented graphics cards.
Tiny Corporation's unusual leveraging of Radeon RX 7900 XTX GPUs in a data center configuration has already hit a developmental roadblock. Yesterday, the company's social media account expressed driver-related frustrations in a public forum: "If AMD open sources their firmware, I'll fix their LLVM spilling bug and write a fuzzer for HSA. Otherwise, it's not worth putting tons of effort into fixing bugs on a platform you don't own." Hotz's latest complaint was taken onboard by AMD's top brass—Dr. Lisa Su responded with the following message: "Thanks for the collaboration and feedback. We are all in to get you a good solution. Team is on it." Her software engineers—within a few hours—managed to fling out a set of fixes in Tiny Corporation's direction. Hotz appreciated the quick turnaround, and proceeded to run a model without encountering major stability issues: "AMD sent me an updated set of firmware blobs to try. They are responsive, and there have been big strides in the driver in the last year. It will be good! This training run is almost 5 hours in, hasn't crashed yet." Tiny Corp. drummed up speculation about AMD open sourcing GPU MES firmware—Hotz disclosed that he will be talking (on the phone) to Team Red leadership.
Sources:
Lisa Su Tweet, Tom's Hardware
24 Comments on Dr. Lisa Su Responds to TinyBox's Radeon RX 7900 XTX GPU Firmware Problems
So did AI fail to test the vbios :eek:
But yea, at least Jacket Lady (supposedly) lit a fire & got 'er done, supposedly....
But publishing that gaming GPUs come with bugs that crash the simulation is in no way a PR stunt; it's a disaster, even if a fix is built by AMD and applied in a matter of hours.
I would stay away from a solution like this after reading that "crashing" word and consider AMD next year. I mean, we talk about how people shouldn't want to become beta testers for Intel's gaming GPUs, so why would someone spend 15K to become a beta tester for Tinybox and AMD?
This is a good sign because if AMD actually allows more access, it can make the RX 7900 XTX a viable alternative to the RTX 4090 (and RTX 3090) for ML/AI usage.
For what it's worth, you can access the scheduler and block scheduling directly (DMA) on an NVIDIA card through CUDA API calls.
It's also not great optics that TinyBox hasn't even fully validated their systems before starting marketing and sales, and is now trying to put the blame on AMD for their lack of proper long-term testing. They're literally just trying to get around paying an arm and a leg for enterprise-grade CDNA cards by going with consumer-grade RDNA cards. And they're not even reference models straight from AMD (which are at least guaranteed to work because it IS reference), but AIB customs, which may have their own quirks due to out-of-the-box OC'ing or tweaked components on the card vs. reference. They announced their new product just last month and started initial sales, but didn't begin testing anything until now, and are only now realizing there are some teething issues in using consumer products in an enterprise environment.
If anything, TinyBox should be happy that AMD is even bothering to give them the Enterprise-level treatment of priority service. AMD could just have put them in the prosumer queue for using consumer cards in an enterprise environment. It's not like TinyBox could go anywhere either; Nvidia would demand they cease-and-desist and switch to their enterprise models since they're selling to potential enterprise/datacenter clients (even though running AI via CUDA is permitted on their consumer cards, but only at the consumer/prosumer level), while Intel would shrug and tell them to go all Intel for their All Intel AI PC program in which Intel will begin incorporating AI elements into their CPUs (and IIRC, they don't have a dedicated accelerator equivalent to Instinct or the Nvidia equivalent yet).
GeoHotz is legit.
Nothing wrong with AMD consumer cards being able to run AI; it's great for the home user or prosumer looking to utilize AI to assist them. However, expecting them to run like enterprise is stupid, and expecting to be treated like enterprise customers is even stupider, considering that's literally what the enterprise segment is for. Even Nvidia doesn't care if their GeForce gaming cards are used for AI in small environments, but they definitely will not provide any support to run it at the enterprise scale, directing customers to their enterprise line up in the first place.
At any rate, it's not even ROCm the guy is working with based on his TwiX posts, but something more exotic that requires access to the proprietary core code, which he wants open-sourced. Going by comments elsewhere, it's supposedly the kind of code that could make or break AMD's GPUs, so it's no surprise they wouldn't want to open source that, and why they chose to put his team in connection with a high level engineer to work out a custom fix. As for ROCm on 7900XTX and 7900XT, AMD is admittedly kind of overdue on updating it, since the last major update was over 6 months back.
As well, if AI ends up following a similar path as cryptocurrency, there will eventually be near-consumer level AI accelerators developed at some point to handle most of the work that could beat a consumer card many times over. Assuming economies of scale don't just see that kind of tech end up in consumer GPUs anyway for future games; proclaiming smarter enemies, smarter allies, smarter wildlife, the return of PhysX in some manner, or advanced realistic sound generation and artificial surround sound (assuming AI in games doesn't become another novelty like PhysX; underutilized then killed because it takes up too much dev time).
I was under the impression that companies typically use dedicated hardware and individuals typically use gaming cards?
And I don't follow individuals or AI so I wouldn't know.
In any case, the 4090 is used by many for AI, and that's the reason it is one of those cards restricted for export to China.