News Posts matching #Overheating

NVIDIA "Blackwell" NVL72 Servers Reportedly Require Redesign Amid Overheating Problems

According to The Information, NVIDIA's latest "Blackwell" processors are encountering significant thermal management issues in high-density server configurations, potentially affecting deployment timelines for major tech companies. The challenges emerge specifically in GB200 NVL72 racks housing 72 GB200 processors; each rack can consume up to 120 kilowatts of power and weighs a "mere" 3,000 pounds (about 1.5 tons). These thermal concerns have prompted NVIDIA to revisit and modify its server rack designs multiple times to prevent performance degradation and potential hardware damage. Hyperscalers like Google, Meta, and Microsoft, which rely heavily on NVIDIA GPUs for training their advanced language models, have allegedly expressed concerns about possible delays in their data center deployment schedules.

The thermal management issues follow earlier setbacks related to a design flaw in the Blackwell production process. The problem stemmed from the complex CoWoS-L packaging technology, which connects dual chiplets using an RDL interposer and LSI bridges. Thermal expansion mismatches between various components led to warping issues, requiring modifications to the GPU's metal layers and bump structures. A company spokesperson characterized these modifications as part of the standard development process, noting that a new photomask resolved this issue. The Information states that mass production of the revised Blackwell GPUs began in late October, with shipments expected to commence in late January. However, NVIDIA has not confirmed these timelines, and server makers such as Dell have confirmed that these liquid-cooled GB200 NVL72 systems are shipping now, not in January, with GPU cloud provider CoreWeave as a customer. The original report could be using older information, as Dell is one of NVIDIA's most significant partners and among the first in the supply chain to gain access to new GPU batches.

AMD Radeon RX 7900 XTX May Feature Faulty Coolers, Causing Overheating

AMD's latest GPUs are reportedly experiencing overheating issues, with many users claiming that the vapor chamber cooler works better in a vertical rather than a horizontal position. Regardless of orientation, vapor chamber coolers should deliver roughly the same heat dissipation performance and move the heat away from the source; however, testing showed that some reference AMD Radeon RX 7900 XTX GPUs feature defective coolers. According to testing conducted by Roman "der8auer" Hartung, AMD's Radeon RX 7900 XTX RDNA3 GPUs are overheating because of a faulty vapor chamber design.

What der8auer found is that these coolers could have a manufacturing defect that prevents the liquid inside the vapor chamber from circulating properly after condensation. The cause could be a manufacturing issue in the cooler itself, such as an inadequate amount of fluid or insufficient pressure inside the chamber. For more in-depth testing and performance benchmarks, see der8auer's video. It is important to note that we haven't seen other reports that replicate this behavior, so always take these reports with a grain of salt.

Tesla to Patch 130,000 Cars with AMD Ryzen APUs Due to Overheating

Tesla, one of the driving forces behind electric vehicles in the car market, announced today that it will issue a soft recall of select car models over an overheating issue. The affected vehicles are the 2022 Tesla Model 3, 2021-2022 Tesla Model S, 2021-2022 Tesla Model X, and 2022 Tesla Model Y. The infotainment system in these cars is powered by AMD Ryzen APUs, which replaced the Intel Atom CPUs found in previous models. When the Ryzen APU overheats, the infotainment system can lag, restart, or sometimes shut off completely. The problem is that the car's liquid cooling prioritizes cooling the batteries over the processor, causing the processor to overheat. Tesla issued a soft recall on these models, meaning that a regular firmware update will fix the issue.
Tesla, Inc. (Tesla) is recalling certain 2021-2022 Model S, Model X, and 2022 Model 3 and Model Y vehicles operating certain firmware releases. The infotainment central processing unit (CPU) may overheat during the preparation or process of fast-charging, causing the CPU to lag or restart. A lagging or restarting CPU may prevent the center screen from displaying the rearview camera image, gear selection, windshield visibility control settings, and warning lights, increasing the risk of a crash. Tesla will perform an over-the-air (OTA) software update that will improve CPU temperature management, free of charge. Owner notification letters are expected to be mailed July 1, 2022. Owners may contact Tesla customer service at 1-877-798-3752. Tesla's number for this recall is SB-22-00-009.

EVGA GTX 1070/1080 Overheating Issues Update - New BIOS Revision To Be Released

After reports of EVGA cards overheating and sometimes becoming non-operational, which we covered right here on TPU, the company has now issued a statement further clarifying the steps it's taking to solve the issues. Though it was first reported that only the GTX 1070/1080 FTW series of cards were having issues, the company has also extended its efforts to the GTX 1060 cards, in both 3 GB and 6 GB flavors. This may point either to underlying problems with those cards as well, or simply to EVGA extending that bit of extra support to its customers.

While at first it seemed that the company-distributed, free-of-charge thermal pads (which EVGA stressed were optional in nature) would be enough to fix any and all issues, the company is also issuing a BIOS revision in a few days, which "adjusts the fan speed curve" to "ensure sufficient cooling of all components across all operating temperatures".

EVGA GTX 1070/1080 Overheating Issues - Company Says Thermal Pads A Solution

After users' reports (and Tom's Hardware.de testing) of EVGA FTW 1080 and 1070 cards displaying black screen issues, and sometimes even sparking and dying altogether, even at stock voltage, the company is now moving towards fixing the issue.

Apparently, the issue stems from the absence of any thermal pads over the VRM area of the FTW line of cards, which leads to higher operating temperatures. Some users reported so much heat transfer that even the GDDR5X memory chips on the cards were being heated to 107 °C, significantly over their rated operating temperature range of 0 °C ≤ TC ≤ +95 °C.
Dec 22nd, 2024 00:16 EST
