Raevenlord
News Editor
- Joined
- Aug 12, 2016
- Messages
- 3,755 (1.22/day)
- Location
- Portugal
System Name | The Ryzening |
---|---|
Processor | AMD Ryzen 9 5900X |
Motherboard | MSI X570 MAG TOMAHAWK |
Cooling | Lian Li Galahad 360mm AIO |
Memory | 32 GB G.Skill Trident Z F4-3733 (4x 8 GB) |
Video Card(s) | Gigabyte RTX 3070 Ti |
Storage | Boot: Transcend MTE220S 2TB, Kintson A2000 1TB, Seagate Firewolf Pro 14 TB |
Display(s) | Acer Nitro VG270UP (1440p 144 Hz IPS) |
Case | Lian Li O11DX Dynamic White |
Audio Device(s) | iFi Audio Zen DAC |
Power Supply | Seasonic Focus+ 750 W |
Mouse | Cooler Master Masterkeys Lite L |
Keyboard | Cooler Master Masterkeys Lite L |
Software | Windows 10 x64 |
This news isn't properly today's, but it's relevant and interesting enough that I think warrants a news piece on our page. My reasoning is this: in an era where Multi-Chip Modules (MCM) and a chiplet approach to processor fabrication has become a de-facto standard for improving performance and yields, a trillion-transistor processor that eschews those modular design philosophies is interesting enough to give pause.
The Wafer Scale engine has been developed by Cerebras Systems to face the ongoing increase in demand for AI-training engines. However, in workloads where latency occur a very real impact in training times and a system's capability, Cerebras wanted to design a processor that avoided the need for a communication lane for all its cores to communicate - the system is only limited, basically, by transistors' switching times. Its 400,000 cores communicate seamlessly via interconnects, etched on 42,225 square millimeters of silicon (by comparison, NVIDIA's largest GPU is 56.7 times smaller at "just" 815 square millimeters).
However, in a world where silicon wafer manufacturing still has occurrences of manufacturing defects that can render whole chips inoperative, how did Cerebras manage to build such a large processor and keep it from having such defects that it can't actually deliver on the reported specs and performance? The answer is an old one, mainly: redundancy, paired with some additional magical engineering powders achieved in conjunction with the chips' manufacturer, TSMC. The chip is built on TSMC's 16 nm node - a more refined process with proven yields, cheaper than a cutting-edge 7 nm process, and with less areal density - this would make it even more difficult to properly cool those 400,000 cores, as you may imagine.
Cross-reticle connectivity, yield, power delivery, and packaging improvements have all been researched and deployed by Cerebras in solving the scaling problems associated with such large chips. moreover, the chips is built with redundant features that should ensure that even if some defects arise in various parts of the silicon chip, the areas that have been designed as "overprovisioning" can cut in an pick up the slack, routing and processing data without skipping a beat. Cerebras says any given component (cores, SRAM, etc) of the chip features 1%, 1.5% of additional overprovisioning capability that enables any manufacturing defects to be just a negligible speedbump instead of a silicon-waster.
The inter-core communication solution is one of the most advanced ever seen, with a fine-grained, all-hardware, on-chip mesh-connected communication network dubbed Swarm that delivers an aggregate bandwidth of 100 petabits per second.. this is paired with 18 Gb of local, distributed, superfast SRAM memory as the one and only level of the memory hierarchy - delivering memory bandwidth in the realm of 9 petabytes per second.
The 400,000 cores are custom-designed for AI workload acceleration. Named SLAC for Sparse Linear Algebra Cores, these are flexible, programmable, and optimized for the sparse linear algebra that underpins all neural network computation (think of these as FPGA-like, programmable arrays of cores). SLAC's programmability ensures cores can run all neural network algorithms in the constantly changing machine learning field - this is a chip that can adapt to different workloads and AI-related problem solving and training - a requirement for such expensive deployments as the Wafer Scale Engine will surely pose.
The entire chip and its accompanying deployment apparatus had to be developed in-house. As founder and CEO Andrew Feldman puts it, there were no packaging, printed circuit boards, connectors, cold plates, tools or any software that could be adapted towards the manufacturing and deployment of the Wafer Scale Engine. This means that Cerebras Systems' and its team of 173 engineers had to develop not only the chip, but almost everything else that is needed to make sure it actually works. The Wafer Scale Engine consumes 15 kilowatts of power to operate - a prodigious amount of power for an individual chip, although relatively comparable to a modern-sized AI cluster. This is a cluster, in essence, but deployed in a solo chip with none of the latency and inter-chip communication hassles that plague clusters.
In an era where companies are looking towards chiplet design and inter-chip communication solutions as ways to tackle the increasing challenges of manufacturing density and decreasing yields, Cerebras' effort proves that there is still a way of developing monolithic chips that place performance above all other considerations.
View at TechPowerUp Main Site
The Wafer Scale engine has been developed by Cerebras Systems to face the ongoing increase in demand for AI-training engines. However, in workloads where latency occur a very real impact in training times and a system's capability, Cerebras wanted to design a processor that avoided the need for a communication lane for all its cores to communicate - the system is only limited, basically, by transistors' switching times. Its 400,000 cores communicate seamlessly via interconnects, etched on 42,225 square millimeters of silicon (by comparison, NVIDIA's largest GPU is 56.7 times smaller at "just" 815 square millimeters).
However, in a world where silicon wafer manufacturing still has occurrences of manufacturing defects that can render whole chips inoperative, how did Cerebras manage to build such a large processor and keep it from having such defects that it can't actually deliver on the reported specs and performance? The answer is an old one, mainly: redundancy, paired with some additional magical engineering powders achieved in conjunction with the chips' manufacturer, TSMC. The chip is built on TSMC's 16 nm node - a more refined process with proven yields, cheaper than a cutting-edge 7 nm process, and with less areal density - this would make it even more difficult to properly cool those 400,000 cores, as you may imagine.
Cross-reticle connectivity, yield, power delivery, and packaging improvements have all been researched and deployed by Cerebras in solving the scaling problems associated with such large chips. moreover, the chips is built with redundant features that should ensure that even if some defects arise in various parts of the silicon chip, the areas that have been designed as "overprovisioning" can cut in an pick up the slack, routing and processing data without skipping a beat. Cerebras says any given component (cores, SRAM, etc) of the chip features 1%, 1.5% of additional overprovisioning capability that enables any manufacturing defects to be just a negligible speedbump instead of a silicon-waster.
The inter-core communication solution is one of the most advanced ever seen, with a fine-grained, all-hardware, on-chip mesh-connected communication network dubbed Swarm that delivers an aggregate bandwidth of 100 petabits per second.. this is paired with 18 Gb of local, distributed, superfast SRAM memory as the one and only level of the memory hierarchy - delivering memory bandwidth in the realm of 9 petabytes per second.
The 400,000 cores are custom-designed for AI workload acceleration. Named SLAC for Sparse Linear Algebra Cores, these are flexible, programmable, and optimized for the sparse linear algebra that underpins all neural network computation (think of these as FPGA-like, programmable arrays of cores). SLAC's programmability ensures cores can run all neural network algorithms in the constantly changing machine learning field - this is a chip that can adapt to different workloads and AI-related problem solving and training - a requirement for such expensive deployments as the Wafer Scale Engine will surely pose.
The entire chip and its accompanying deployment apparatus had to be developed in-house. As founder and CEO Andrew Feldman puts it, there were no packaging, printed circuit boards, connectors, cold plates, tools or any software that could be adapted towards the manufacturing and deployment of the Wafer Scale Engine. This means that Cerebras Systems' and its team of 173 engineers had to develop not only the chip, but almost everything else that is needed to make sure it actually works. The Wafer Scale Engine consumes 15 kilowatts of power to operate - a prodigious amount of power for an individual chip, although relatively comparable to a modern-sized AI cluster. This is a cluster, in essence, but deployed in a solo chip with none of the latency and inter-chip communication hassles that plague clusters.
In an era where companies are looking towards chiplet design and inter-chip communication solutions as ways to tackle the increasing challenges of manufacturing density and decreasing yields, Cerebras' effort proves that there is still a way of developing monolithic chips that place performance above all other considerations.
View at TechPowerUp Main Site