
Cerebras Systems' Wafer Scale Engine is a Trillion Transistor Processor in a 12" Wafer

Raevenlord

News Editor
This news isn't exactly today's, but it's relevant and interesting enough that I think it warrants a news piece on our page. My reasoning is this: in an era where Multi-Chip Modules (MCM) and a chiplet approach to processor fabrication have become the de facto standard for improving performance and yields, a trillion-transistor processor that eschews those modular design philosophies is interesting enough to give pause.

The Wafer Scale Engine has been developed by Cerebras Systems to meet the ongoing increase in demand for AI-training engines. In workloads where latency has a very real impact on training times and a system's capability, Cerebras wanted to design a processor that avoids the need for off-chip communication lanes between its cores - the system is basically limited only by the transistors' switching times. Its 400,000 cores communicate seamlessly via interconnects etched on 46,225 square millimeters of silicon (by comparison, NVIDIA's largest GPU is 56.7 times smaller at "just" 815 square millimeters).
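
For a sense of scale, here is a quick back-of-the-envelope check of those figures (a sketch using only the numbers quoted above):

```python
# Sanity-check the die-size figures quoted in the article.
wse_area_mm2 = 46_225      # Wafer Scale Engine die area
gpu_area_mm2 = 815         # largest NVIDIA GPU die, for comparison
wse_cores = 400_000        # cores on the WSE

print(f"Area ratio: {wse_area_mm2 / gpu_area_mm2:.1f}x")         # ~56.7x
print(f"Silicon per core: {wse_area_mm2 / wse_cores:.3f} mm^2")  # ~0.116 mm^2
```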





However, in a world where silicon wafer manufacturing still suffers defects that can render whole chips inoperative, how did Cerebras manage to build such a large processor and keep defects from preventing it from actually delivering on the reported specs and performance? The answer is mainly an old one: redundancy, paired with some additional engineering magic achieved in conjunction with the chip's manufacturer, TSMC. The chip is built on TSMC's 16 nm node - a more refined process with proven yields, cheaper than a cutting-edge 7 nm process, and with lower areal density - a denser process would make it even more difficult to properly cool those 400,000 cores, as you may imagine.
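
To see why redundancy is non-negotiable at this size, consider a simple Poisson yield model, Y = exp(-D·A). The defect density below is an assumed ballpark figure used purely for illustration, not TSMC's or Cerebras' actual number:

```python
import math

# Illustrative only: probability of a defect-free die shrinks exponentially
# with area, so a wafer-scale die must route around defects instead of
# hoping to avoid them.
defect_density_per_mm2 = 0.1 / 100   # assumed 0.1 defects/cm^2 (ballpark for a mature node)
dies = {"815 mm^2 GPU": 815, "Wafer Scale Engine": 46_225}

for name, area_mm2 in dies.items():
    expected_defects = defect_density_per_mm2 * area_mm2
    defect_free_prob = math.exp(-expected_defects)      # Poisson yield model
    print(f"{name}: ~{expected_defects:.1f} expected defects, "
          f"defect-free probability {defect_free_prob:.2e}")
```

With tens of expected defects on a wafer-scale die, the chance of a perfect one is effectively zero, which is why the design tolerates defects rather than trying to dodge them.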

Cross-reticle connectivity, yield, power delivery, and packaging improvements have all been researched and deployed by Cerebras to solve the scaling problems associated with such large chips. Moreover, the chip is built with redundant features that should ensure that even if defects arise in various parts of the silicon, the areas designed as "overprovisioning" can cut in and pick up the slack, routing and processing data without skipping a beat. Cerebras says any given component of the chip (cores, SRAM, etc.) features 1% to 1.5% of additional overprovisioning capability, which turns any manufacturing defect into a negligible speed bump instead of a silicon-waster.
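
As a rough sketch of what that overprovisioning buys: 1.5% of 400,000 cores is 6,000 spares, far more than the handful of defects expected on a wafer. The remapping scheme below is purely hypothetical and only illustrates the idea, not Cerebras' actual design:

```python
# Hypothetical illustration: map logical cores onto physical cores,
# substituting a spare core wherever a defect was found at test time.
total_cores = 400_000
spare_fraction = 0.015                       # upper end of the quoted 1-1.5%
spare_cores = int(total_cores * spare_fraction)

defective = {1_337, 42_000, 250_001}         # made-up defect locations
spares = iter(range(total_cores, total_cores + spare_cores))

core_map = {i: (next(spares) if i in defective else i) for i in range(total_cores)}

print(f"Spare cores available: {spare_cores}")               # 6,000
print(f"Defects remapped: {[core_map[i] for i in sorted(defective)]}")
```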



The inter-core communication solution is one of the most advanced ever seen: a fine-grained, all-hardware, on-chip mesh-connected communication network dubbed Swarm that delivers an aggregate bandwidth of 100 petabits per second. This is paired with 18 GB of local, distributed, superfast SRAM as the one and only level of the memory hierarchy, delivering memory bandwidth in the realm of 9 petabytes per second.
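
Dividing those headline figures by the core count gives a per-core picture (simple arithmetic on the quoted numbers, assuming an even split across cores):

```python
# Per-core share of the aggregate figures quoted above.
cores = 400_000
swarm_bw_bits = 100e15      # 100 petabits/s aggregate fabric bandwidth
sram_bytes = 18e9           # 18 GB of on-chip SRAM
mem_bw_bytes = 9e15         # 9 petabytes/s aggregate memory bandwidth

print(f"Fabric bandwidth per core: {swarm_bw_bits / cores / 1e9:.0f} Gb/s")    # ~250 Gb/s
print(f"SRAM per core:             {sram_bytes / cores / 1e3:.0f} KB")         # ~45 KB
print(f"Memory bandwidth per core: {mem_bw_bytes / cores / 1e9:.1f} GB/s")     # ~22.5 GB/s
```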

The 400,000 cores are custom-designed for AI workload acceleration. Named SLAC, for Sparse Linear Algebra Cores, they are flexible, programmable, and optimized for the sparse linear algebra that underpins all neural network computation (think of these as FPGA-like, programmable arrays of cores). The SLACs' programmability ensures the cores can run all neural network algorithms in the constantly changing machine learning field - this is a chip that can adapt to different workloads and to AI-related problem solving and training, a requirement for deployments as expensive as the Wafer Scale Engine will surely be.
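
For a flavor of the kind of kernel such cores are optimized for, here is a generic compressed-sparse-row (CSR) matrix-vector product that skips stored zeros entirely; this is an illustrative sketch, not Cerebras' actual kernel or API:

```python
# Generic sparse linear algebra example: y = A @ x with A in CSR form.
# Only non-zero weights are stored and multiplied, which is the whole point
# of sparse acceleration.
def csr_matvec(values, col_idx, row_ptr, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # zeros never touched
    return y

# A 3x3 weight matrix with ~66% sparsity:
# [[2, 0, 0],
#  [0, 0, 3],
#  [0, 1, 0]]
values  = [2.0, 3.0, 1.0]
col_idx = [0, 2, 1]
row_ptr = [0, 1, 2, 3]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))   # [2.0, 9.0, 2.0]
```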



The entire chip and its accompanying deployment apparatus had to be developed in-house. As founder and CEO Andrew Feldman puts it, there were no packaging, printed circuit boards, connectors, cold plates, tools, or any software that could be adapted to the manufacturing and deployment of the Wafer Scale Engine. This means that Cerebras Systems and its team of 173 engineers had to develop not only the chip, but almost everything else needed to make sure it actually works. The Wafer Scale Engine consumes 15 kilowatts of power to operate - a prodigious amount for an individual chip, although roughly comparable to a modern-sized AI cluster. This is a cluster, in essence, but deployed on a single chip with none of the latency and inter-chip communication hassles that plague clusters.
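
Put in per-core and per-area terms, that 15 kW figure is less outlandish than it first sounds (again, simple arithmetic on the article's numbers):

```python
# Per-core and per-area power, from the figures quoted in the article.
power_w = 15_000
cores = 400_000
area_mm2 = 46_225

print(f"Power per core: {power_w / cores * 1000:.1f} mW")         # ~37.5 mW
print(f"Power density:  {power_w / area_mm2 * 100:.0f} W/cm^2")   # ~32 W/cm^2
```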

In an era where companies are looking towards chiplet design and inter-chip communication solutions as ways to tackle the increasing challenges of manufacturing density and decreasing yields, Cerebras' effort proves that there is still a way of developing monolithic chips that place performance above all other considerations.

View at TechPowerUp Main Site
 
Impressive but still, putting these things in the same category with other monolithic GPUs and CPUs is a stretch.
 
Can it play Crysis?
 
Truly impressive.

I do wonder how system integration will work, however. The chip is quite large, and integrating something like that on a PCB would be difficult. Also, thermal expansion of the chip is quite likely given the huge amount of heat. Can't wait to see how they will solve those problems.
 
Truly impressive.

I do wonder how system integration will work, however. The chip is quite large, and integrating something like that on a PCB would be difficult. Also, thermal expansion of the chip is quite likely given the huge amount of heat. Can't wait to see how they will solve those problems.
From what I have read, they are already in use, and they had to implement power delivery with vertical copper planes because a flat PCB cannot support the current within thermal specs. The cooling comes from several vertical, high-pressure water streams.
 
This is truly an advancement, managing to do something everyone has been trying to crack since the dawn of wafer manufacturing.

And it is not a simple solution either, since they not only had to solve the problem at hand but also design new advanced tools and software to actually pull it off.

They have also already manufactured wafers and are ready to introduce their manufacturing process to the world.

Often when you hear about new stuff like this, it is only a working theory on the drawing board with 10-15 years of work before a final product.

15 kilowatts is a little hot, BUT imagine this tech on 5 nm in the future at 3 kilowatts.

Bet they are already working on 3D-stacking these monsters.
 
Funny thing is, the cooling of this chip will be the easier part. Since this is a totally custom solution, they can just integrate whatever cooling solution they want into the package, be it water or gas. I would do it with a gas solution with a compressor and an option to use the excess heat energy to actually heat the building.
 
Very impressive
 
And so, Skynet was born.

A bit bigger in size than what we've seen on the Big Screen, but give it time and it will fit in a T-800's head.
 
There are so many companies creating chips for AI that I wonder if NVIDIA really has a future in this with GPUs, because GPUs are not specifically made for AI. I don't mean a 2-3 year future, but 5-10 years.
 
And so, Skynet was born.

A bit bigger in size than what we've seen on the Big Screen, but give it time and it will fit in a T-800's head.

Skynet won't fit in anything, because it's not hardware. You never actually see Skynet; the movies merely feature the instruments it can control.

Going by the story of the third movie, the problem happens when Skynet "gets out" onto the internet, gaining a huge amount of compute power by "infecting" all connected devices and becoming self-aware.
 
...I wonder if Nvidia really has a future in this with GPUs, because GPUs are not specifically made for AI. I don't mean a 2-3 year future, but 5-10 years.

Nvidia is already prototyping their own dedicated AI chips. That ought to answer your question.

 
how does one feed data to such a monster...

interested to see how they will provide the bandwidth this needs in order to process data at capacity.
 
how does one feed data to such a monster...

interested to see how they will provide the bandwidth this needs in order to process data at capacity.
The enormous bandwidth to feed the cores stays on-die.
This is paired with 18 GB of local, distributed, superfast SRAM as the one and only level of the memory hierarchy, delivering memory bandwidth in the realm of 9 petabytes per second.
 
The enormous bandwidth to feed the cores stays on-die.

But how do you feed the die? Once the data is on the die it's fine... but at 9 petabytes per second and only 18 GB, something has got to connect to it. Would be interesting to see what that is.
 
But how do you feed the die? Once the data is on the die it's fine... but at 9 petabytes per second and only 18 GB, something has got to connect to it. Would be interesting to see what that is.
Remember that the 9 petabytes per second is internal to the die.

At the moment, AI research may be done on a GPU with 8 GiB to 24 GiB of RAM; the complete dataset might not fit in GPU RAM, so it is processed in batches.
In the same way, datasets might be loaded into the internal 18 GiB of memory on the new beast.

For comparison, a Radeon VII has 3,840 shading units and 1 TB/s of memory bandwidth to its 16 GiB of on-board RAM. This new chip has basically moved all of that onto one die, with 9,000x the access speed and 100x the number of cores.
A modern-day GPU doing AI would be fed over the PCIe bus; a Gen 4 x16 link is good for roughly 32 GB/s. Since this is a basic data dump (from system memory, if you wish to sustain that speed for all 16 GB going to the GPU), it requires little to no computation and roughly half a second of write time.

In the same way, filling the 18 GiB of on-board memory could be accomplished in less than 5 seconds from a PCIe Gen 4 x4 NVMe drive. If your computation takes 20 minutes, that is not the big problem.
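
A quick script version of that feeding math, using approximate peak link speeds rather than anything measured on the actual Cerebras system:

```python
# Time to fill 18 GiB of on-chip SRAM over a few plausible host links.
# Link speeds are approximate peak figures, for illustration only.
sram_bytes = 18 * 1024**3          # 18 GiB

links_gb_per_s = {
    "PCIe 3.0 x16": 15.8,
    "PCIe 4.0 x16": 31.5,
    "PCIe 4.0 x4 NVMe drive": 7.9,
}

for name, bw in links_gb_per_s.items():
    seconds = sram_bytes / (bw * 1e9)
    print(f"{name:24s} ~{seconds:.1f} s to fill 18 GiB")
```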
 
Can it run a Prius?
 
This is the new future of computing, all on a single die; I'm sure a lot of those transistors are fast, math-accelerated paths. A few of these and we will have AI that is closer to human than to supercomputing.
 