Friday, October 21st 2022

IBM Artificial Intelligence Unit (AIU) Arrives with 23 Billion Transistors

IBM Research has published information about the company's latest processor for accelerating Artificial Intelligence (AI). The new IBM processor, called the Artificial Intelligence Unit (AIU), tackles the problem of creating an enterprise solution for AI deployment that fits in a PCIe slot. The IBM AIU is a half-height PCIe card with a processor built from 23 billion transistors and manufactured on a 5 nm node (presumably TSMC's). While IBM has not provided many details so far, we know that the AIU uses the AI engine found in Telum, the chip at the core of the IBM z16 mainframe. The AIU takes Telum's AI engine and scales it up to 32 cores to achieve high efficiency.

The company has highlighted two main paths for enterprise AI adoption. The first is to embrace lower precision and use approximate computing, dropping from 32-bit formats to smaller, less common bit formats that hold a quarter as much precision and still deliver similar results. The other, as IBM puts it, is that an "AI chip should be laid out to streamline AI workflows. Because most AI calculations involve matrix and vector multiplication, our chip architecture features a simpler layout than a multi-purpose CPU. The IBM AIU has also been designed to send data directly from one compute engine to the next, creating enormous energy savings."
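
As a rough illustration of the first path, here is a minimal sketch of dropping 32-bit floating-point values to 8 bits. It assumes a symmetric INT8 scheme with one shared scale factor; IBM has not said which formats the AIU uses, so the function names and the sample `weights` below are hypothetical and only show the general idea.

```python
def quantize_int8(values):
    """Symmetric quantization of floats to signed 8-bit integers (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [x * scale for x in q]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why results are usually similar rather than identical.
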
In the sea of AI accelerators, IBM hopes to differentiate its offerings by having an enterprise chip to solve more complex problems than current AI chips target. "Deploying AI to classify cats and dogs in photos is a fun academic exercise. But it won't solve the pressing problems we face today. For AI to tackle the complexities of the real world—things like predicting the next Hurricane Ian, or whether we're heading into a recession—we need enterprise-quality, industrial-scale hardware. Our AIU takes us one step closer. We hope to soon share news about its release," says the official IBM release.
Source: IBM Research

15 Comments on IBM Artificial Intelligence Unit (AIU) Arrives with 23 Billion Transistors

#1
BorisDG
Doesn't sound impressive to me when you consider that a small, nail-sized A16 has 16 billion.
#2
ZetZet
BorisDG: Doesn't sound impressive to me when you consider that a small, nail-sized A16 has 16 billion.
This is probably a similar size; it just has a heat spreader because they can push the chip instead of power-limiting it.
#3
lemonadesoda
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result.<< is just untrue, because there are only a few very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same result most of the time, but not the same result every time.

Crude, savage, approximate fast math has its applications for certain jobs. But let's not pretend this is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or the Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
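
The boundary effect described above can be sketched in a few lines. This is a minimal illustration, assuming symmetric INT8 quantization with a shared scale, and using Python floats (64-bit) as a stand-in for the higher-precision path. The vectors `far` and `near` are hypothetical values picked so that one sits well away from the decision boundary `score > 0` and the other sits right next to it.

```python
def quantize_int8(values):
    """Symmetric quantization to signed 8-bit integers (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def dot_fp(w, x):
    """Reference dot product in floating point."""
    return sum(a * b for a, b in zip(w, x))


def dot_int8(w, x):
    """Dot product computed on quantized values, then rescaled."""
    qw, sw = quantize_int8(w)
    qx, sx = quantize_int8(x)
    acc = sum(a * b for a, b in zip(qw, qx))  # exact integer accumulation
    return acc * sw * sx


x = [1.0, 1.0]
far = [1.0, -0.4]     # score is about 0.6, far from the boundary
near = [1.0, -0.997]  # score is about 0.003, right at the boundary

for name, w in (("far", far), ("near", near)):
    print(name, dot_fp(w, x) > 0, dot_int8(w, x) > 0)
```

For `far`, both paths agree and the scores differ only slightly. For `near`, both weights round to the same magnitude, the quantized score collapses to zero, and the decision `score > 0` flips.
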
#4
Ferrum Master
lemonadesoda: that deviation can be ENORMOUS.
Maybe it is for weather forecasting? Could explain a lot :D
#6
GuiltySpark
lemonadesoda: I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result.<< is just untrue, because there are only a few very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same result most of the time, but not the same result every time.

Crude, savage, approximate fast math has its applications for certain jobs. But let's not pretend this is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or the Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
I agree with this, but you cannot expect TPU to explain all the pros and cons of neural-network quantization; there is enough theory on that for an entire degree course. Just noting that, in a classical NN, the space left to explore is still incredibly large even with INT8 (and INT4 as well) gives a glimpse of how applicable such reduced precision is.
#7
Wirko
lemonadesoda: I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result.<< is just untrue, because there are only a few very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same result most of the time, but not the same result every time.

Crude, savage, approximate fast math has its applications for certain jobs. But let's not pretend this is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or the Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
It's not just INT8; you also have FP8 (in several variants, maybe unsigned too), INT4, and more. All are usable as long as the input data and intermediate results are noisy/unreliable/biased enough, but sure, they can't replace 16-bit and 32-bit formats everywhere.

I'm also curious why it always has to be powers of 2. Formats like INT12 or FP12 would be usable in some cases too.

Edit: Nvidia claims that FP8 can replace FP16 with no loss of accuracy.
#8
Vya Domus
BorisDG: Doesn't sound impressive to me when you consider that a small, nail-sized A16 has 16 billion.
It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.

Silicon Valley is rife with startups that are making more or less the same kind of chips.
Wirko: I'm also curious why it always has to be powers of 2. Formats like INT12 or FP12 would be usable in some cases too.
It's not that it has to be powers of 2; it has to be multiples of 8 bits. It's very unusual and problematic to build processors that work on data whose size isn't a multiple of a byte. In practically every computing system, 1 byte is the basic unit of storage and processing, and anything that doesn't match that is awkward to integrate. INT4 still works fine because two INT4s fit in one byte. The original x87 floating-point format was 80 bits, which isn't a power of 2 but is a multiple of 8.
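
The point that two INT4s fit in one byte can be sketched as follows. This is a minimal illustration, assuming signed two's-complement 4-bit values in the range -8 to 7; the function names are hypothetical, and real hardware or libraries may order the nibbles differently.

```python
def pack_int4_pair(lo, hi):
    """Pack two signed 4-bit integers (-8..7) into one byte (0..255)."""
    for v in (lo, hi):
        if not -8 <= v <= 7:
            raise ValueError("value out of INT4 range")
    # Keep the low 4 bits of each value; 'hi' goes in the upper nibble.
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte):
    """Recover the two signed 4-bit integers from a packed byte."""
    def sign_extend(nibble):
        # Bit 3 is the sign bit of a 4-bit two's-complement value.
        return nibble - 16 if nibble & 0x8 else nibble
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)


packed = pack_int4_pair(-3, 7)
print(packed, unpack_int4_pair(packed))  # prints: 125 (-3, 7)
```

A 12-bit format has no such clean fit: two values need 24 bits, so elements would straddle byte boundaries or waste padding.
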
#9
mechtech
About 3 times as many transistors as the population of Earth. How much intelligence does this have? ;)
#10
Wirko
mechtech: About 3 times as many transistors as the population of Earth. How much intelligence does this have? ;)
Less than a stick of RAM :¬]
#11
GuiltySpark
Vya Domus: It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
BTW, the same could be said of GPUs, but I don't see anyone saying that's a simple architecture to build.
#12
First Strike
GuiltySpark: BTW, the same could be said of GPUs, but I don't see anyone saying that's a simple architecture to build.
He practically described how to design a "usable" ML inference chip. Being usable and being bleeding-edge are different things, and the GPU market has already matured to the point where only the top players survive.
#13
Wirko
Vya Domus: It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
What ML chips need is an advanced, flexible/programmable, high-bandwidth internal network. I don't know what sort of execution units they have; I'm just guessing that they don't even execute user code like CPUs and GPUs do, and that instead they can be parametrized to define the length of vectors, precision, data paths and so on.
#14
AsRock
TPU addict
lemonadesoda: I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result.<< is just untrue, because there are only a few very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same result most of the time, but not the same result every time.

Crude, savage, approximate fast math has its applications for certain jobs. But let's not pretend this is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or the Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
Hehe, maybe it works like their desktop CPUs back in the day (chug chug hot hot!)..

Either way, I'm not interested in anything AI-based, as you know it will get into the wrong hands; it's bad enough with what's available now.
#15
Everton0x10
AsRock: Hehe, maybe it works like their desktop CPUs back in the day (chug chug hot hot!)..

Either way, I'm not interested in anything AI-based, as you know it will get into the wrong hands; it's bad enough with what's available now.
Hi, what exactly did you mean by "bad enough with what's available now"?

What bad thing is going on that I don't know about?

I am just starting out in this area of AI.