Thursday, June 30th 2022

AMD WMMA Instruction is Direct Response to NVIDIA Tensor Cores

AMD's RDNA3 graphics IP is just around the corner, and we are hearing more about the upcoming architecture. Historically, as GPUs advance, it is not unusual for companies to add dedicated hardware blocks to accelerate specific tasks. Today, AMD engineers updated the AMDGPU backend of the LLVM compiler to include a new instruction called Wave Matrix Multiply-Accumulate (WMMA). This instruction will be present in GFX11, the RDNA3 GPU architecture. With WMMA, AMD will support matrix multiply-accumulate operations on 16x16x16 tiles in FP16 and BF16 precision formats, closely mirroring the work NVIDIA is doing with Tensor Cores.
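For context, the operation each WMMA instruction performs across a wavefront is a full 16x16x16 matrix multiply-accumulate, D = A x B + C. The plain C++ reference below is an illustrative sketch of the math only - not the hardware data layout or the real intrinsics - and it uses the compiler's _Float16 extension type:

// Scalar reference for what one WMMA instruction computes per wavefront:
// a 16x16x16 tile multiply-accumulate with FP16 inputs, accumulated here
// in float for clarity.
constexpr int M = 16, N = 16, K = 16;

void wmma_reference(const _Float16 A[M][K], const _Float16 B[K][N],
                    const float C[M][N], float D[M][N])
{
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = C[m][n];                        // start from the accumulator tile
            for (int k = 0; k < K; ++k)
                acc += float(A[m][k]) * float(B[k][n]); // FP16 multiply, wider accumulate
            D[m][n] = acc;
        }
}

The point of the new instruction is to execute this entire tile as one wave-level operation instead of hundreds of per-lane fused multiply-adds.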

The AMD ROCm 5.2 API update describes the use case for this type of instruction, which you can see below:
rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.

rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.
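To make the fragment model above concrete, here is a minimal HIP kernel sketch in which one wavefront computes a single 16x16 output tile. This is a sketch under assumptions, not code from the library's documentation: the fragment, load_matrix_sync, mma_sync and store_matrix_sync names are taken from the rocWMMA headers, which mirror the familiar nvcuda::wmma API, and exact template parameters may differ between releases.

#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

constexpr int M = 16, N = 16, K = 16;

// One wavefront computes one 16x16 tile: D = A x B + C.
// Assumes A is row-major (M x K), B is col-major (K x N), C/D row-major.
__global__ void wmma_tile(const rocwmma::float16_t* a,
                          const rocwmma::float16_t* b,
                          const rocwmma::float32_t* c,
                          rocwmma::float32_t*       d,
                          int lda, int ldb, int ldc, int ldd)
{
    // Fragments are per-lane slices of each tile, distributed across the wave.
    rocwmma::fragment<rocwmma::matrix_a, M, N, K, rocwmma::float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, M, N, K, rocwmma::float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, M, N, K, rocwmma::float32_t> fragC;

    rocwmma::load_matrix_sync(fragA, a, lda);
    rocwmma::load_matrix_sync(fragB, b, ldb);
    rocwmma::load_matrix_sync(fragC, c, ldc, rocwmma::mem_row_major);

    // The multiply-accumulate step - on GFX11 this is where the compiler
    // would emit the new WMMA instructions directly into the kernel.
    rocwmma::mma_sync(fragC, fragA, fragB, fragC);

    rocwmma::store_matrix_sync(d, fragC, ldd, rocwmma::mem_row_major);
}

Because the acceleration is compiled straight into the device code, there is no external runtime library to link against and no separate kernel launch - which is exactly the benefit the description above calls out.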
Source: via VideoCardz

79 Comments on AMD WMMA Instruction is Direct Response to NVIDIA Tensor Cores

#1
nguyen
Oh so apparently tensor cores are not so useless now :roll:
#2
Vayra86
Well well, so the consensus is moving towards dedicated hardware.

Let's see where RDNA3's power budget goes...

I need to read better it seems
#3
Bomby569
nguyen: Oh so apparently tensor cores are not so useless now :roll:
they are if you don't have them, they aren't if you have them :D
#4
ARF
This is good news. I hope AMD will show us how much the ray-tracing performance is improved from RDNA 2 to RDNA 3 - it should be multiple times, because RDNA 2's RT performance was abysmal.
#5
Bomby569
ARF: This is good news. I hope AMD will show us how much the ray-tracing performance is improved from RDNA 2 to RDNA 3 - it should be multiple times, because RDNA 2's RT performance was abysmal.
Ray tracing is a joke anyway; no one is missing much by not having it. It's true we haven't had many AAA releases, but even after so long we have only a few games that have done anything meaningful with it. It seems more like a must-have buzzword for the box than anything else.
#6
ARF
Bomby569: Ray tracing is a joke anyway; no one is missing much by not having it. It's true we haven't had many AAA releases, but even after so long we have only a few games that have done anything meaningful with it. It seems more like a must-have buzzword for the box than anything else.
It is not a joke, because it makes the NVIDIA cards fly off the shelves - the NVIDIA users claim it's better because:
1. Ray-tracing;
2. DLSS;
3. CUDA, tensor, etc...
#7
Patriot
This is the third generation of matrix cores from AMD (after CDNA 1 and 2), now getting added to the consumer line as they unify a certain set of features for ROCm support across CDNA3 and RDNA3.
#8
Vayra86
ARF: It is not a joke, because it makes the NVIDIA cards fly off the shelves - the NVIDIA users claim it's better because:
1. Ray-tracing;
2. DLSS;
3. CUDA, tensor, etc...
You mean flying off the pallets to miners...?

I haven't seen anything on shelves for a looong time tbh. It's just recently that we're getting some semblance of normal availability back, and as usual, Nvidia is faster at restocking the sales channels.
#9
Unregistered
nguyen: Oh so apparently tensor cores are not so useless now :roll:
They are not useless per se, as they do something - especially for nVidia, which I believe just wants to focus on professional applications of ML. For games I still think they are useless; the games IMO still look awful, mainly due to low polygon counts and textures. I don't see the point of having pretty lighting but cubes instead of mugs or balls, and textures of such low quality (though VRAM is even more important).
Wasting silicon on special hardware just for some ML isn't the right way; once we achieve perfect geometry, then I'm all for it.
#10
TheoneandonlyMrK
nguyen: Oh so apparently tensor cores are not so useless now :roll:
Except unlike Nvidia, AMD seem to have not gone with separate fixed-function hardware, and instead use the 64-wide wavefronts they already had, via new instructions and likely bigger registers.

More information is required though, tbf, but this doesn't sound like specialised fixed-function hardware like tensor cores to me - just optimised use of what their SIMD array could theoretically do.

"rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts."

As they say.
#11
nguyen
Xex360: They are not useless per se, as they do something - especially for nVidia, which I believe just wants to focus on professional applications of ML. For games I still think they are useless; the games IMO still look awful, mainly due to low polygon counts and textures. I don't see the point of having pretty lighting but cubes instead of mugs or balls, and textures of such low quality (though VRAM is even more important).
Wasting silicon on special hardware just for some ML isn't the right way; once we achieve perfect geometry, then I'm all for it.
Just don't play games then, wait until you are on your deathbed, I'm sure games will look amazing then
#12
Vayra86
TheoneandonlyMrK: Except unlike Nvidia, AMD seem to have not gone with separate fixed-function hardware, and instead use the 64-wide wavefronts they already had, via new instructions and likely bigger registers.

More information is required though, tbf, but this doesn't sound like specialised fixed-function hardware like tensor cores to me - just optimised use of what their SIMD array could theoretically do.

"rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts."

As they say.
Wow, you read better than I did.
#13
jigar2speed
ARF: It is not a joke, because it makes the NVIDIA cards fly off the shelves - the NVIDIA users claim it's better because:
1. Ray-tracing;
2. DLSS;
3. CUDA, tensor, etc...
RTX 3070 user here, they are useless for me.
#14
ARF
jigar2speed: RTX 3070 user here, they are useless for me.
Sell the card and buy a 16 GB Radeon RX 6800 XT or a 12 GB Radeon RX 6700 XT.

Why did you buy it in the first place?
#15
Xajel
nguyen: Oh so apparently tensor cores are not so useless now :roll:
AMD's idea leans away from fixed-function ASICs (like the tensor cores), which can be considered somewhat of a waste of silicon, because that die area will sit doing nothing most of the time; they'd like the silicon die to do other things as well.

A single-function ASIC is easier to make and faster to implement; that's why NV didn't have to do more and could add it quicker.

AMD's way is harder and needs more engineering work, so even after NV announced it, they took some time to implement a similar function. But the benefit is greater, as they're mostly reusing the same silicon die space they had before; it's just tweaked to do more specialised work while still being able to do other things at the same time, so it won't be like a fixed-function ASIC that can only do a single thing.

It's like saying NV added 15% more die area to get this function, while AMD took two more years but only needed 5% more die area. And they might be able to use it for other things in the future as well.
#17
TheoneandonlyMrK
Vayra86: Wow, you read better than I did.
Well, back when rapid packed math was introduced I couldn't believe they had not also incorporated this. You have a 64-wide wavefront that can already do multiple math ops on one wave in one pass, so why not do quadratics like that? Clearly, patience was needed on my part.
#18
nguyen
Xajel: AMD's idea leans away from fixed-function ASICs (like the tensor cores), which can be considered somewhat of a waste of silicon, because that die area will sit doing nothing most of the time; they'd like the silicon die to do other things as well.

A single-function ASIC is easier to make and faster to implement; that's why NV didn't have to do more and could add it quicker.

AMD's way is harder and needs more engineering work, so even after NV announced it, they took some time to implement a similar function. But the benefit is greater, as they're mostly reusing the same silicon die space they had before; it's just tweaked to do more specialised work while still being able to do other things at the same time, so it won't be like a fixed-function ASIC that can only do a single thing.

It's like saying NV added 15% more die area to get this function, while AMD took two more years but only needed 5% more die area. And they might be able to use it for other things in the future as well.
Same with RT then, AMD's implementation of RT is just weak sauce.

Tensor cores are on their 4th gen with Ada now, and probably take less than 5% of the die space.
R0H1T: If you say so :slap:

www.tomshardware.com/news/nvidia-rtx-gpus-worth-the-money,37689.html
Never gets old, does it :laugh:
Well, if money is everything to you, then why are you spending it on useless PC stuff anyway?
#19
R0H1T
What do you mean? Selling someone/anything on that kind of sales pitch is just bad, period ~ I'd rather see (all) wars end in my lifetime than be hung up on "real-time" ray tracing!

And yes, all of us can do little things to make that day come forth.
#20
nguyen
R0H1T: What do you mean? Selling someone/anything on that kind of sales pitch is just bad, period ~ I'd rather see (all) wars end in my lifetime than be hung up on "real-time" ray tracing!

And yes, all of us can do little things to make that day come forth.
Well, IMO GamersNexus was being a dumbass for calling that article out, knowing now that those GPUs get to keep insane resale value 3.5 years after launch.
#21
beedoo
nguyen: Same with RT then, AMD's implementation of RT is just weak sauce.
Can we give up this "A < B, therefore A must be crap" mentality - it's getting boring.
#22
Vya Domus
You guys realize this is completely irrelevant for consumers, right?
#23
R0H1T
nguyen: Well, IMO GamersNexus was being a dumbass for calling that article out, knowing now that those GPUs get to keep insane resale value 3.5 years after launch.
Purely in terms of resale value, yes, it's probably outdone any other dGPU in the past - but then you forgot the backdrop? A once-in-100-years global pandemic. As for your particular point about Tensor cores - correct me if I'm wrong, but outside of DLSS are they actually that useful anywhere else? The way things are shaping up right now, DLSS vs FSR will end up almost exactly like G-Sync vs FreeSync!

Unless of course Nvidia is willing to throw another billion or two each year for the next decade or so.
#24
Bomby569
Vya Domus: You guys realize this is completely irrelevant for consumers, right?
Nvidia spent time and money and even renamed the whole GPU brand, and AMD seems to be trying all it can to post some nice benchmarks - and yet for us consumers RTX is nothing, zero, a couple of games, a gimmick.

The time and money spent on this is absurd, and they keep piling on.
#25
nguyen
R0H1T: Purely in terms of resale value, yes, it's probably outdone any other dGPU in the past - but then you forgot the backdrop? A once-in-100-years global pandemic. As for your particular point about Tensor cores - correct me if I'm wrong, but outside of DLSS are they actually that useful anywhere else? The way things are shaping up right now, DLSS vs FSR will end up almost exactly like G-Sync vs FreeSync!
Tensor cores: DLDSR, Nvidia Broadcast, Nvidia Canvas, RT denoising (or even RT upscaling).

Well, AMD is adding instructions for ML - could that be for FSR 3.0, I wonder :roll: