Thursday, June 30th 2022
AMD WMMA Instruction is Direct Response to NVIDIA Tensor Cores
AMD's RDNA3 graphics IP is just around the corner, and we are hearing more information about the upcoming architecture. Historically, as GPUs advance, it is not unusual for companies to add dedicated hardware blocks to accelerate specific tasks. Today, AMD engineers updated the backend of the LLVM compiler to include a new instruction called Wave Matrix Multiply-Accumulate (WMMA). This instruction will be present on GFX11, which is the RDNA3 GPU architecture. With WMMA, AMD will offer support for processing 16x16x16 tensors in FP16 and BF16 precision formats. With these instructions, AMD adds dedicated support for matrix multiply-accumulate operations, closely mirroring the work NVIDIA is doing with Tensor Cores.
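For illustration, here is a minimal sketch of what issuing WMMA from HIP device code through a compiler builtin could look like. The builtin name __builtin_amdgcn_wmma_f32_16x16x16_f16_w32, the vector types, and the operand packing below are assumptions inferred from the LLVM patches, not a confirmed public API:

```cpp
// Hypothetical sketch, assuming the GFX11 WMMA instruction is exposed in
// Clang/HIP as __builtin_amdgcn_wmma_f32_16x16x16_f16_w32 (name, types, and
// packing are assumptions based on the LLVM patches, not a confirmed API).
#include <hip/hip_runtime.h>

typedef _Float16 half16 __attribute__((ext_vector_type(16))); // per-lane A/B fragment slice
typedef float    float8 __attribute__((ext_vector_type(8)));  // per-lane accumulator slice (wave32)

__global__ void wmma_f16_tile(const half16* a, const half16* b, float8* d)
{
    // Each lane supplies its register slice of the fragments; the wavefront
    // as a whole computes one 16x16x16 tile, D = A * B + C, in one instruction.
    int lane = blockIdx.x * blockDim.x + threadIdx.x;
    d[lane] = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a[lane], b[lane], d[lane]);
}
```

In practice, most developers would reach this instruction through a library such as rocWMMA rather than the raw builtin.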
Source: via VideoCardz
The AMD ROCm 5.2 API update lists the use case for this type of instruction, which you can see below:
rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.
rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.
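To make the fragment workflow above concrete, here is a minimal single-tile D = A x B + C kernel in the style of the rocWMMA C++ API. The header path, namespace members, and layout tags follow the patterns in the rocWMMA documentation but should be treated as assumptions rather than code verified against a specific release:

```cpp
// Minimal sketch of a one-tile D = A * B + C kernel with rocWMMA
// (header path and API names assumed from the rocWMMA documentation style).
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

constexpr int M = 16, N = 16, K = 16; // the WMMA tile shape

__global__ void wmma_gemm_tile(const float16_t* a, const float16_t* b,
                               float32_t* d, int lda, int ldb, int ldd)
{
    // Fragments spread a 16x16 tile across the registers of one wavefront.
    rocwmma::fragment<rocwmma::matrix_a, M, N, K, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, M, N, K, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, M, N, K, float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);     // start with C = 0
    rocwmma::load_matrix_sync(fragA, a, lda);  // cooperative, wave-wide loads
    rocwmma::load_matrix_sync(fragB, b, ldb);
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc); // D = A * B + C
    rocwmma::store_matrix_sync(d, fragAcc, ldd, rocwmma::mem_row_major);
}
```

Because rocWMMA is header-only, the fragment operations compile directly into the kernel, and on GFX11 the mma_sync call should lower to the new WMMA instruction.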
79 Comments on AMD WMMA Instruction is Direct Response to NVIDIA Tensor Cores
Well well, so the consensus is moving towards dedicated hardware.
Let's see where RDNA3's power budget goes...
I need to read better it seems
1. Ray-tracing;
2. DLSS;
3. CUDA, tensor, etc...
I haven't seen anything on shelves for a looong time tbh. It's only recently that we're getting some semblance of normal availability back, and as usual, Nvidia is faster at restocking the sales channels.
Wasting silicon on special hardware just for some ML isn't the right way; once we achieve perfect geometry, then I'm all for it.
More information is required though, tbf, but this doesn't sound like specialised fixed-function hardware like Tensor Cores to me, just optimised use of what their SIMD array could theoretically do.
"rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts.".
As they say.
Why did you buy it in the first place?
A single-function ASIC is easier to make and faster to implement; that's why NV didn't have to do more and could add it quicker.
AMD's way is harder and needs more engineering work, so even after NV announced it, they took some time to implement a similar function. But the benefit is greater: they're mostly reusing the same silicon die space they had before, just tweaked to do more specialised work while still being able to do other things at the same time, so it won't be like a fixed-function ASIC that can only do a single thing.
It's like saying NV added 15% more die area to get this function, while AMD took two more years but only needed 5% more die area. And they might be able to reuse it for other things in the future.
Never gets old, does it :laugh:
Tensor Cores are on their 4th gen with Ada now and probably take less than 5% of die space. Well, if money is everything to you, then why are you spending it on useless PC stuff anyway?
And yes, all of us can do little things to make that day come sooner.
Unless of course Nvidia is willing to throw another billion or two each year for the next decade or so.
The time and money spent on this is absurd, and they keep piling on.
Well, AMD is adding instructions for ML; could that be for FSR 3.0, I wonder :roll: