
AMD Instinct MI200: Dual-GPU Chiplet; CDNA2 Architecture; 128 GB HBM2E

Joined
Apr 24, 2020
Messages
2,837 (1.61/day)
A100s were sold in packs of 10, if I'm not mistaken, for $200k. I don't see why AMD would ask half that sum for a vastly faster product.

I've seen PCIe versions of the A100 quoted at $10k. The HGX version probably costs more, and may be the one you're talking about at 10-for-$200k (I've never seen the HGX version quoted personally).

The PCIe versions won't have cache coherency and will have fewer links. Anyone who wants two or fewer MI200s (or A100s) probably wants the PCIe version. The HGX A100 / OAM MI200 is really for customers who run 4x or 8x GPUs or more (which is probably why it makes sense to sell them in packs of 10).
 

Sol

New Member
Joined
Nov 10, 2021
Messages
4 (0.00/day)
I don't get how that math works.
A100 - 54 billion transistors
MI200 - 58 billion transistors (29 + 29), yet it runs circles around A100.


You mean 22% faster is "barely faster"?
ADL was recently praised like a miracle for being this kind of "barely faster", and in ST only, lmao.
 
Joined
Jun 5, 2021
Messages
284 (0.21/day)
I don't get how that math works.
A100 - 54 billion transistors
MI200 - 58 billion transistors (29 + 29), yet it runs circles around A100.


You mean 22% faster is "barely faster"?
It's basically 4 RX 6900 XTs glued together, of course it's gonna beat it.
 
Joined
Mar 10, 2010
Messages
11,880 (2.17/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Gskill Trident Z 3900cas18 32Gb in four sticks./16Gb/16GB
Video Card(s) Asus tuf RX7900XT /Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores laptop Timespy 6506
Still dual GPUs acting as one. Yes, faster with the new fabric, but still the same basic idea. When you get into 4K or even 8K gaming, a single card cannot handle pushing over 100 fps. I tried that with one 1070 Ti, forget it. My two in SLI can push 100 fps easy, but then Nvidia and AMD dropped that tech. I bought a 3080 Ti and tried it at 4K, and it could not push 100 fps consistently. Yeah, it looks great, but my eyes can see the lag and frame buffering trying to keep up. Now a lucky friend of mine has two 3090s in SLI, and man, 8K at 150 fps looks so clean and perfect. But I do not have 5 grand lying around to afford such nice things.
This isn't CrossFire: it doesn't even have a video output, it isn't running games, and it doesn't work the way you think. This IS new tech.
 
Joined
Oct 12, 2005
Messages
735 (0.10/day)
This isn't CrossFire: it doesn't even have a video output, it isn't running games, and it doesn't work the way you think. This IS new tech.
You're right that it isn't CrossFire. But the thing here is that the system still sees both chips independently, not as one. The advantage is that they are linked by a very fast Infinity Fabric link (800 GB/s), which is much faster than going through PCIe (where frequently the second card, or both, was running at PCIe 3.0 x8, i.e. about 8 GB/s with much higher latency).

And that is one of the key things here: latency. With the two chips so close together, each can access the other chip's memory with minimal impact compared to going through the PCIe bus or any other external connection. And lastly, something the spec sheets don't really tell you is what AMD implemented for cache coherency and memory sharing.
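To make that concrete, here is a minimal sketch of how software would check for and enable that fast peer path between the two dies. This is my own illustration using standard HIP runtime calls, and it assumes the module simply enumerates as devices 0 and 1; it is not taken from the MI200 whitepaper.

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    hipGetDeviceCount(&n);                     // one MI200 OAM shows up as two devices
    std::printf("devices visible: %d\n", n);

    int canAccess = 0;
    hipDeviceCanAccessPeer(&canAccess, 0, 1);  // can die 0 read/write die 1's HBM?
    if (canAccess) {
        hipSetDevice(0);
        hipDeviceEnablePeerAccess(1, 0);       // map die 1's memory into die 0's address space
        hipSetDevice(1);
        hipDeviceEnablePeerAccess(0, 0);       // and the other way around
        // From here, a kernel on one die can dereference pointers that live in the other
        // die's HBM; the traffic rides the in-package Infinity Fabric links instead of PCIe,
        // which is where the bandwidth and latency advantage shows up.
    }
    return 0;
}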
 
Joined
Jun 5, 2021
Messages
284 (0.21/day)
I meant
Except that it's not; RDNA and CDNA are different architectures. CDNA is more compute-oriented and designed for compute-heavy workloads, whereas RDNA is designed for graphics workloads.
In transistor count, the RX 6900 XT has 26.8 billion and this has 29 billion per chiplet (29 + 29 in total).
 
Joined
Mar 10, 2010
Messages
11,880 (2.17/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Gskill Trident Z 3900cas18 32Gb in four sticks./16Gb/16GB
Video Card(s) Asus tuf RX7900XT /Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores laptop Timespy 6506
You're right that it isn't CrossFire. But the thing here is that the system still sees both chips independently, not as one. The advantage is that they are linked by a very fast Infinity Fabric link (800 GB/s), which is much faster than going through PCIe (where frequently the second card, or both, was running at PCIe 3.0 x8, i.e. about 8 GB/s with much higher latency).

And that is one of the key things here: latency. With the two chips so close together, each can access the other chip's memory with minimal impact compared to going through the PCIe bus or any other external connection. And lastly, something the spec sheets don't really tell you is what AMD implemented for cache coherency and memory sharing.
Read the white paper: it's seen as one chip, with die one as the master and die two as its slave. Not like anything else, I might add: new tech, new IP, new ways.

Hopefully they will carry over well to consumer cards.
 
Joined
Oct 12, 2005
Messages
735 (0.10/day)
Read the white paper: it's seen as one chip, with die one as the master and die two as its slave. Not like anything else, I might add: new tech, new IP, new ways.

Hopefully they will carry over well to consumer cards.
Do you have that whitepaper?

Most people I see say it shows up to the OS as two chips, each a 64 GB device (but with many tools for memory coherency).

In the whitepaper I have, nothing says what you're describing.

I know what you mean; that is what leakers said RDNA 3 will be, but it does not look like that is the case for this architecture. But these are made to be grouped together in large clusters, so in the end that does not really matter much, as long as you are able to split your code and data into chunks that each GPU can digest.
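As a rough illustration of that splitting, here is a short sketch that gives each visible device its own slice of the data. This is my own example (the scale kernel is made up for illustration), assuming the module enumerates as two HIP devices; it is not from the whitepaper.

#include <hip/hip_runtime.h>
#include <vector>

__global__ void scale(float* x, size_t n, float a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t N = 1 << 24;
    std::vector<float> host(N, 1.0f);

    int ndev = 0;
    hipGetDeviceCount(&ndev);                    // e.g. 2 for one MI200 OAM
    const size_t chunk = N / ndev;               // assumes N divides evenly, for brevity

    std::vector<float*> dev(ndev);
    for (int d = 0; d < ndev; ++d) {             // each die gets its own chunk of the array
        hipSetDevice(d);
        hipMalloc((void**)&dev[d], chunk * sizeof(float));
        hipMemcpy(dev[d], host.data() + d * chunk, chunk * sizeof(float), hipMemcpyHostToDevice);
        hipLaunchKernelGGL(scale, dim3((unsigned)((chunk + 255) / 256)), dim3(256), 0, 0,
                           dev[d], chunk, 2.0f);
    }
    for (int d = 0; d < ndev; ++d) {             // gather the results back
        hipSetDevice(d);
        hipMemcpy(host.data() + d * chunk, dev[d], chunk * sizeof(float), hipMemcpyDeviceToHost);
        hipFree(dev[d]);
    }
    return 0;
}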
 
Joined
Jun 5, 2021
Messages
284 (0.21/day)
Do you have that whitepaper?

Most people I see say it shows up to the OS as two chips, each a 64 GB device (but with many tools for memory coherency).

In the whitepaper I have, nothing says what you're describing.

I know what you mean; that is what leakers said RDNA 3 will be, but it does not look like that is the case for this architecture. But these are made to be grouped together in large clusters, so in the end that does not really matter much, as long as you are able to split your code and data into chunks that each GPU can digest.
That's gonna be hard for RDNA 3, getting the OS to see the GPU as one and not as SLI... plus aren't games having a hard time splitting the workload across combined GPUs?
 
Joined
Oct 12, 2005
Messages
735 (0.10/day)
That's gonna be hard for RDNA 3, getting the OS to see the GPU as one and not as SLI... plus aren't games having a hard time splitting the workload across combined GPUs?
Let's say CDNA2 is similar to first-gen Threadripper, where full Zen 1 chips were put on the same socket. For the OS it was similar to having multiple sockets, since each CPU had its own memory controller.

From what we are hearing, RDNA 3 might look a bit more like Zen 2/3, where there is some kind of I/O die. In that case, part of one chip could act as a bridge similar to the I/O die, or there could be a bridge between the two chips that does the same job.

The main thing is how to handle the different memory zones. In Zen 1 Threadripper there are multiple memory controllers to deal with (although with NUMA the OS can still present them as one system). In Zen 2 Threadripper there is just one memory controller and NUMA is not needed.

If RDNA 3 has just one die with memory and the second accesses it via a bridge, or there is an I/O die that also holds the memory controller, it could be seen by the OS as one chip. There is also the question of how they communicate with the OS: whether it's hidden behind an I/O die or has to go through the first "master die" to reach the PCIe bus.

Everything is still rumours, but it looks like AMD figured it out for RDNA 3. They did not need to implement it as much for CDNA 2, since most software running on it is already written to scale across multiple GPUs. That doesn't mean they won't do something similar for CDNA 3.
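For the CPU side of that analogy, this is roughly how software notices the difference (a minimal sketch using the standard Linux libnuma calls, nothing GPU-specific; build with -lnuma): a first-gen Threadripper typically reports multiple NUMA nodes, one per die with its own memory controller, while a Zen 2/3 part with the central I/O die typically reports just one.

#include <numa.h>      // standard libnuma header
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("no NUMA support exposed by this system\n");
        return 0;
    }
    // Zen 1 Threadripper: usually 2+ nodes. Zen 2/3 with a unified I/O die: usually 1.
    std::printf("NUMA nodes: %d\n", numa_max_node() + 1);
    return 0;
}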
 
Joined
Mar 10, 2010
Messages
11,880 (2.17/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Gskill Trident Z 3900cas18 32Gb in four sticks./16Gb/16GB
Video Card(s) Asus tuf RX7900XT /Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores laptop Timespy 6506
Let's say CDNA2 is similar to first-gen Threadripper, where full Zen 1 chips were put on the same socket. For the OS it was similar to having multiple sockets, since each CPU had its own memory controller.

From what we are hearing, RDNA 3 might look a bit more like Zen 2/3, where there is some kind of I/O die. In that case, part of one chip could act as a bridge similar to the I/O die, or there could be a bridge between the two chips that does the same job.

The main thing is how to handle the different memory zones. In Zen 1 Threadripper there are multiple memory controllers to deal with (although with NUMA the OS can still present them as one system). In Zen 2 Threadripper there is just one memory controller and NUMA is not needed.

If RDNA 3 has just one die with memory and the second accesses it via a bridge, or there is an I/O die that also holds the memory controller, it could be seen by the OS as one chip. There is also the question of how they communicate with the OS: whether it's hidden behind an I/O die or has to go through the first "master die" to reach the PCIe bus.

Everything is still rumours, but it looks like AMD figured it out for RDNA 3. They did not need to implement it as much for CDNA 2, since most software running on it is already written to scale across multiple GPUs. That doesn't mean they won't do something similar for CDNA 3.
Rumours have RDNA 3 taped out as well.
I could be getting this confused with RDNA 3, good point.

And the white paper is light on details.
 
Joined
Jul 9, 2015
Messages
3,464 (0.98/day)
System Name M3401 notebook
Processor 5600H
Motherboard NA
Memory 16GB
Video Card(s) 3050
Storage 500GB SSD
Display(s) 14" OLED screen of the laptop
Software Windows 10
Benchmark Scores 3050 scores good 15-20% lower than average, despite ASUS's claims that it has uber cooling.
I meant

In transistor count, the RX 6900 XT has 26.8 billion and this has 29 billion per chiplet (29 + 29 in total).
So how is that "4 6900s glued together", then? :D

I also recall that Intel's "glued together" comment didn't age well...
 

Sms

New Member
Joined
Nov 13, 2021
Messages
1 (0.00/day)
Does anyone know why the FP64 performance is the same as FP32? My understanding is that single precision can get a 2x speedup over double precision for free.
 
Joined
Apr 24, 2020
Messages
2,837 (1.61/day)
Does anyone know why the FP64 performance is the same as FP32? My understanding is that single precision can get a 2x speedup over double precision for free.

Because they designed it that way.

Usually, 32-bit performance is more important. However, it seems ORNL asked for double-precision performance.

It should be noted that CPUs usually do 64-bit scalar math at the same speed as 32-bit scalar, because the registers are already 64 bits wide.
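Back-of-envelope, using the published MI250X figures (so treat this as a sanity check rather than something from this thread): 14,080 stream processors × 2 FLOPs per FMA × ~1.7 GHz ≈ 47.9 TFLOPS, and because CDNA2's vector lanes retire a full FP64 FMA every clock just like FP32, that same arithmetic gives the same peak for both. The usual "single precision is 2x for free" intuition comes from parts where the FP64 path runs at half rate (or far less on consumer GPUs); here AMD spent the transistors on full-rate FP64, and packed FP32 can still reach roughly double that figure.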
 