Tuesday, June 5th 2018

AMD Demonstrates 7nm Radeon Vega Instinct HPC Accelerator

AMD demonstrated the world's first GPU built on the 7 nanometer silicon fabrication process: a Radeon Vega Instinct HPC/AI accelerator with a 7 nm GPU based on the "Vega" architecture at its heart. The chip is an MCM of a 7 nm GPU die and 32 GB of HBM2 memory across four stacks (4096-bit memory bus width). It is also the first product to feature a removable InfinityFabric interface (a competitor to NVIDIA's NVLink), and variants based on the common PCI-Express 3.0 x16 interface will follow. The card supports hardware virtualization and new deep-learning ops.

100 Comments on AMD Demonstrates 7nm Radeon Vega Instinct HPC Accelerator

#76
T4C Fantasy
CPU & GPU DB Maintainer
Vayra86Yes, I think Hawaii was the wake up call for AMD because it showed an architecture that really was no longer up to snuff for gaming purposes (too hot, too hungry). Fury X was the HBM test case project as a double edged blade to use in the high end gaming segment for one last time, and Vega represents the completed U-turn to new marketplaces and segments (IGP as well), the 56 and 64 gaming versions of it are just bonus.

'Refinements'... that's just a rebrand on a smaller node, right? :P
Yeah, well, AMD's wording can be deceiving, but that's definitely the case for Nvidia too xD
Posted on Reply
#77
TheinsanegamerN
T4C FantasyConsidering the clock speed is 1200 MHz on a 4096-bit bus, it actually does matter; that's, I think, 300 MHz higher, with twice the bus width, than Vega 10.

And Fury was like 500 MHz with 4 GB of HBM. Huge difference.
Like I said, hardware numbers don't matter; performance does.

If this chip clocks at 1200 MHz but only hits 1080 Ti performance after the 1100 series is out, it will be another flop, because 1200 MHz doesn't matter if the chip can't deliver. And after the 300, 400, 500, and Vega series, I'm not holding my breath that AMD is going to compete well.

This chip looks promising, but so did Vega 64, and we all saw how that went. If this new chip is only at 1080 Ti level, my 480's replacement will most likely have to be an NVIDIA chip instead of an AMD one, and I like my 480, so I'd prefer another AMD GPU.
Posted on Reply
#78
bug
TheinsanegamerNLike I said, hardware numbers don't matter; performance does.

If this chip clocks at 1200 MHz but only hits 1080 Ti performance after the 1100 series is out, it will be another flop, because 1200 MHz doesn't matter if the chip can't deliver. And after the 300, 400, 500, and Vega series, I'm not holding my breath that AMD is going to compete well.

This chip looks promising, but so did Vega 64, and we all saw how that went. If this new chip is only at 1080 Ti level, my 480's replacement will most likely have to be an NVIDIA chip instead of an AMD one, and I like my 480, so I'd prefer another AMD GPU.
Actually, the initial Vega looked rather lame, but then people couldn't stress enough how that was just the "professional" SKU and the gaming-oriented one would be tweaked and come with magic drivers.
To this day I can't figure out: if Vega is so great, why hasn't it been turned into a mid-range SKU to crush both the GTX 1060 and the RX 480/580?
Posted on Reply
#79
bug
robbYou are an idiot. The effective bandwidth is what matters, and it's irrelevant how it achieves that.

2048 cores
128 tmus
64 rops
128 bit bus
16000 mhz memory speed

2048 cores
128 tmus
64 rops
256 bit bus
8000 mhz memory speed


BOTH of those would perform exactly the same given the same architecture and the same clock speeds. The fact that one uses a 128-bit bus and the other a 256-bit bus is 100% irrelevant if the effective memory bandwidth is the same and everything else is equal.
Technically you're right. What others were trying to say (I think) is that running at a higher frequency tends to mean less latency, which can influence some apps. But that's just one very specific instance.
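To make that concrete, here is a minimal Python sketch of the arithmetic, using the hypothetical numbers from the quote above (no real products implied):

```python
# Effective bandwidth = bus width x per-pin data rate / 8 bits per byte.
# The two hypothetical cards from the quote differ in bus width and
# memory speed, but the product is identical.

def bandwidth_gb_s(bus_width_bits: int, data_rate_mt_s: int) -> float:
    """Effective memory bandwidth in GB/s."""
    return bus_width_bits * data_rate_mt_s / 8 / 1000

print(bandwidth_gb_s(128, 16000))  # 256.0 GB/s (128-bit bus, "16000 MHz")
print(bandwidth_gb_s(256, 8000))   # 256.0 GB/s (256-bit bus, "8000 MHz")
```

Both work out to 256 GB/s, which is the sense in which the bus width alone is irrelevant.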
Posted on Reply
#80
londiste
Bandwidth is roughly bus width multiplied by memory speed (transfer rate). HBM so far is much slower but sits on a much wider bus.
Video RAM is far more sensitive to bandwidth than it is to latency.

The choice between HBM2, GDDR5 or GDDR5X is not about performance. Given the right combination of speed and bus width, they can all get similar enough results.
The considerations are about cost, space on the PCB/interposer, power consumption, etc.
bugTechnically you're right. What others were trying to say (I think) is that running at a higher frequency tends to mean less latency, which can influence some apps. But that's just one very specific instance.
That depends on what frequency we are talking about. The frequency at which you can request data from GDDR5(X) and HBM is actually (roughly) the same; GDDR5(X) can just transfer data back a lot (8-16 times) faster than it can be requested (per pin, or effectively per same width of memory bus).
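A rough Python sketch of that claim (the data rates and prefetch depths are the typical figures used later in this thread; treat them as illustrative assumptions): divide the per-pin data rate by the prefetch depth and all three memory types land in the same ballpark.

```python
# Request rate ~= per-pin data rate / prefetch depth.
# Illustrative figures: 8 Gbps GDDR5 (8n prefetch), 11 Gbps GDDR5X (16n),
# 1.89 Gbps HBM2 (2n).

memories = {
    "GDDR5":  (8.00, 8),
    "GDDR5X": (11.00, 16),
    "HBM2":   (1.89, 2),
}

for name, (gbps_per_pin, prefetch) in memories.items():
    print(f"{name}: ~{gbps_per_pin / prefetch:.3f} GHz request rate")

# GDDR5:  ~1.000 GHz
# GDDR5X: ~0.688 GHz
# HBM2:   ~0.945 GHz  -> all near 1 GHz despite 4-6x gaps in data rate
```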
Posted on Reply
#81
bug
Again, this is all theoretical. If the same bandwidth is achieved by two cards, but one runs the memory at 100 MHz while the other runs at 1,000 MHz, the latter can have a tenth of the former's latency (assuming the data is already available to read on the next clock cycle - it usually isn't).
At least that's my understanding/guess about what the previous posters were trying to say.
Posted on Reply
#82
londiste
Memory frequency can be directly compared on the same type of memory.
The spec used across different types is the data transfer rate per pin (per each bit of memory bus).

When we are talking about bandwidth, that is simple. When we are talking about latency, it gets more complicated. While the actual memory reads are in the same range (actual DRAM clock speeds should remain around 1 GHz), word sizes differ - 16b/32b/64b for GDDR5(X) and 128b/256b for HBM2 - as do addressing and read delays. I do not remember reading anything in-depth about that, and testing it is not easy.

Take these three cards for example:
- Vega 64: gaming.radeon.com/en/product/vega/radeon-rx-vega-64/
- GTX 1080Ti: www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
- RX 580: www.amd.com/en/products/graphics/radeon-rx-580

Vega 64 (HBM2):
Memory Data Rate: 1.89Gbps
Memory Speed: 945MHz
Memory Interface: 2048-bit
Memory Bandwidth: Up to 484GB/s

945MHz memory on a 2048-bit memory bus. The memory data rate is 1.89Gbps - 945MHz at dual data rate.
Memory bandwidth is up to 484 GB/s = 1.89 Gbps x 2048 bits of bus / 8 bits per byte

GTX 1080Ti (GDDR5X):
Memory Speed: 11 Gbps
Memory Interface Width: 352-bit
Memory Bandwidth (GB/sec): 484

Memory bandwidth is up to 484 GB/s = 11 Gbps x 352 bits of bus / 8 bits per byte

Note that they do not list the memory clock in the specs; "memory speed" is actually the data-rate spec. This is because there are multiple clocks in GDDR5(X) and none of them is descriptive in a simple way. Knowing what the memory does, let's see what we can deduce about the speeds.
GDDR5X is either dual or quad data rate, so at maximum the I/O bus is running at a quarter of the data rate - 11 Gbps / 4 = 2.75 GHz.
Data is requested and read at a quarter of the I/O bus frequency (thanks to 16n prefetch) - 2.75 GHz / 4 = 0.687 GHz.

Prefetch on GDDR5 means that where the memory needs to transfer out 32 bits of data at a time, it can internally load 8 (or 16 for GDDR5X) times as much from DRAM into prefetch buffers, and for the next 8 (or 16) transfers it sends data from the buffer instead of going back to load more.

RX580 (GDDR5):
Memory speed (effective): 8 Gbps
Memory Interface: 256-bit
Max. Memory Bandwidth: 256 GB/s

Memory bandwidth is up to 256 GB/s = 8 Gbps x 256 bits of bus / 8 bits per byte

Again, "memory speed" is actually the data-rate spec.
GDDR5 is dual data rate, so the I/O bus is running at half the data rate - 8 Gbps / 2 = 4 GHz.
Data is requested and read at a quarter of the I/O bus frequency (thanks to 8n prefetch) - 4 GHz / 4 = 1 GHz.
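For what it's worth, the arithmetic above condenses into a few lines of Python (vendor figures as quoted; the data-rate multiples, 2x for DDR and 4x for QDR, are the ones derived above):

```python
# Bandwidth = per-pin data rate x bus width / 8 bits per byte.
# I/O bus clock = per-pin data rate / data-rate multiple (DDR=2, QDR=4).

cards = {
    #                    Gbps/pin, bus bits, data-rate multiple
    "Vega 64 (HBM2)":    (1.89, 2048, 2),
    "GTX 1080 Ti (G5X)": (11.0, 352, 4),
    "RX 580 (GDDR5)":    (8.0, 256, 2),
}

for name, (rate, bus, mult) in cards.items():
    print(f"{name}: {rate * bus / 8:.0f} GB/s, I/O bus at {rate / mult:g} GHz")

# Vega 64 (HBM2):    484 GB/s, I/O bus at 0.945 GHz
# GTX 1080 Ti (G5X): 484 GB/s, I/O bus at 2.75 GHz
# RX 580 (GDDR5):    256 GB/s, I/O bus at 4 GHz
```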
Posted on Reply
#84
T4C Fantasy
CPU & GPU DB Maintainer
Sigma957AMD already said that there will be no gaming 7nm VEGA

community.amd.com/thread/229155

So stop dreaming :-D
Again, another misleading title. Lisa didn't confirm or deny it; everyone is speculating. I saw the Computex presentation.
Posted on Reply
#85
efikkan
robbYou are an idiot. The effective bandwidth is what matters, and it's irrelevant how it achieves that.
<snip>
BOTH of those would perform exactly the same given the same architecture and the same clock speeds. The fact that one uses a 128-bit bus and the other a 256-bit bus is 100% irrelevant if the effective memory bandwidth is the same and everything else is equal.
I just want to add: theoretically they will perform similarly, but not necessarily exactly the same; it all depends on the GPU architecture. GPU memory controllers are currently structured as multiple 64-bit controllers, each of which can only communicate with one cluster/GPC at a time. Having fewer, faster memory controllers would require faster scheduling to keep up, while having too many controllers increases the risk of congestion on one of them. So it all comes down to a balancing act: how the clusters, memory controllers and scheduler work together. Simply making a major change to one of them without redesigning the others will create bottlenecks.
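As a very loose illustration of that balancing act (a hypothetical interleaving scheme, not any real GPU's): memory requests are typically striped across independent channels by address bits, so the channel count interacts with the access pattern.

```python
# Hypothetical sketch: striping requests across N independent memory
# channels by address. Fewer, faster channels must each absorb a higher
# request rate; many channels risk congestion when an unlucky access
# stride maps everything onto one of them.

from collections import Counter

def channel_of(address: int, num_channels: int, interleave: int = 256) -> int:
    """Map an address to a channel, interleaving on 'interleave'-byte blocks."""
    return (address // interleave) % num_channels

def load(addresses, num_channels):
    return Counter(channel_of(a, num_channels) for a in addresses)

linear = range(0, 1 << 20, 64)        # streaming reads
unlucky = range(0, 1 << 20, 256 * 8)  # stride = interleave x channel count
print(load(linear, 8))   # equal counts across all 8 channels
print(load(unlucky, 8))  # every request lands on channel 0 -> congestion
```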

It might be wise to distinguish between theoretical specs and actual performance. Just look at:
Vega 64: 4096 cores, 10215 GFlop/s, 483.8 GB/s.
GTX 1080: 2560 cores, 8228 GFlop/s, 320.2 GB/s.
I wonder which one performs better…
Compared to:
GTX 1080 Ti: 3584 cores, 10609 GFlop/s, 484.3 GB/s.
As we all know, Vega 64 has resources comparable to the GTX 1080 Ti, so it's not a lack of resources, as many AMD fans claim, but a lack of proper resource management.
In conclusion, theoretical specs might be similar on paper, but their actual performance will depend on the complete design.
bugAgain, this is all theoretical. If the same bandwidth is achieved by two cards, but one runs the memory at 100 MHz while the other runs at 1,000 MHz, the latter can have a tenth of the former's latency (assuming the data is already available to read on the next clock cycle - it usually isn't).
If I may add, memory latency is substantial: for DDR it's 50-60 ns per access, and even more for GDDR. When you're talking about speeds of 1000 MHz and above, the latency contribution of the clock becomes negligible, and higher clocks more or less just add bandwidth.
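A quick back-of-the-envelope check of that point (55 ns is the midpoint of the 50-60 ns figure above; the clocks are arbitrary examples):

```python
# Total access time ~= fixed DRAM access latency + one clock period.
# (A deliberate simplification, just to show the proportions.)

ACCESS_NS = 55.0  # midpoint of the 50-60 ns quoted above

for clock_mhz in (100, 1000):
    period_ns = 1000 / clock_mhz
    print(f"{clock_mhz} MHz: period {period_ns:.1f} ns, "
          f"total ~{ACCESS_NS + period_ns:.1f} ns")

# 100 MHz:  period 10.0 ns, total ~65.0 ns
# 1000 MHz: period 1.0 ns,  total ~56.0 ns
# A 10x clock difference moves total latency by ~14%, not 10x; the
# higher clock mostly buys bandwidth, as the post argues.
```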
Sigma957AMD already said that there will be no gaming 7nm VEGA

So stop dreaming :-D
AMD has said all along that Vega 20 is targeting the "professional" market.
AMD can certainly change their mind, but as of this moment their position is unchanged. But why would they bring Vega 20 to the consumer market? It only scales well with simple compute workloads, and Vega 20 is a GPU built for full fp64 support, which has no relevance for consumers.
Posted on Reply
#86
InVasMani
londisteBandwidth is roughly bus width multiplied by memory speed (transfer rate). HBM so far is much slower but sits on a much wider bus.
Video RAM is far more sensitive to bandwidth than it is to latency.

The choice between HBM2, GDDR5 or GDDR5X is not about performance. Given the right combination of speed and bus width, they can all get similar enough results.
The considerations are about cost, space on the PCB/interposer, power consumption, etc.

That depends on what frequency we are talking about. The frequency at which you can request data from GDDR5(X) and HBM is actually (roughly) the same; GDDR5(X) can just transfer data back a lot (8-16 times) faster than it can be requested (per pin, or effectively per same width of memory bus).
True latency is what really matters. Minimal latency without enough bandwidth isn't good, and tons of bandwidth with too-high latency isn't good either. That said, synchronization comes into play and can make latency or bandwidth look better at times for certain applications and use cases. I wonder if GDDR5 suffers from more texture flickering/pop-in than HBM, since it does more sequential burst reads, while HBM seems better suited to random reads that stream in data on the fly. Vega's HBM is better load-balanced, but its latency and bandwidth are a problem. The bus width wasn't wide given the HBM clock-speed scaling, for starters, which is why Fury ended up with better bandwidth overall. I think the wider bus and higher HBM clock speed will certainly help a fair amount in an apples-to-apples comparison with its predecessor. I'd have liked to see more of a bump in the HBM clock speed, but it's at least a step in the right direction. Perhaps the HBM will overclock better this time around, or a clock refresh on the HBM itself isn't too far off.
bugAgain, this is all theoretical. If the same bandwidth is achieved by two cards, but one runs the memory at 100 MHz while the other runs at 1,000 MHz, the latter can have a tenth of the former's latency (assuming the data is already available to read on the next clock cycle - it usually isn't).
At least that's my understanding/guess about what the previous posters were trying to say.
As far as a pure clock-for-clock comparison goes, yes, that's true. That's why minimum frame rates improve the most with memory overclocking: they are the most latency-sensitive, and yes, synchronization comes into play. The minimum frame rate is a big deal for 4K at the moment, since GPUs haven't trivialized it there quite yet, like they have at 1080p and even 1440p.
efikkanIt might be wise to distinguish between theoretical specs and actual performance. Just look at:
Vega 64: 4096 cores, 10215 GFlop/s, 483.8 GB/s.
GTX 1080: 2560 cores, 8228 GFlop/s, 320.2 GB/s.
I wonder which one performs better…
Compared to:
GTX 1080 Ti: 3584 cores, 10609 GFlop/s, 484.3 GB/s.
As we all know, Vega 64 has resources comparable to the GTX 1080 Ti, so it's not a lack of resources, as many AMD fans claim, but a lack of proper resource management.
In conclusion, theoretical specs might be similar on paper, but their actual performance will depend on the complete design.
TMUs/ROPs and pixel clock speed also come into play. If I remember right, Vega has fewer of one of those, and the competition has lower overall latency because the GDDR5(X) clock speed is much higher; even with the narrower bus it's still better on a per-bit, per-clock basis, if I'm not mistaken. And that's without even getting into the compression advantage.
Posted on Reply
#87
svan71
I've learned not to get excited about AMD video card performance claims. They will not take the performance crown from Nvidia :(
Posted on Reply
#88
R0H1T
bugAgain, this is all theoretical. If the same bandwidth is achieved by two cards, but one runs the memory at 100 MHz while the other runs at 1,000 MHz, the latter can have a tenth of the former's latency (assuming the data is already available to read on the next clock cycle - it usually isn't).
At least that's my understanding/guess about what the previous posters were trying to say.
Are we talking about effective memory speed, like 16000 MHz for GDDR5X, or actual speeds? I think I asked this before ~ but we don't have latency numbers to compare HBM vs GDDR6 or GDDR5X, do we? The VRAM does QDR, so it isn't actually running at 8000 MHz or whatever, unlike desktop memory, which is DDR but runs at similar speeds.
Posted on Reply
#90
bug
medi01Which GPU are you using now, the author of 3 posts?


Yeah, that 290x merely beating nVidia Titan was sooo bad.


That fanboy virtual reality bubble never ceases to amaze me:


www.tomshardware.com/reviews/radeon-r9-290x-hawaii-review,3650-29.html
Yeah, that card wasn't power hungry at all: www.techpowerup.com/reviews/AMD/R9_290X/28.html
Oh wait, silly me, it was offering performance at an affordable price: www.techpowerup.com/reviews/AMD/R9_290X/29.html

Though, to be honest, the 280 and 285 were really good cards. But the 290X was clearly pushed beyond that architecture's sweet spot.
Posted on Reply
#92
Blueberries
AMD is exhausting their last die shrink until 2030 to achieve the performance of Pascal
Posted on Reply
#93
efikkan
BlueberriesAMD is exhausting their last die shrink until 2030 to achieve the performance of Pascal
TSMC 7 nm is comparable to Intel 10 nm, so we can expect at least one more shrink before it stops.

But your point stands; this might be the last "good shrink" for a long time. Moving an inefficient architecture to a new node will not make it great, and since the competition will also move to the new node, the gap is only going to increase.
Posted on Reply
#94
bug
efikkanTSMC 7 nm is comparable to Intel 10 nm, so we can expect at least one more shrink before it stops.

But your point stands; this might be the last "good shrink" for a long time. Moving an inefficient architecture to a new node will not make it great, and since the competition will also move to the new node, the gap is only going to increase.
That may not be so bad. Intel invented tick-tock precisely because moving to a new node while simultaneously implementing a new architecture has traditionally been too challenging. It may be more cost-effective for AMD (and in turn for us) to move to 7 nm first and change the architecture later, even if that means we'll be stuck with Vega for a while longer.
Posted on Reply
#95
efikkan
bugThat may not be so bad. Intel invented tick-tock precisely because moving to a new node while simultaneously implementing a new architecture has traditionally been too challenging. It may be more cost-effective for AMD (and in turn for us) to move to 7 nm first and change the architecture later, even if that means we'll be stuck with Vega for a while longer.
Sure, which is basically what they are already doing with the Vega 20.

You see people arguing that a shrunk Vega will matter, but we know it wouldn't even come close to Pascal in efficiency. AMD isn't even planning a consumer Vega on 7 nm.
Posted on Reply
#96
bug
efikkanSure, which is basically what they are already doing with the Vega 20.

You see people arguing that a shrunk Vega will matter, but we know it wouldn't even come close to Pascal in efficiency. AMD isn't even planning a consumer Vega on 7 nm.
Eh, like anyone cares about matching Pascal two years after Pascal's release... The only way for people to be interested would be if the cards sold for a lot less than their Pascal counterparts. Thanks to Vega being stuck with HBM, that's not going to happen.
I was strictly talking about the "moving an inefficient architecture to a new node will not make it great" part - it will not make it great, but maybe that's not the point. Though God knows what AMD is thinking.
Posted on Reply
#97
jdubo
Prima.Vera????
My thoughts exactly... ???
Posted on Reply
#98
RichF
svan71I've learned not to get excited about AMD video card performance claims. They will not take the performance crown from Nvidia :(
AMD being small is unfortunate for competition because the company tries to create a Jack-of-All-Trades design. While this can be profitable for the company, as we've seen with Zen, it's not the path to class-leading performance. When your competitors have more money than you do they have more luxuries. Money = luxury budget. It's no different from the "real world" experience of people. It's why Xerox could afford to have PARC, for a time. For most people, gaming is a luxury business. Capturing PC gaming market profits is good, as is having robust pioneering R&D (like a PARC) but the more luxurious something is (distance from the Jack-of-All-Trades middle), the harder it will be to convince a board to approve it. This is compounded as the company shrinks in comparison with its competitors.

Zen has been a huge win but it came with compromises over pure performance in the PC enthusiast space, like low clocks/minimal overclocking and latency. The node is a big factor but Zen cores could have been designed for more performance at higher energy consumption and increased die space. AMD specifically said it wanted to make a core that could scale all the way from very low-wattage mobile to high-performance. That's nice but the word "scale" isn't a magic wand that erases the benefit of having more targeted designs. Those cost more money, though, to make and usually involve more risk. The key with Zen was to do just enough and keep it affordable. The high efficiency of the SMT was really a big design win, which is a bit droll considering its bet on CMT previously.

Gaming enthusiasts also are competing with AI/science/workstation/supercomputing, crypto, the laptop space, "consoles", etc. AMD wants one core that can do it all but what we would like is a special core for each niche — like a big core x86 with a loose library for high performance instead of just bigger core count. If AMD were bigger and richer it could do more of that. Crypto is probably more of a target for AMD's designers than gaming, outside of the console space. Crypto seems to fit better with the AI/science/workstation/supercomputing "pro compute" area. And, it, not PC gaming, holds the promise of demand higher than supply.

I wonder, for CPUs, if big.LITTLE will become a desktop trend. It seems to offer the benefits of turbo to a greater degree: create several more powerful cores and supplement them with smaller ones. I'm sure articles have been written on this. But maybe we'll someday see a desktop CPU with two large high-performance cores surrounded by smaller ones. Of course, like the enthusiast-grade gaming GPU, the enthusiast gamer x86 CPU seems to be mostly a thing of the past, or an afterthought at best. What advantage do those two big cores have for servers, for instance? I suppose it could be useful for some pro markets, though, like financial analysis (e.g. high-speed trading). With that, though, I don't see why supercomputers can't be harnessed. Is it really necessary to focus on a high-clock micro for those systems? Is the latency that bad?
Posted on Reply
#99
londiste
RichFI wonder, for CPUs, if big.LITTLE will become a desktop trend. It seems to offer the benefits of turbo to a greater degree: create several more powerful cores and supplement them with smaller ones. I'm sure articles have been written on this.
It won't. Scaling on current desktop CPUs is good enough: with their clock range and power gating they run from 100+ watts at full load down to a couple of watts at idle. Power and cooling at this level are not a problem for a desktop computer. Things are different in the (ultra)portable space, where tenths of a watt matter a lot, especially for devices running on battery (read: phones).
Posted on Reply