Tuesday, March 27th 2018

NVIDIA Announces the DGX-2 System - 16x Tesla V100 GPUs, 30 TB NVMe Storage for $400K

The DGX-2 is likely the reason NVIDIA seems slightly less enamored with the consumer graphics card market as of late. Let's be honest: just look at that price tag, and imagine the rivers of money NVIDIA is making on each of these systems sold. The data center and deep learning markets have been pouring money into NVIDIA's coffers, and so the company is focusing its efforts on this space. Case in point: the DGX-2, which sports 1,920 TFLOPs of Tensor processing; 480 TFLOPs of FP16; half that value, at 240 TFLOPs, for FP32 workloads; and 120 TFLOPs of FP64.
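As a quick sanity check on those figures, the sketch below (our own back-of-the-envelope arithmetic, using only the numbers quoted above) divides the system-wide throughput back down to per-GPU values:

```python
# Back-of-the-envelope check of the quoted DGX-2 throughput figures.
# System-wide numbers are from the article; the per-GPU values are our
# own derivation, not an official NVIDIA spec sheet.
NUM_GPUS = 16

system_tflops = {
    "Tensor": 1920,  # deep-learning (mixed precision) throughput
    "FP16": 480,
    "FP32": 240,
    "FP64": 120,
}

for precision, tflops in system_tflops.items():
    print(f"{precision:>6}: {tflops:4d} TFLOPs system-wide "
          f"= {tflops / NUM_GPUS:5.1f} TFLOPs per V100")

# Prints 120 / 30 / 15 / 7.5 TFLOPs per GPU - note the clean halving at
# each step up in precision (FP16 -> FP32 -> FP64), which matches the
# V100's FP64 units running at half the FP32 rate.
```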

NVIDIA's DGX-2 builds upon the original DGX-1 in every way imaginable. NVIDIA pitches these as ready-to-deploy processing powerhouses: everything a prospective user with gargantuan processing needs can stand up in a single system. And the DGX-2 runs laps around the DGX-1 (which originally sold for $150K) in every respect: it features 16x 32 GB Tesla V100 GPUs (the DGX-1 featured 8x 16 GB Tesla GPUs); 1.5 TB of system RAM (the DGX-1 featured a paltry 0.5 TB); 30 TB of NVMe system storage (the DGX-1 sported 8 TB); and even a pair of Xeon Platinum CPUs (admittedly the smallest upgrade in the whole system).
The DGX-2 is made possible by NVIDIA's deployment of what it calls NVSwitch, which enables 300 GB/s chip-to-chip communication at 12 times the speed of PCIe. Paired with the company's NVLink 2, this allows sixteen GPUs to be grouped together in a single system, for total bandwidth exceeding 14 TB/s. NVIDIA touts this as a 2-petaflop-capable system, which isn't hard to imagine given the underlying hardware: 81,920 CUDA cores and 10,240 Tensor cores (the latter being what NVIDIA uses to reach that 2-petaflop figure, if you were wondering). The DGX-2 consumes power adequate to its innards - some 10 kW in operation - and the whole system weighs 350 pounds.
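Those aggregates are straight multiplication; a minimal sketch, assuming the published per-GPU Tesla V100 configuration (5,120 CUDA cores and 640 Tensor cores each):

```python
# Where the DGX-2's aggregate numbers come from: 16 GPUs times the
# published per-GPU Tesla V100 configuration.
NUM_GPUS = 16
V100_CUDA_CORES = 5120
V100_TENSOR_CORES = 640
V100_TENSOR_TFLOPS = 120  # per-GPU Tensor throughput implied by the article

print(f"CUDA cores:    {NUM_GPUS * V100_CUDA_CORES:,}")    # 81,920
print(f"Tensor cores:  {NUM_GPUS * V100_TENSOR_CORES:,}")  # 10,240
print(f"Tensor PFLOPs: {NUM_GPUS * V100_TENSOR_TFLOPS / 1000:.2f}")  # 1.92, i.e. the ~2 petaflop claim
```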
Some of NVIDIA's remarks about this system follow:

NVSwitch: A Revolutionary Interconnect Fabric
NVSwitch offers 5x higher bandwidth than the best PCIe switch, allowing developers to build systems with more GPUs hyperconnected to each other. It will help developers break through previous system limitations and run much larger datasets. It also opens the door to larger, more complex workloads, including modeling parallel training of neural networks.

NVSwitch extends the innovations made available through NVIDIA NVLink, the first high-speed interconnect technology developed by NVIDIA. NVSwitch allows system designers to build even more advanced systems that can flexibly connect any topology of NVLink-based GPUs.

NVIDIA DGX-2: World's First Two Petaflop System
NVIDIA's new DGX-2 system reached the two petaflop milestone by drawing from a wide range of industry-leading technology advances developed by NVIDIA at all levels of the computing stack.

DGX-2 is the first system to debut NVSwitch, which enables all 16 GPUs in the system to share a unified memory space. Developers now have the deep learning training power to tackle the largest datasets and most complex deep learning models.

Combined with a fully optimized, updated suite of NVIDIA deep learning software, DGX-2 is purpose-built for data scientists pushing the outer limits of deep learning research and computing. DGX-2 can train FAIRSeq, a state-of-the-art neural machine translation model, in less than two days - a 10x improvement in performance from the DGX-1 with Volta, introduced in September.
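To make the "unified memory space" claim concrete, here is a minimal, hypothetical PyTorch sketch of how such a box presents itself to software. We have not run this on a DGX-2; the 16-device count and all-pairs peer access are assumptions based on NVIDIA's description, not NVIDIA's DGX software stack:

```python
# Hypothetical view of an NVSwitch-connected, 16-GPU system from a
# framework's perspective. Requires a CUDA build of PyTorch; this is an
# illustration of the concept, not NVIDIA's own tooling.
import torch

n = torch.cuda.device_count()  # assumed to report 16 on a DGX-2
print(f"{n} CUDA devices visible")

# With NVSwitch, every GPU should be able to access every other GPU's
# memory directly, so peer access should hold for all device pairs.
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i} can directly access GPUs {peers}")

# When peer access is enabled, a device-to-device copy like this moves
# over NVLink/NVSwitch instead of bouncing through host memory.
x = torch.randn(1024, 1024, device="cuda:0")
y = x.to("cuda:1")
```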
Source: AnandTech

28 Comments on NVIDIA Announces the DGX-2 System - 16x Tesla V100 GPUs, 30 TB NVMe Storage for $400K

#1
cucker tarlson
I'm not much into this new tech and only care about the GeForce lineup, but holy crap, two petaflops of deep learning and 240 teraflops of FP32 make my head spin.
Posted on Reply
#2
TheGuruStud
Nvidia calculates their prices with Volta.
Posted on Reply
#3
TheoneandonlyMrK
500x in 5 years, oh and an extra £399,000 too, what the actual p
I belly laughed at this.
Posted on Reply
#4
cucker tarlson
theoneandonlymrk500x in 5 years, oh and an extra £399,000 too, what the actual p
I belly laughed at this.
margin of error difference :p

500x the performance for 400x the price, come one come all, this is a once-in-a-lifetime opportunity.
Posted on Reply
#5
xorbe
That would be a very real problem if nVidia were to ever abandon the gaming market for higher margin segments.
Posted on Reply
#6
Vayra86
Y'all know there is only one question here.
xorbeThat would be a very real problem if nVidia were to ever abandon the gaming market for higher margin segments.
You have a good sense of humour, I like it, but no, that won't happen anytime soon. Gaming is a cash cow and deep learning is a new venture.
Posted on Reply
#7
the54thvoid
Super Intoxicated Moderator
Whether you love them or loathe them, Nvidia makes some pretty mental shit. I always thought DGX was for driving stuff and I'm thinking, how does that fit in a car..... Think I'll stick to making crayon drawings.
Posted on Reply
#9
R-T-B
MasterInvader"But can it run Crysis"
Yes.
Posted on Reply
#10
_JP_
MasterInvader"But can it run Crysis"
So fast you don't even have to play it, it will show you how lousy you are at it without input, @4K120fps.
Posted on Reply
#11
Fouquin
So to achieve the rated 2 PF speeds you need to lock into nVidia's ecosystem with their proprietary Tensor cores. I guess this is great for organizations already on nVidia's plan. For everyone else, would the AMD/Inventec P47 rack (1 PF full, 2 PF half) be a more enticing offer from a deployment standpoint, considering its standard mix of x86-64 and GPGPU?
Posted on Reply
#14
Fluffmeister
R-T-BThe Uber accident was completely the pedestrian's fault, but I digress...
Come on... play along. When humans get behind the wheel there are never any accidents. AI (and Nvidia) are a danger to car drivers and pedestrians all over the world.
Posted on Reply
#15
Xzibit
R-T-BThe Uber accident was completely the pedestrian's fault, but I digress...
Probably she wasn't inside a crosswalk.

She was crossing from left to right, and walked across one car lane before she got hit in the other lane.

The car was equipped with 360° lidar, with additional lidar in the front, and yet it failed to detect her in order to slow down or avoid her.

According to police reports, it made no attempt to brake before the collision.

Something obviously went wrong.
Posted on Reply
#16
Fluffmeister
XzibitSomething obviously went wrong.
No shit. Titanic, Hindenburg, Challenger, Columbia...

Something obviously went wrong. Come on, be a little less jaded Nvidia Master.
Posted on Reply
#17
boredsysadmin
FouquinSo to achieve the rated 2PF speeds you need to lock into nVidia's ecosystem with their proprietary Tensor cores. I guess this is great for organizations already on nVidia's plan. For everyone else, would the AMD/Inventec P47 rack (1PF full, 2PF half) be a more enticing offer from a deployment standpoint, considering it's standard mix of x86-64 and GPGPU?
First of all, AMD Project 47 is 1 PF full and 1/2 PF half. So to reach 2 PF you'd need two racks. Compare that to the approx. 8U of the DGX-2. Power usage for two racks of P47: 66 kW, vs. 10 kW for a single DGX-2.
Cost? I will absolutely guarantee you that two full racks of 40 servers with top-end CPUs, GPUs, and some local storage would cost SIGNIFICANTLY more than $400k. I guarantee it. I couldn't find a single estimate, but AMD compares it to IBM's original Roadrunner, which cost around $100m; AMD's would cost "much less", which means at least a few million.
I'm no nVidia fanboi, but the numbers don't stack up well for AMD here. Sorry.
Posted on Reply
#18
Xzibit
boredsysadminFirst of all, AMD Project 47 is 1 PF full and 1/2 PF half. So to reach 2 PF you'd need two racks. Compare that to the approx. 8U of the DGX-2. Power usage for two racks of P47: 66 kW, vs. 10 kW for a single DGX-2.
Cost? I will absolutely guarantee you that two full racks of 40 servers with top-end CPUs, GPUs, and some local storage would cost SIGNIFICANTLY more than $400k. I guarantee it. I couldn't find a single estimate, but AMD compares it to IBM's original Roadrunner, which cost around $100m; AMD's would cost "much less", which means at least a few million.
I'm no nVidia fanboi, but the numbers don't stack up well for AMD here. Sorry.
DGX-2 GPU Throughput:
FP16: 480 TFLOPs
FP32: 240 TFLOPs
FP64: 120 TFLOPs
Tensor (Deep Learning): 1.92 PFLOPs

Project 47 GPU Throughput:
FP16: 1.96 PFLOPs
FP32: 984 TFLOPs
Posted on Reply
#19
the54thvoid
Super Intoxicated Moderator
XzibitProbably she wasn't inside a crosswalk.

She was crossing from left to right, and walked across one car lane before she got hit in the other lane.

The car was equipped with 360° lidar, with additional lidar in the front, and yet it failed to detect her in order to slow down or avoid her.

Something obviously went wrong.
The makers of the lidar and radar say that Uber disabled their system. Uber has not commented on this, but they did write the software.
Posted on Reply
#20
Bytales
The question is not whether it can run Crysis, but whether it can mine alt-coins/shitcoins worth less than a CENT each!
Posted on Reply
#21
renz496
xorbeThat would be a very real problem if nVidia were to ever abandon the gaming market for higher margin segments.
That will never happen. Sure, those Teslas are super expensive per unit. But no matter how expensive they are, the revenue NVIDIA gets from its gaming segment still easily eclipses the revenue from selling Quadros and Teslas. In fact, NVIDIA has already admitted that the majority of its R&D spending is sustained by gaming revenue. They were able to push deep learning to where it is right now thanks to the millions of gamers around the world buying GPUs every year. Right now the new buzz is "real time ray tracing" in games (back in 2009 it was tessellation). Like it or not, gamers will buy new GPUs so they can run this feature with acceptable performance.
Posted on Reply
#22
cucker tarlson
XzibitDGX-2 GPU Throughput:
FP16: 480 TFLOPs
FP32: 240 TFLOPs
FP64: 120 TFLOPs
Tensor (Deep Learning): 1.92 PFLOPs

Project 47 GPU Throughput:
FP16: 1.96 PFLOPs
FP32: 984 TFLOPs
lol
you mean this

[image]

vs this

[image]

Also, nvidia has Saturn V with 80 petaflops FP32 and 660 petaflops AI.
Posted on Reply
#23
jabbadap
Well yeah, but s/he is correct. You don't need two of them to beat one DGX-2 station (by pure FP16/FP32 TFLOPs). Sure, Project 47 does not have Tensor cores or full FP64 compute power (full rate is 1/2 FP32; the MI25 does 1/16 FP32), so it can't beat the DGX-2 in all compute tasks.
Posted on Reply
#24
Vya Domus
boredsysadminFirst of all AMD Project 47 is 1PT FULL and 1/2 PF half. So to reach 2PF you'd need 2 racks. Compare it approx 8U for DGX-2. Power usage for 2 racks of P47th - 66kw vs 10kW for a single DGX-2
Cost? I will absolutely guarantee you that two full racks or 40 servers with top-end CPU, GPU, and some local storage would cost SIGNIFICANTLY more than 400k. I guarantee it. I couldn't find a single estimation, but AMD compares it original IBM's Roadrunner which cost around $100m, AMD would cost "much less", means at least few millions.
I'm not nVidia fanboi, but numbers don't stack well for AMD here. sorry.
One should take a better look at these things before drawing such a conclusion; Nvidia is very specific with its wording. Those 2 PFLOPs come with the help of Tensor cores, and it should go without saying that this sort of performance is not fully comparable with traditional heterogeneous computing.

Do not live under the false impression that Nvidia has some sort of magic sauce that no one else can conjure up. It's just a lot of dedicated silicon designed for a specific set of tasks. I can guarantee you that in the majority of cases AMD's traditional system is faster and more cost effective, while Nvidia's only truly crushes it under very, very specific scenarios. Notice how Jensen talks about this stuff pretty much exclusively within the context of CNNs and that sort of thing, because that's really the only area they've focused on.
Posted on Reply
#25
Patriot
Vya DomusOne should take a better look at these things before drawing such a conclusion; Nvidia is very specific with its wording. Those 2 PFLOPs come with the help of Tensor cores, and it should go without saying that this sort of performance is not fully comparable with traditional heterogeneous computing.

Do not live under the false impression that Nvidia has some sort of magic sauce that no one else can conjure up. It's just a lot of dedicated silicon designed for a specific set of tasks. I can guarantee you that in the majority of cases AMD's traditional system is faster and more cost effective, while Nvidia's only truly crushes it under very, very specific scenarios. Notice how Jensen talks about this stuff pretty much exclusively within the context of CNNs and that sort of thing, because that's really the only area they've focused on.
Vega 7nm later this year should have tensor cores as well. But yes, they are well over a year behind on tensor compute. I have been keeping tabs on ROCm development for the past few years, and it has been making major strides toward a competitive ecosystem. It's clearly not there yet, but it is hard to close a gap that has been 5+ years in the making. Nvidia and Intel are both absent from Gen-Z, and are basically showing themselves to be taking on the rest of the consortium.
Posted on Reply