Wednesday, September 30th 2020
Intel Partners with Heidelberg University Computing Center to Establish oneAPI Academic Center of Excellence
Intel and Heidelberg University Computing Center (URZ) today announced that they have established a oneAPI Academic Center of Excellence (CoE) at URZ. The new CoE aims to further develop Intel's oneAPI standard and enable it to work on AMD GPUs. That may come as a surprise, but Intel believes the technology should work across a wide range of processors, regardless of vendor; heterogeneous hardware programming is the main goal here. In a Twitter thread, an Intel employee noted that Intel has also been working with Arm and NVIDIA to bring Data Parallel C++ (DPC++), the core of oneAPI, to those vendors as well. That would extend this universal programming model to every device and platform, which is the point of heterogeneous programming: whether you need to program a CPU, a GPU, or some other ASIC, it is covered by a single API, in this case oneAPI.
Sources:
URZ, David Schor (Twitter)
URZ's work as a oneAPI CoE will add advanced DPC++ capabilities into hipSYCL, which supports systems based on AMD GPUs, NVIDIA GPUs, and CPUs. New DPC++ extensions are part of the SYCL 2020 provisional specification and bring features such as unified shared memory to hipSYCL and the platforms it supports - furthering the promise of oneAPI application support across system architectures and vendors.
23 Comments on Intel Partners with Heidelberg University Computing Center to Establish oneAPI Academic Center of Excellence
CUDA only prospers as long as the only options are the broken and painful OpenCL and Vulkan Compute. Both of which NV makes sure remain broken.
If AMD enables oneAPI on GCN and RDNA/CDNA, the "CUDA is easier" argument goes away. You get more compute grunt for far less... NV has got to be sweating, because CUDA drives a lot of money.
Rust is good for memory-safety, but really bad for actual pointer arithmetic, pointer-sharing, and other such details. And this kind of pointer-sharing is very important for performance across heterogeneous systems. A C-like language with raw pointer manipulation is absolutely necessary, and C++ is the most advanced language that supports that level of detail.
developer.nvidia.com/blog/unified-memory-cuda-beginners/ It's been available on CUDA for even longer. I don't remember the history exactly, but OpenCL 2.0 could do this back in 2013 or so. This feature is probably 10 years old, if we count from whenever it started working on CUDA.
-------
EDIT: Think about raytracing and BVH trees. How exactly do you expect the GPU and CPU to share the same BVH tree unless memory pointers work exactly the same on both systems? This feature is a necessary precursor to RTX and other such features now showing up in today's GPUs.
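For reference, the feature described here is what CUDA exposes as cudaMallocManaged (unified/managed memory): the CPU and GPU dereference the exact same pointer values. Below is a minimal sketch of that idea; the toy Node type and names are purely illustrative, standing in for something like a BVH node.

```
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: a toy "node" standing in for something like a BVH node.
struct Node {
    int   value;
    Node* next;   // the same pointer value is valid on CPU and GPU with managed memory
};

__global__ void bumpValues(Node* head) {
    // Single-threaded walk, purely to show the GPU following CPU-built pointers.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (Node* n = head; n != nullptr; n = n->next) {
            n->value += 1;
        }
    }
}

int main() {
    // cudaMallocManaged returns a pointer both the CPU and GPU can dereference.
    Node* nodes = nullptr;
    cudaMallocManaged(&nodes, 3 * sizeof(Node));

    // CPU builds a linked structure using ordinary pointers.
    for (int i = 0; i < 3; ++i) {
        nodes[i].value = i;
        nodes[i].next  = (i < 2) ? &nodes[i + 1] : nullptr;
    }

    // GPU walks the exact same pointers -- no marshalling or translation.
    bumpValues<<<1, 32>>>(nodes);
    cudaDeviceSynchronize();

    printf("%d %d %d\n", nodes[0].value, nodes[1].value, nodes[2].value); // prints: 1 2 3

    cudaFree(nodes);
    return 0;
}
```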
On both AMD ROCm and CUDA systems, we already have good libraries (like CUB) that are accelerating most GPU-style paradigms (nvlabs.github.io/cub/). You could rewrite that all in unsafe Rust, but why? What's the benefit?
I'm not even sure how Rust's "ownership model" applies to SIMD-style or prefix-sum style code. This isn't "normal" code that you see in the CPU world... things are written in a different manner entirely. GPU performance characteristics are very alien, and I don't expect Rust's memory model to be efficient on GPUs at all.
The closest code that works like GPUs is OpenMP (which is only implemented in Fortran, C, and C++). Julia is kind of making progress in its own way too, but Rust is nowhere close to usable on GPUs. There's a certain "parallelism mindset" you need to put yourself into to write effective GPU code, and I'm just not seeing how Rust supports that mindset in any way.
I mean, look at this: github.com/xiph/rav1e
60% assembly code, but it still makes sense to use Rust for the rest. And encoders are a breed on their own, most projects would do just fine with 10% or less unsafe code.
In CUDA, when you want to get a section of memory on the GPU, you call cudaMalloc(&gpuPTR, size_of_blah) and get back a void* gpuPTR pointing at device memory. There are several kinds of cudaMalloc, depending on details that matter quite a bit for performance. This is all CPU-side still; we haven't even touched GPU code yet.
Once you've set up the data structures inside this gpuPTR as appropriate, you send the pointer to the GPU with a kernel invocation, such as fooBar<<<2, 64>>>(gpuPTR): two blocks of 64 cudaThreads running fooBar on the GPU, with gpuPTR passed to all 128 of them. After you make this call, your CPU code runs in parallel with the GPU code.
fooBar is any __global__ specified C++ function, such as:
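The example kernel isn't quoted here, but a minimal sketch consistent with the <<<2, 64>>> launch above might look like the following; the kernel body is purely illustrative.

```
#include <cuda_runtime.h>

// A minimal __global__ function matching the invocation above.
// Illustrative body: each of the 128 cudaThreads writes one float.
__global__ void fooBar(void* gpuPTR) {
    float* data = static_cast<float*>(gpuPTR);
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // 0..127 for <<<2, 64>>>
    data[tid] = 2.0f * tid;
}

int main() {
    // Host side: allocate GPU memory, launch, then wait for the GPU to finish.
    void* gpuPTR = nullptr;
    cudaMalloc(&gpuPTR, 128 * sizeof(float));

    fooBar<<<2, 64>>>(gpuPTR);      // 2 blocks x 64 threads = 128 cudaThreads
    cudaDeviceSynchronize();        // CPU ran in parallel with the GPU until here

    cudaFree(gpuPTR);
    return 0;
}
```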
Now, I presume you want to write Rust code for fooBar. Where exactly will you be able to ensure memory safety of gpuPTR? Which of the 128 threads has "ownership" of the memory? Or do you leave ownership with the CPU?
Also, "CUDA threads" do not have the same characteristics as "real" CPU Threads. Its an abstraction (one that NVidia keeps getting their GPUs closer and closer to over time... but its still not quite perfect). For example, 32-threads is the minimum practical CUDA-thread count. Intra-block threads can communicate effectively, but inter-block threads cannot communicate very easily (we have two thread blocks of 64: meaning thread 0 through 63 can communicate effectively, but thread0 and thread75 cannot. Thread75 is in the block of threads64 to thread128 block).
EDIT: Ah, to finish my point. It seems to me like the entirety of GPU code (i.e. everything inside the __global__ fooBar function) will be inherently unsafe. Even if you made a safeCudaMalloc() that was managed on the CPU side, the poor communication mechanisms between GPU blocks (i.e. thread 0 vs. thread 75) make any such "memory-safety communication" on the GPU side a fool's errand. It doesn't seem like GPU-side code could be written in safe Rust at all, at least in my opinion.
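To make the intra- vs. inter-block distinction concrete, here's a minimal sketch (names are illustrative, assuming a <<<2, 64>>> launch like the one above): threads in the same block can coordinate through __shared__ memory and __syncthreads(), while there is no equivalent shared memory or barrier spanning two blocks within a single kernel launch.

```
// Illustrative sketch, assuming a launch like rotateWithinBlock<<<2, 64>>>(data).
// Threads 0..63 of a block coordinate through shared memory; thread 75 sits in
// the other block and cannot see this block's shared memory or join its barrier.
__global__ void rotateWithinBlock(int* data) {
    __shared__ int tile[64];                    // visible to this block only
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[tid];              // each thread stages one value
    __syncthreads();                            // barrier for this block's 64 threads

    // Read a neighbour's staged value -- only possible within the same block.
    data[tid] = tile[(threadIdx.x + 1) % blockDim.x];
}
```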
The upside being, of course, that if only one function can have write access, you don't need to worry about concurrent modifications (hence the motto "fearless concurrency").
Furthermore, a GPU kernel call spawns many identical copies of the same program. There can be a degree of cudaThread communication within a block (and even outside the block, if you're willing to accept severe performance penalties). It's just that efficient execution is the primary goal when writing GPU code, which means you're inherently going to have many of your threads reading AND writing these blocks of RAM.
The standard "GPU Style" is prefix-sum to coordinate who is writing, and where, to minimize issues. I suggest reading this paper for GPU-Stream Compaction, which is highly efficient, and allows an entire block of cudaThreads (up to 1024 cudaThreads) to efficiently read/write to the same array without stepping on each other's toes: www.cse.chalmers.se/~uffe/streamcompaction.pdf. I simply don't see how this kind of (very common) GPU-style of code can ever be written in "Safe Rust".
But it's true, with Rust you'd need to rethink the code and feed each thread just the data it's supposed to modify.
Somewhat unrelated, but even my novice understanding of Rust's way of thinking has enabled me to model problems much better, even if not programming in Rust.
EDIT:
Just editing in a picture from the paper I posted a bit earlier, showing how an array can be stream-compacted in parallel, safely and simply. For clarity, there are 16 threads (0 through 15) compacting this array.
This style is alien to CPU programmers. But once you get used to it, it's surprisingly effective and simple to think about.
Joking aside, you can use C/C++ for the GPU if that's safe enough, while using Rust for the more mundane code elsewhere. That's all I was trying to say.
And it's highly efficient on GPUs. Don't knock it until you try it :cool: