Wednesday, September 30th 2020
Intel Partners with Heidelberg University Computing Center to Establish oneAPI Academic Center of Excellence
Intel and Heidelberg University Computing Center (URZ) today announced that they have established a oneAPI Academic Center of Excellence (CoE) at URZ. The new CoE aims to further develop Intel's oneAPI standard and enable it to work on AMD GPUs. That may come as a surprise, but Intel believes the technology should work across a wide range of processors, regardless of vendor; heterogeneous hardware programming is the main goal here. In a Twitter thread, an Intel employee noted that Intel has also been working with Arm and NVIDIA to bring Data Parallel C++ (DPC++), the core of oneAPI, to those vendors as well. That would extend this universal programming model to every device and platform, which is the point of heterogeneous programming: whether you need to program a CPU, a GPU, or some other ASIC, it is covered by a single API, in this case oneAPI.
Sources:
URZ, David Schor (Twitter)
URZ's work as a oneAPI CoE will add advanced DPC++ capabilities into hipSYCL, which supports systems based on AMD GPUs, NVIDIA GPUs, and CPUs. New DPC++ extensions are part of the SYCL 2020 provisional specification and bring features such as unified shared memory to hipSYCL and the platforms it supports - furthering the promise of oneAPI application support across system architectures and vendors.
23 Comments on Intel Partners with Heidelberg University Computing Center to Establish oneAPI Academic Center of Excellence
CUDA only prospers as long as the only options are the broken and painful OpenCL and Vulkan Compute. Both of which NV makes sure remain broken.
If AMD enables oneAPI on GCN and RDNA/CDNA, the "CUDA is easier" argument goes away. You get more compute grunt for far less... NV has got to be sweating, because CUDA drives a lot of money.
Rust is good for memory-safety, but really bad for actual pointer arithmetic, pointer-sharing, and other such details. And this kind of pointer-sharing is very important for performance across heterogeneous systems. A C-like language with raw pointer manipulation is absolutely necessary, and C++ is the most advanced language that supports that level of detail.
developer.nvidia.com/blog/unified-memory-cuda-beginners/ It's been available on CUDA for even longer. I don't remember the history exactly, but OpenCL 2.0 could do this back in 2013 or so. This feature is probably 10 years old, if we count from whenever it started working on CUDA.
-------
EDIT: Think about raytracing and BVH trees. How exactly do you expect the GPU and CPU to share the same BVH tree unless memory pointers work exactly the same on both systems? This feature is a necessary precursor to RTX and other such features now showing up in today's GPUs.
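For reference, the feature described here is what CUDA exposes as cudaMallocManaged (unified/managed memory): the CPU and GPU dereference the exact same pointer values. Below is a minimal sketch of that idea; the toy Node type and names are purely illustrative, standing in for something like a BVH node.

```
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: a toy "node" standing in for something like a BVH node.
struct Node {
    int   value;
    Node* next;   // the same pointer value is valid on CPU and GPU with managed memory
};

__global__ void bumpValues(Node* head) {
    // Single-threaded walk, purely to show the GPU following CPU-built pointers.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (Node* n = head; n != nullptr; n = n->next) {
            n->value += 1;
        }
    }
}

int main() {
    // cudaMallocManaged returns a pointer both the CPU and GPU can dereference.
    Node* nodes = nullptr;
    cudaMallocManaged(&nodes, 3 * sizeof(Node));

    // CPU builds a linked structure using ordinary pointers.
    for (int i = 0; i < 3; ++i) {
        nodes[i].value = i;
        nodes[i].next  = (i < 2) ? &nodes[i + 1] : nullptr;
    }

    // GPU walks the exact same pointers -- no marshalling or translation.
    bumpValues<<<1, 32>>>(nodes);
    cudaDeviceSynchronize();

    printf("%d %d %d\n", nodes[0].value, nodes[1].value, nodes[2].value); // prints: 1 2 3

    cudaFree(nodes);
    return 0;
}
```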
On both AMD ROCm and CUDA systems, we already have good libraries (like CUB) that are accelerating most GPU-style paradigms (nvlabs.github.io/cub/). You could rewrite that all in unsafe Rust, but why? What's the benefit?
I'm not even sure how Rust's "ownership model" applies to SIMD-style or prefix-sum style code. This isn't "normal" code that you see in the CPU world... things are written in a different manner entirely. GPU performance characteristics are very alien, and I don't expect Rust's memory model to be efficient on GPUs at all.
The closest code that works like GPUs is OpenMP (which is only implemented in Fortran, C, and C++). Julia is kind of making progress in its own way too, but Rust is nowhere close to usable on GPUs. There's a certain "parallelism mindset" you need to put yourself into to write effective GPU code, and I'm just not seeing how Rust supports that mindset in any way.
I mean, look at this: github.com/xiph/rav1e
60% assembly code, but it still makes sense to use Rust for the rest. And encoders are a breed on their own, most projects would do just fine with 10% or less unsafe code.
In CUDA, when you want to get a section of memory on the GPU, you call cudaMalloc(&gpuPTR, size_of_blah) and get back a void* gpuPTR pointing at device memory. There are several kinds of cudaMalloc, depending on details that matter quite a bit for performance. This is all CPU-side still; we haven't even touched GPU code yet.
Once you've set up the data structures inside this gpuPTR as appropriate, you send the pointer to the GPU with a kernel invocation, such as fooBar<<<2, 64>>>(gpuPTR): two blocks of 64 cudaThreads running fooBar on the GPU, with gpuPTR passed to all 128 of them. After you make this call, your CPU code runs in parallel with the GPU code.
fooBar is any __global__ specified C++ function, such as:
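The example kernel isn't quoted here, but a minimal sketch consistent with the <<<2, 64>>> launch above might look like the following; the kernel body is purely illustrative.

```
#include <cuda_runtime.h>

// A minimal __global__ function matching the invocation above.
// Illustrative body: each of the 128 cudaThreads writes one float.
__global__ void fooBar(void* gpuPTR) {
    float* data = static_cast<float*>(gpuPTR);
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // 0..127 for <<<2, 64>>>
    data[tid] = 2.0f * tid;
}

int main() {
    // Host side: allocate GPU memory, launch, then wait for the GPU to finish.
    void* gpuPTR = nullptr;
    cudaMalloc(&gpuPTR, 128 * sizeof(float));

    fooBar<<<2, 64>>>(gpuPTR);      // 2 blocks x 64 threads = 128 cudaThreads
    cudaDeviceSynchronize();        // CPU ran in parallel with the GPU until here

    cudaFree(gpuPTR);
    return 0;
}
```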
Now, I presume you want to write Rust code for fooBar. Where exactly will you be able to ensure memory safety of gpuPTR? Which of the 128 threads has "ownership" of the memory? Or do you leave ownership with the CPU?
Also, "CUDA threads" do not have the same characteristics as "real" CPU Threads. Its an abstraction (one that NVidia keeps getting their GPUs closer and closer to over time... but its still not quite perfect). For example, 32-threads is the minimum practical CUDA-thread count. Intra-block threads can communicate effectively, but inter-block threads cannot communicate very easily (we have two thread blocks of 64: meaning thread 0 through 63 can communicate effectively, but thread0 and thread75 cannot. Thread75 is in the block of threads64 to thread128 block).
EDIT: Ah, to finish my point. It seems to me like the entirety of GPU code (i.e. everything inside the __global__ fooBar function) will be inherently unsafe. Even if you made a safeCudaMalloc() that was managed on the CPU side, the poor communication mechanisms between GPU blocks (i.e. thread 0 vs. thread 75) make any such "memory-safety communication" on the GPU side a fool's errand. It doesn't seem like GPU-side code could be written in safe Rust at all, at least in my opinion.
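To make the intra- vs. inter-block distinction concrete, here's a minimal sketch (names are illustrative, assuming a <<<2, 64>>> launch like the one above): threads in the same block can coordinate through __shared__ memory and __syncthreads(), while there is no equivalent shared memory or barrier spanning two blocks within a single kernel launch.

```
// Illustrative sketch, assuming a launch like rotateWithinBlock<<<2, 64>>>(data).
// Threads 0..63 of a block coordinate through shared memory; thread 75 sits in
// the other block and cannot see this block's shared memory or join its barrier.
__global__ void rotateWithinBlock(int* data) {
    __shared__ int tile[64];                    // visible to this block only
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[tid];              // each thread stages one value
    __syncthreads();                            // barrier for this block's 64 threads

    // Read a neighbour's staged value -- only possible within the same block.
    data[tid] = tile[(threadIdx.x + 1) % blockDim.x];
}
```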
The upside being, of course, that if only one function can have write access, you don't need to worry about concurrent modifications (hence the motto "fearless concurrency").
Furthermore, a GPU kernel call spawns many identical copies of the same program. There can be a degree of cudaThread communication within a block (and even outside the block, if you're willing to accept severe performance penalties). It's just that efficient execution is the primary goal when writing GPU code, which means you're inherently going to have many of your threads reading AND writing these blocks of RAM.
The standard "GPU Style" is prefix-sum to coordinate who is writing, and where, to minimize issues. I suggest reading this paper for GPU-Stream Compaction, which is highly efficient, and allows an entire block of cudaThreads (up to 1024 cudaThreads) to efficiently read/write to the same array without stepping on each other's toes: www.cse.chalmers.se/~uffe/streamcompaction.pdf. I simply don't see how this kind of (very common) GPU-style of code can ever be written in "Safe Rust".
But it's true, with Rust you'd need to rethink the code and feed each thread just the data it's supposed to modify.
Somewhat unrelated, but even my novice understanding of Rust's way of thinking has enabled me to model problems much better, even if not programming in Rust.
EDIT:
Just editing in a picture from the paper I posted a bit earlier, showing how an array can be stream-compacted in parallel, safely and simply. For clarity, there are 16 threads (0 through 15) compacting this array.
This style is alien to CPU programmers. But once you get used to it, it's surprisingly effective and simple to think about.
Joking aside, you can use C/C++ for the GPU if that's safe enough, while using Rust for the more mundane code elsewhere. That's all I was trying to say.
And it's highly efficient on GPUs. Don't knock it until you try it :cool: