
X-Silicon Startup Wants to Combine RISC-V CPU, GPU, and NPU in a Single Processor

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,552 (0.97/day)
We are all used to systems with a separate CPU, GPU, and, more recently, NPU. X-Silicon Inc. (XSi), a startup founded by Silicon Valley veterans, has unveiled a RISC-V processor that can handle CPU, GPU, and NPU workloads simultaneously on a single chip. The chip architecture, which will be open-source, aims to provide a flexible and efficient solution for a wide range of applications, including artificial intelligence, virtual reality, automotive systems, and IoT devices. The new microprocessor combines a RISC-V CPU core with vector capabilities and GPU acceleration into a single chip, creating a versatile all-in-one processor. By integrating the functionality of a CPU and GPU into a single core, X-Silicon's design offers several advantages over traditional architectures. The chip uses the open-source RISC-V instruction set architecture (ISA) for both CPU and GPU operations, running a single instruction stream. This approach promises lower memory footprint execution and improved efficiency, as there is no need to copy data between separate CPU and GPU memory spaces.

Called the C-GPU architecture, X-Silicon's design uses a RISC-V Vector Core, which has 16 32-bit FPUs and a scalar ALU for processing regular integer as well as floating-point instructions. A unified instruction decoder feeds the cores, which are connected to a thread scheduler, texture unit, rasterizer, clipping engine, neural engine, and pixel processors. Everything is fed into a frame buffer, which feeds the video engine for video output. The setup of the cores allows users to program each core individually for HPC, AI, video, or graphics workloads. Without software, there is no usable chip, so X-Silicon is working on support for the OpenGL ES, Vulkan, Mesa, and OpenCL APIs. Additionally, the company plans to release a hardware abstraction layer (HAL) for direct chip programming. According to Jon Peddie Research (JPR), the industry has been seeking an open-standard GPU that is flexible and scalable enough to support various markets. X-Silicon's CPU/GPU hybrid chip aims to address this need by providing manufacturers with a single, open chip design that can handle any desired workload. XSi gave no timeline, but it plans to distribute the IP to OEMs and hyperscalers, so first silicon is still some way off.



 
Joined
Oct 26, 2005
Messages
656 (0.09/day)
Location
Madrid, Spain
System Name Sample light
Processor AMD Phenom II 965
Motherboard MSI DKA790GX
Cooling Zalman 9500LED
Memory 2x 2GB G-skill PC8500
Video Card(s) Asus built AMD/ATI HD4850
Storage 500GB Western Digital green
Display(s) too ashamed to tell, but its an acer
Case Lian Li V2000
Audio Device(s) Echo Audiofire 2
Power Supply Corsair VX550
Software Windows XP pro SP3
I suppose that using the architecture that's most efficient for each unit would be an idea. But wouldn't the code translation between different CPU segments connected to the pipeline basically cripple any performance gains?

Sounds like a startup using buzzwords to get money.
 
Joined
Jan 8, 2017
Messages
9,414 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
I doubt that's related, could be some mistranslation. I cannot imagine what a "transcendental math unit" is, transcendental numbers have the ironic property of not being computable as they can only be approximated. So there is nothing a dedicated unit could achieve that you couldn't just do with whatever other generic piece of hardware.
 
Joined
Jun 29, 2018
Messages
534 (0.23/day)
I doubt that's related, could be some mistranslation. I cannot imagine what a "transcendental math unit" is, transcendental numbers have the ironic property of not being computable as they can only be approximated. So there is nothing a dedicated unit could achieve that you couldn't just do with whatever other generic piece of hardware.
It's not a mistranslation. Logarithmic, exponential, trigonometric, and hyperbolic functions are also transcendental. You're correct that they (generally) have to be approximated, but even then it's simply more efficient to have dedicated hardware to do it. Even the first external FPUs in the x86 world had dedicated hardware acceleration for some of them. You could do FP math on integer units, but it was very inefficient. The same principle applies here.
NVIDIA for example has been using Special Function Units (SFU) to handle them, located alongside more general CUDA cores inside Streaming Multiprocessors (SM) since at least G80. In one form or another they have been present in every graphics-related acceleration implementation.
 
Joined
Nov 26, 2021
Messages
1,633 (1.51/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04
I doubt that's related, could be some mistranslation. I cannot imagine what a "transcendental math unit" is, transcendental numbers have the ironic property of not being computable as they can only be approximated. So there is nothing a dedicated unit could achieve that you couldn't just do with whatever other generic piece of hardware.
Approximations are good enough for many purposes, and the functions such units handle include sine, cosine, reciprocal, and square root. These are used in 3D graphics as well.
 
Joined
Jan 10, 2011
Messages
1,443 (0.29/day)
Location
[Formerly] Khartoum, Sudan.
System Name 192.168.1.1~192.168.1.100
Processor AMD Ryzen5 5600G.
Motherboard Gigabyte B550m DS3H.
Cooling AMD Wraith Stealth.
Memory 16GB Crucial DDR4.
Video Card(s) Gigabyte GTX 1080 OC (Underclocked, underpowered).
Storage Samsung 980 NVME 500GB && Assortment of SSDs.
Display(s) ViewSonic VA2406-MH 75Hz
Case Bitfenix Nova Midi
Audio Device(s) On-Board.
Power Supply SeaSonic CORE GM-650.
Mouse Logitech G300s
Keyboard Kingston HyperX Alloy FPS.
VR HMD A pair of OP spectacles.
Software Ubuntu 24.04 LTS.
Benchmark Scores Me no know English. What bench mean? Bench like one sit on?
This approach promises lower memory footprint execution and improved efficiency, as there is no need to copy data between separate CPU and GPU memory spaces.
We already have that with UMA.

I don't really see potential in this idea. Traditional SoCs have already tried minimizing the physical gap between general and graphics processors, with mixed results. Sure, you get simpler memory management and lower latency, but in exchange you get a lower power envelope and much more limited real estate. Merging all aspects into one makes those limitations even worse, plus you add a much more complex scheduler and, most likely, a humongous instruction set.

A jack of all trades is a master of none.
 
Joined
Jan 8, 2017
Messages
9,414 (3.28/day)
NVIDIA for example has been using Special Function Units (SFU) to handle them, located alongside more general CUDA cores inside Streaming Multiprocessors (SM) since at least G80. In one form or another they have been present in every graphics-related acceleration implementation.
This is a misunderstanding, those "units" don't do anything special, it's just a bunch of instructions that implement a subroutine wrapped into a single machine code instruction so that the programmer doesn't have to do that on their own. PTX has an instruction to compute "sin" of something for example but there is obviously no such thing as a sin instruction in hardware, it's just an abstraction for a bunch of other instructions.
 
Joined
Jun 29, 2018
Messages
534 (0.23/day)
This is a misunderstanding, those "units" don't do anything special, it's just a bunch of instructions that implement a subroutine wrapped into a single machine code instruction so that the programmer doesn't have to do that on their own. PTX has an instruction to compute "sin" of something for example but there is obviously no such thing as a sin instruction in hardware, it's just an abstraction for a bunch of other instructions.
The CUDA C++ Programming Guide specifically mentions hardware SFUs for every Compute Capability up to the latest 9.0. It's because they are still used for approximations of transcendentals, sometimes with additional general CUDA cores being involved. The implementations fluctuated from version to version, but those hardware units are still faster than general approaches.

Edit: NVIDIA published a paper on SFU's design in 2005, you can find it on IEEE Xplore if you have access or on Anna's Archive if you don't via DOI:10.1109/ARITH.2005.7
While it's an older paper it explains why a specialized hardware implementation was beneficial vs. a general approach.
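For anyone wondering what "approximation in a dedicated unit" can look like, one common approach is a small lookup table plus interpolation. Below is a rough software analogy in C, not a reproduction of the paper's design: the function name `rsqrt_table`, the 8-segment table, and the restriction to inputs in [1, 2) are all assumptions made for illustration, and linear interpolation is used only to keep it short.

```c
#include <assert.h>

/* 1/sqrt(x) sampled at x = 1.0, 1.125, ..., 2.0 (9 points, 8 segments).
 * In hardware such a table would sit in ROM; here it is a constant array. */
static const float RSQRT_TABLE[9] = {
    1.0f,            0.942809041582f, 0.894427191000f,
    0.852802865422f, 0.816496580928f, 0.784464540553f,
    0.755928946018f, 0.730296743340f, 0.707106781187f
};

/* Approximate 1/sqrt(x) for x in [1, 2) by table lookup and linear
 * interpolation. Hypothetical helper, not part of any library. */
float rsqrt_table(float x)
{
    assert(x >= 1.0f && x < 2.0f);          /* no range reduction in this sketch */

    float pos  = (x - 1.0f) * 8.0f;         /* position in units of one segment */
    int   idx  = (int)pos;                  /* which segment (0..7) */
    float frac = pos - (float)idx;          /* how far into the segment (0..1) */

    /* Blend the two neighbouring table entries. */
    return RSQRT_TABLE[idx] + frac * (RSQRT_TABLE[idx + 1] - RSQRT_TABLE[idx]);
}
```

With only 8 segments the result is good to roughly two or three decimal digits on [1, 2); finer tables or higher-order interpolation trade table size for accuracy.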
 
Joined
Jan 8, 2017
Messages
9,414 (3.28/day)
The CUDA C++ Programming Guide specifically mentions hardware SFUs for every Compute Capability up to the latest 9.0. It's because they are still used for approximations of transcendentals, sometimes with additional general CUDA cores being involved.
You don't understand what I am saying. This is similar to integer multiplication: there is no integer multiplication in hardware, it's just a subroutine that adds integers repeatedly. Most early processors didn't have a mul instruction because it wasn't seen as necessary; only later did most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop. It's said that the operation is implemented in hardware, but it's just running other micro ops under the hood. Same thing applies for matrix or tensor ops.

The implementations fluctuated from version to version, but those hardware units are still faster than general approaches.
There is actually no way to know this since Nvidia doesn't publish their ISA, so no one has handwritten a subroutine for trigonometric functions so that we'd know for a fact that it's slower. Again, there is no secret sauce, if you look into the output of an x86 compiler you'll see that they don't even use those trigonometric instructions because you can write faster versions based on lookup tables which make no sense to implement in hardware and I have no reason to believe it's any different on GPUs.
 
Joined
Jun 29, 2018
Messages
534 (0.23/day)
You don't understand what I am saying. This is similar to integer multiplication: there is no integer multiplication in hardware, it's just a subroutine that adds integers repeatedly. Most early processors didn't have a mul instruction because it wasn't seen as necessary; only later did most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop. It's said that the operation is implemented in hardware, but it's just running other micro ops under the hood. Same thing applies for matrix or tensor ops.


There is actually no way to know this since Nvidia doesn't publish their ISA, so no one has handwritten a subroutine for trigonometric functions so that we'd know for a fact that it's slower. Again, there is no secret sauce, if you look into the output of an x86 compiler you'll see that they don't even use those trigonometric instructions because you can write faster versions based on lookup tables which make no sense to implement in hardware and I have no reason to believe it's any different on GPUs.
I understood what you meant.
Did you even look at the paper? It explains their implementation which is using lookup tables in ROM because doing it "by hand" every time is wasteful.
Anyway we know enough about NVIDIA SASS to know that instructions containing the MUFU. prefix are handled by the SFUs.
You can take a simple example:

C:
#include <stdio.h>

#define ITER 1024

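// Kernel: chain ITER dependent rsqrtf() calls so they show up in the disassembly.
__global__ void do_rsqrtf(float in) {
    for (int i = 0; i < ITER; i++)
        in = rsqrtf(in);
    // Use the result so the compiler cannot optimize the loop away.
    if (!in) printf("%f\n", in);
}

int main() {
    do_rsqrtf<<<256*ITER, 256>>>(1.0f);
    cudaDeviceSynchronize();  // wait for the kernel to finish before exiting
    return 0;
}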

Compile it with
Code:
nvcc -arch=sm_86 -use_fast_math
and you'll get:

PTX by cuobjdump -ptx:
Code:
[...]
rsqrt.approx.ftz.f32 %f4, %f67;
rsqrt.approx.ftz.f32 %f5, %f4;
rsqrt.approx.ftz.f32 %f6, %f5;
[...]

and SASS by cuobjdump -sass:
Code:
[...]
        /*0070*/                   MUFU.RSQ R3, R2 ;                         /* 0x0000000200037308 */
                                                                             /* 0x001e300000001400 */
        /*0080*/                   MUFU.RSQ R3, R3 ;                         /* 0x0000000300037308 */
                                                                             /* 0x001e300000001400 */
        /*0090*/                   MUFU.RSQ R4, R3 ;                         /* 0x0000000300047308 */
                                                                             /* 0x001e300000001400 */
        /*00a0*/                   MUFU.RSQ R4, R4 ;                         /* 0x0000000400047308 */

[...]

So that's using SFUs directly. Remove the -use_fast_math and you'll get a hybrid version:

Code:
[...]
        /*0080*/              @!P0 FMUL R5, R5, 16777216 ;                                     /* 0x4b80000005058820 */
                                                                                               /* 0x000fc80000400000 */
        /*0090*/                   MUFU.RSQ R0, R5 ;                                           /* 0x0000000500007308 */
                                                                                               /* 0x000e240000001400 */
        /*00a0*/              @!P0 FMUL R0, R0, 4096 ;                                         /* 0x4580000000008820 */
                                                                                               /* 0x001fca0000400000 */
        /*00b0*/                   FSETP.GEU.AND P0, PT, |R0|, 1.175494350822287508e-38, PT ;  /* 0x008000000000780b */
                                                                                               /* 0x000fda0003f0e200 */
        /*00c0*/              @!P0 FMUL R0, R0, 16777216 ;                                     /* 0x4b80000000008820 */
                                                                                               /* 0x000fc80000400000 */
        /*00d0*/                   MUFU.RSQ R3, R0 ;                                           /* 0x0000000000037308 */
                                                                                               /* 0x000e240000001400 */
        /*00e0*/              @!P0 FMUL R3, R3, 4096 ;                                         /* 0x4580000003038820 */
                                                                                               /* 0x001fca0000400000 */
        /*00f0*/                   FSETP.GEU.AND P0, PT, |R3|, 1.175494350822287508e-38, PT ;  /* 0x008000000300780b */
[...]
 
Joined
Aug 31, 2021
Messages
21 (0.02/day)
Processor Ryzen 9 5900X
Motherboard Gigabyte Aorus B550i pro ax
Cooling Noctua NH-D15 chromax.black (with original fans)
Memory G.skill 32GB 3200MHz
Video Card(s) 4060ti 16GB
Storage 1TB Samsung PM9A1, 256GB Toshiba pcie3 (from a laptop), 512GB crucial MX500, 2x 1TB Toshiba HDD 2.5
Display(s) Mateview GT 34''
Case Thermaltake the tower 100, 1 noctua NF-A14 ippc3000 on top, 2 x Arctic F14
Power Supply Seasonic focus-GX 750W
Software Windows 10, Ubuntu when needed.
most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop
What the hell are you saying? There are HW units to do multiplications and I don't know ANY microprocessor (even little microcontrollers) which doesn't include them. And there are plenty of open-source microprocessors around, have a look at those if you don't believe me. Saying there are internal micro-operations to perform multiplications is crazy.
 
Joined
May 7, 2005
Messages
33 (0.00/day)
Location
Budapest, Hungary
This is what AMD wanted to achieve with project "Fusion" shortly after buying ATi back in 2006:
I still remember the image above. :)

Step3:
"The final step in the evolution of Fusion is where the CPU and GPU are truly integrated, and the GPU is accessed by user mode instructions just like the CPU. You can expect to talk to the GPU via extensions to the x86 ISA, and the GPU will have its own register file (much like FP and integer units each have their own register files)."

Link to this article:
 
Joined
Jan 8, 2017
Messages
9,414 (3.28/day)
Saying there are internal micro-operations to perform multiplications is crazy.
That's literally how it's done, you cannot simply multiply two integers in the same way you can add them in binary, multiply operations take several cycles on all processors out there for this very reason. Just look it up, there is nothing crazy about it, same applies to division.

The processor inside the C64 famously did not have a multiply instruction. You may find it odd but most processors today simply emulate instructions, it's not done for speed, back then they didn't do it because it simply took too much space.
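On CPUs without a multiply instruction, like the one in the C64, multiplication was typically done in software with a shift-and-add loop. Here is a minimal sketch of that technique in C. The function name `mul_shift_add` is made up for illustration, and the sketch says nothing about how any particular modern CPU implements its multiplier.

```c
#include <stdint.h>

/* Multiply two unsigned 32-bit integers using only shifts and adds.
 * Hypothetical helper for illustration; not taken from any library. */
uint64_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint64_t acc    = 0;
    uint64_t addend = a;      /* holds a << i for the bit i being examined */

    while (b != 0) {
        if (b & 1u)           /* bit i of b is set: it contributes a << i */
            acc += addend;
        addend <<= 1;         /* move on to the next bit position */
        b >>= 1;
    }
    return acc;
}
```

For example, `mul_shift_add(6, 7)` returns 42. The loop runs at most 32 times, once per bit of `b`, rather than adding `a` to itself `b` times.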

Anyway we know enough about NVIDIA SASS to know that instructions containing the MUFU. prefix are handled by the SFUs.
This doesn't really say anything other than what is already known, SASS actually gets compiled into micro ops, what those micro ops are no one knows. Anyway all of this is still in line with what I am saying, these instructions are just emulated, whether you do it at micro op level or in assembly it should make no difference. Why do you think that if you write modern C code using transcendental functions the compiler won't output any special instructions? Intel and AMD are just stupid? They don't know how to make dedicated really fast hardware for transcendental ops? No, it just doesn't matter.

GPU manufacturers are kind of forced to add these at the ISA level, because you can't get portable code and they write the compilers themselves and every time they make a new GPU they usually change the ISA as well, so they have to provide these primitives, not for speed reasons but for ease of use. You can't just plop in some library that does all this when you program on a GPU like you do on a CPU.
 
Joined
Jun 29, 2018
Messages
534 (0.23/day)
This doesn't really say anything other than what is already known, SASS actually gets compiled into micro ops, what those micro ops are no one knows. Anyway all of this is still in line with what I am saying, these instructions are just emulated, whether you do it at micro op level or in assembly it should make no difference.
And what do you think runs those microops? The same applies to your argument with IMUL - even when they are split into different operations internally they still run only on dedicated hardware optimized for those operations. Execution ports in x86 CPUs aren't uniform, that's why so much of the core is dedicated to scheduling in the first place.
Why do you think that if you write modern C code using transcendental functions the compiler won't output any special instructions? Intel and AMD are just stupid? They don't know how to make dedicated really fast hardware for transcendental ops? No, it just doesn't matter.
"C code using transcendental functions" - you mean calling a math library that has a platform-dependent implementation? Some do use direct CPU instructions, some do not, and for many different reasons.
GPU manufacturers are kind of forced to add these at the ISA level, because you can't get portable code and they write the compilers themselves and every time they make a new GPU they usually change the ISA as well, so they have to provide these primitives, not for speed reasons but for ease of use. You can't just plop in some library that does all this when you program on a GPU like you do on a CPU.
CUDA itself is the library you mention in the last sentence. That's why it's so powerful and popular. You can go lower with PTX, which is portable by the way, or attempt SASS analysis for the lowest available level. Then there's the microarchitectural level, which is not available to mere mortals, but that's no different than x86.

Why does every NVIDIA documentation piece, every architectural diagram of SMs mention SFUs if they are simply "emulated"? Why bother with specifying the number of units per SM for each generation?
You can even test it yourself by occupying all CUDA cores and then launching operations directly on SFUs.
 
Joined
Jan 8, 2017
Messages
9,414 (3.28/day)
The same applies to your argument with IMUL - even when they are split into different operations internally they still run only on dedicated hardware optimized for those operations. Execution ports in x86 CPUs aren't uniform, that's why so much of the core is dedicated to scheduling in the first place.
I don't know what you mean, the micro ops are always executed in the same way, they are the primitives the processor schedules, every micro op corresponds to a piece of hardware, there is no slower or faster hardware block that a micro op can run on.

Some do use direct CPU instructions, some do not, and for many different reasons.
I'd be amazed if you can find even one modern example. It still doesn't explain why you'll basically never see instructions like FSIN in the output of something like clang.

You can go lower with PTX, which is portable by the way, or attempt SASS analysis for the lowest available level. Then there's the microarchitectural level, which is not available to mere mortals, but that's no different than x86.
PTX is portable (most of the time), SASS is usually not and the binary is definitely not portable. It is different than x86 because with that everything down to the binary can be portable, this is also why you'd want to avoid using emulated or obscure instructions, if you look at the output of every compiler out there they pretty much always use the same few instructions even if there is a more comprehensive alternative that can achieve the same thing, it just doesn't matter, there is nothing to be gained.

every architectural diagram of SMs mention SFUs if they are simply "emulated"?
Because it's not relevant, in the same way that it's not relevant to elaborate on how a matrix or tensor op is actually achieved, those are partially or completely emulated as well.

By the way the reason they mention the number of those "SFUs" is because GPU cores are usually really limited in the number or combinations of instructions they can emit per clock, so it might be useful to know how that might impact performance.
 
Joined
Jun 29, 2018
Messages
534 (0.23/day)
I don't know what you mean, the micro ops are always executed in the same way, they are the primitives the processor schedules, every micro op corresponds to a piece of hardware, there is no slower or faster hardware block that a micro op can run on.
That's exactly my point. Transcendentals are in the end running only on SFUs (in the single-precision with -use_fast_math case) or with SFUs as the source (in the "normal" case where rounding is done on CUDA cores) - those are specialized hardware units to approximate them.
I'd be amazed if you can find even one modern example. It still doesn't explain why you'll basically never see instructions like FSIN in the output of something like clang.
Modern example of what exactly? Different libraries using different CPU instructions for transcendentals? It's easy to look at libm implementations - there's plenty. The old x87 instructions aren't used because AMD64 basically deprecated them by making SSE2 mandatory.
PTX is portable (most of the time), SASS is usually not and the binary is definitely not portable. It is different than x86 because with that everything down to the binary can be portable, this is also why you'd want to avoid using emulated or obscure instructions, if you look at the output of every compiler out there they pretty much always use the same few instructions even if there is a more comprehensive alternative that can achieve the same thing, it just doesn't matter, there is nothing to be gained.
x86 assembly isn't portable. You can't run every binary from Rocket Lake on Alder Lake because it lacks AVX-512 despite being a newer design, for example. It can be when you take into consideration architectural differences, but that's not different from PTX and targeting CUDA CC.
Because it's not relevant, in the same way that it's not relevant to elaborate on how a matrix or tensor op is actually achieved, those are partially or completely emulated as well.
Tensor ops are running on tensor cores. Ray tracing operations run on dedicated hardware too. Obviously they can be emulated by the driver on CUDA cores, but that's a different use case.
If the previously linked NVIDIA paper wasn't enough for you, then here's another one proving that SFUs do in fact exist, and can be used to improve performance and significantly improve energy efficiency of certain calculations. The authors were even able to use SFUs for double-precision, which is something NVIDIA itself doesn't support in CUDA.
 
Joined
Jan 8, 2017
Messages
9,414 (3.28/day)
The authors were even able to use SFUs for double-precision, which is something NVIDIA itself doesn't support in CUDA.
Can you guess why? Could it be that it's actually all just implemented generically and that's why it works for multiple data types? Just a thought. I bet the reason the number of SFUs is so much lower than the number of CUDA cores is because they block a lot of ports in the SM when they run.

x86 assembly isn't portable.
x86 is definitely portable, those are SIMD extensions, you can take something written in the 90s and it should technically run just fine on a modern system, the same cannot be said for pretty much everything GPU related, that's why most (all, really) software compiles shaders/kernels at runtime.

You are stuck in Nvidia jargon that doesn't mean much, use your common sense, if for instance a processor has a left logical shift instruction do you think it's worth putting that into its own little square on a diagram? AMD has instructions for transcendental functions as well in their GPUs, they never put that into architectural diagrams because it doesn't really matter, no other GPU maker that I know of does actually.

SFUs exist in the same sense CUDA cores exist, Nvidia has no concrete definition for what a "CUDA core" is. As far as I can tell total CUDA cores = number of FP32 FMAD units, it's not a core by any definition, it doesn't fetch and decode instructions by itself, yet they insist to call it a "core". Same with these SFUs, they do not fetch and decode instructions either, the SM does since that's what's actually generating the wavefronts, so what does that unit even do ? Anyway this has been going on for too long, if you think Nvidia has some special hardware block to compute sin and cos and whatnot, that's fine, whatever.
 
Joined
Jan 3, 2021
Messages
3,465 (2.46/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
The x86 SSE and AVX units/instruction sets actually can't do much arithmetic besides addition and multiplication (and FMA of course). SSE and SSE2 also have division, square root and reciprocals on 128-bit vectors, and that's all. So, for everything else, there must be a "transcendental math unit", however slow and unoptimised it may be. Intel uses the same term.

Approximations are good enough for many purposes
Approximations must be good enough for nearly all purposes. (Let's forget for a moment that we share this world with accountants.)
 
Joined
Jan 8, 2017
Messages
9,414 (3.28/day)
So, for everything else, there must be a "transcendental math unit", however slow and unoptimised it may be. Intel uses the same term.
Those are really old x87 instructions. Nothing modern is gonna generate code using that ISA; for all intents and purposes there is no transcendental math unit being used on a modern CPU.

It's slow because it's just doing a series expansion, a bunch of x - x^3/3! + x^5/5! - ... to calculate sin(x) for example. There is no special way to do that in hardware; it will just have to be a series of instructions to calculate that up to whatever term. You can try to use lookup tables or approximations that converge faster, but it will still all come down to a bunch of adds and multiplies in series. It's unavoidable.
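For illustration, here is a minimal Python sketch of that idea: sin(x) evaluated from the truncated series using nothing but multiplies and adds once the coefficients are known. The function name `sin_taylor`, the number of terms and the range-reduction step are assumptions made for this example; it is not how any particular CPU or math library implements it.

```python
import math

def sin_taylor(x, terms=10):
    """Approximate sin(x) with the first `terms` terms of x - x^3/3! + x^5/5! - ..."""
    # Range reduction (an assumption of this sketch): fold x into [-pi, pi]
    # so that a short series is accurate enough.
    x = math.remainder(x, 2.0 * math.pi)
    # Coefficients (-1)^k / (2k+1)!; a real implementation would store these as constants.
    coeffs = [(-1) ** k / math.factorial(2 * k + 1) for k in range(terms)]
    # Horner's scheme in x^2: each loop iteration is one multiply and one add.
    x2 = x * x
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x2 + c
    return x * acc
```

With 10 terms the truncation error at the edge of the reduced range is on the order of 1e-9, so calling `sin_taylor(1.0)` should agree with `math.sin(1.0)` to many digits. Fewer terms give a faster but rougher result.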
 
Joined
Jan 3, 2021
Messages
3,465 (2.46/day)
Those are really old x87 instructions. Nothing modern is gonna generate code using that ISA; for all intents and purposes there is no transcendental math unit being used on a modern CPU.

It's slow because it's just doing a series expansion, a bunch of x - x^3/3! + x^5/5! - ... to calculate sin(x) for example. There is no special way to do that in hardware; it will just have to be a series of instructions to calculate that up to whatever term. You can try to use lookup tables or approximations that converge faster, but it will still all come down to a bunch of adds and multiplies in series. It's unavoidable.
Yes. x87 transcendental functions are slow and accurate, and there are no "fast and dirty" variants with ~32-bit accuracy for example, which would often suffice.

However, at least the scalar integer and FP multiplications have been optimised to no end - they can be issued at a rate of *one per clock cycle* on recent CPUs, at least since Skylake and Zen 1, though each result still has a latency of a few cycles. Vector integer and double-precision FP multiplications seem to be equally fast, but I don't understand enough about data formats there. Agner Fog collected all the data you can think of.
 
Joined
May 3, 2018
Messages
2,881 (1.21/day)
You don't understand what I am saying, this is similar to integer multiplication, there is no integer multiplication in hardware, it's just a subroutine that adds integers repeatedly, most early processors didn't have a mul instruction because it wasn't seen as necessary only later did most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop, it's said that the operation is implemented in hardware but it's just running other micro ops under the hood. Same thing applies for matrix or tensor ops.
CPUs have a binary multiplier that does not merely add integers repeatedly, which would be grossly inefficient. Most techniques involve computing the set of partial products, which are then summed together using binary adders. CPU multipliers only take ~3 cycles vs 1 cycle for addition, IIRC.
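A rough Python sketch of the difference, for illustration only. The function names and the 32-bit default width are assumptions for this example. In hardware the partial products are generated at once and summed by an adder tree in parallel; the sequential `sum` here only models the arithmetic, not the timing.

```python
def mul_repeated_add(a, b):
    """Multiply by adding `a` to an accumulator `b` times. Returns (product, additions)."""
    acc = 0
    for _ in range(b):
        acc += a
    return acc, b

def mul_partial_products(a, b, width=32):
    """Multiply unsigned `width`-bit integers by summing one partial product per bit of b.

    Returns (product, number_of_partial_products).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # Partial product i is `a` shifted left by i if bit i of b is set, else zero.
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    return sum(partials), len(partials)
```

For 1234 * 5678 the first routine needs 5678 additions, while the second always sums exactly `width` partial products regardless of the operand values.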
 
Joined
Jun 29, 2018
Messages
534 (0.23/day)
Can you guess why ? Could it be that it's actually all just implemented generically and that's why it works for multiple data types ? Just a thought. I bet the reason the number of SFUs is so much lower than the amount of CUDA cores is because they block a lot ports in the SM when they run.
I don't need to guess why since it's spelled out in the paper which you refused to even look at, again. And no, it's not what you think it is.

x86 is definitely portable, those are SIMD extensions, you can take something written in the 90s and it should technically run just fine on a modern system, the same cannot be said for pretty much everything GPU related, that's why most (all, really) software compiles shaders/kernels at runtime.
No, it wouldn't just run, since a few obscure x86 instructions, like VERW, were recently repurposed for mitigations. Most would run, but not all.

You are stuck in Nvidia jargon that doesn't mean much, use your common sense, if for instance a processor has a left logical shift instruction do you think it's worth putting that into it's own little square on a diagram ? AMD has instructions for transcendental functions as well in their GPUs, they never put that into architectural diagrams because it doesn't really matter, no other GPU makers that I know of does actually.
You're just making stuff up as you go? From the AMD RDNA whitepaper:
"The dual compute unit includes new transcendental execution units to accelerate more complicated math operations that are used in both graphics and general computing. Each SIMD contains an 8-lane transcendental unit that can overlap execution with the main vector ALUs and will complete a wavefront in four clock cycles."
Here's a diagram of Intel Xe execution units containing units for "extended math (EM) instructions, such as exp, log, and rcp":

I guess we have an industry-wide conspiracy for introducing emulated virtual hardware units to handle transcendentals. All 3 PC vendors agreed to lie at the same time in the same way I guess :)
SFUs exist in the same sense CUDA cores exist: Nvidia has no concrete definition for what a "CUDA core" is. As far as I can tell, total CUDA cores = number of FP32 FMAD units. It's not a core by any definition, since it doesn't fetch and decode instructions by itself, yet they insist on calling it a "core". Same with these SFUs: they do not fetch and decode instructions either, the SM does, since that's what's actually generating the wavefronts, so what does that unit even do? Anyway, this has been going on for too long. If you think Nvidia has some special hardware block to compute sin and cos and whatnot, that's fine, whatever.
Yes, NVIDIA is actually lying in their architectural whitepapers, in every single one. They publish scientific papers on the implementation of "sin and cos and whatnot" and get them accepted to an IEEE journal just for fun.
Are you a troll? You provided no sources for your claims, just how you think things work.

Edit: Added Intel.
 
Joined
Aug 31, 2021
Messages
21 (0.02/day)
Processor Ryzen 9 5900X
Motherboard Gigabyte Aorus B550i pro ax
Cooling Noctua NH-D15 chromax.black (with original fans)
Memory G.skill 32GB 3200MHz
Video Card(s) 4060ti 16GB
Storage 1TB Samsung PM9A1, 256GB Toshiba pcie3 (from a laptop), 512GB crucial MX500, 2x 1TB Toshiba HDD 2.5
Display(s) Mateview GT 34''
Case Thermaltake the tower 100, 1 noctua NF-A14 ippc3000 on top, 2 x Arctic F14
Power Supply Seasonic focus-GX 750W
Software Windows 10, Ubuntu when needed.
That's literally how it's done, you cannot simply multiply two integers in the same way you can add them in binary, multiply operations take several cycles on all processors out there for this very reason. Just look it up, there is nothing crazy about it, same applies to division.
The fact that it takes more than one cycle is NOT because it is done that way. It is because the multiplier is internally pipelined so that the design can reach a higher clock frequency. Moreover, if you want, you can perform all those additions in hardware in a single cycle; that's how it is done in every multiplier architecture, whether array-like, based on Booth encoding, or any other hardware structure. No one nowadays performs multiplication by repeatedly accumulating a single result in software-like routines or microcode or whatever you are proposing; it would be too time-consuming!
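As a hedged illustration of the Booth encoding mentioned above, here is a small Python model of radix-4 Booth recoding. The function names and the 16-bit default width are assumptions for this sketch, and it only models the arithmetic: in a real multiplier the recoded partial products are generated and summed by combinational logic, not by a loop.

```python
def booth_radix4_digits(y, width=16):
    """Recode a signed `width`-bit multiplier into width/2 digits in {-2, -1, 0, 1, 2}."""
    assert width % 2 == 0
    assert -(1 << (width - 1)) <= y < (1 << (width - 1))
    bits = y & ((1 << width) - 1)  # two's-complement bit pattern of y
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        # Each overlapping 3-bit group (b1, b0, prev) maps to one signed digit.
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits

def booth_multiply(x, y, width=16):
    """Multiply x by y using one partial product per radix-4 Booth digit of y."""
    digits = booth_radix4_digits(y, width)
    # Digit i has weight 4**i, so its partial product is shifted left by 2*i bits.
    return sum((d * x) << (2 * i) for i, d in enumerate(digits))
```

The point of the recoding is that a 16-bit multiplier produces only 8 partial products instead of 16, each of which is 0, ±x or ±2x, so they are cheap to generate (a shift and an optional negation) before being summed.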
 