Friday, April 5th 2024

X-Silicon Startup Wants to Combine RISC-V CPU, GPU, and NPU in a Single Processor

While we are all used to systems with a separate CPU, GPU, and, more recently, NPU, X-Silicon Inc. (XSi), a startup founded by Silicon Valley veterans, has unveiled a RISC-V processor that can handle CPU, GPU, and NPU workloads simultaneously on a single chip. The chip architecture, which will be open-source, aims to provide a flexible and efficient solution for a wide range of applications, including artificial intelligence, virtual reality, automotive systems, and IoT devices. The new microprocessor combines a RISC-V CPU core with vector capabilities and GPU acceleration, creating a versatile all-in-one processor. By integrating the functionality of a CPU and GPU into a single core, X-Silicon says its design offers several advantages over traditional architectures. The chip uses the open-source RISC-V instruction set architecture (ISA) for both CPU and GPU operations, running a single instruction stream. This approach promises lower memory footprint execution and improved efficiency, as there is no need to copy data between separate CPU and GPU memory spaces.

X-Silicon's design, called the C-GPU architecture, uses a RISC-V Vector Core, which has 16 32-bit FPUs and a scalar ALU for processing regular integer as well as floating-point instructions. A unified instruction decoder feeds the cores, which are connected to a thread scheduler, texture unit, rasterizer, clipping engine, neural engine, and pixel processors. Everything is fed into a frame buffer, which feeds the video engine for video output. The setup of the cores allows users to program each core individually for HPC, AI, video, or graphics workloads. Without software, there is no usable chip, which is why X-Silicon is working on support for OpenGL ES, Vulkan, Mesa, and OpenCL. Additionally, the company plans to release a hardware abstraction layer (HAL) for direct chip programming. According to Jon Peddie Research (JPR), the industry has been seeking an open-standard GPU that is flexible and scalable enough to support various markets. X-Silicon's CPU/GPU hybrid chip aims to address this need by providing manufacturers with a single, open chip design that can handle any desired workload. XSi gave no timeline, but it plans to distribute the IP to OEMs and hyperscalers, so first silicon is still some way off.
Sources: Jon Peddie Research, X-Silicon, via Tom's Hardware

31 Comments on X-Silicon Startup Wants to Combine RISC-V CPU, GPU, and NPU in a Single Processor

#1
Denver
Uh, Transcendental math unit ?
#3
Aevum
I suppose that using the architecture that's most efficient for each unit would be an idea, but wouldn't the code translation between different CPU segments connected to the pipeline basically cripple any performance gains?

Sounds like a startup using buzzwords to get money.
#4
Vya Domus
AleksandarK said: en.wikipedia.org/wiki/Transcendental_number
I doubt that's related, could be some mistranslation. I cannot imagine what a "transcendental math unit" is, transcendental numbers have the ironic property of not being computable as they can only be approximated. So there is nothing a dedicated unit could achieve that you couldn't just do with whatever other generic piece of hardware.
#5
ncrs
Vya Domus said: I doubt that's related, could be some mistranslation. I cannot imagine what a "transcendental math unit" is, transcendental numbers have the ironic property of not being computable as they can only be approximated. So there is nothing a dedicated unit could achieve that you couldn't just do with whatever other generic piece of hardware.
It's not a mistranslation. Logarithm, exponent, trigonometric and hyperbolic functions are also transcendental. You're correct that they have to (generally) be approximated, but even then it's simply more efficient to have dedicated hardware to do it. Even the first external FPUs in the x86 world had dedicated hardware acceleration for some of them. You could do FP math on integer units, but it was very inefficient. The same principle applies here.
NVIDIA for example has been using Special Function Units (SFU) to handle them, located alongside the more general CUDA cores inside Streaming Multiprocessors (SM), since at least G80. In one form or another they have been present in every graphics-related acceleration implementation.
#6
AnotherReader
Vya Domus said: I doubt that's related, could be some mistranslation. I cannot imagine what a "transcendental math unit" is, transcendental numbers have the ironic property of not being computable as they can only be approximated. So there is nothing a dedicated unit could achieve that you couldn't just do with whatever other generic piece of hardware.
Approximations are good enough for many purposes and transcendental functions include sine, cosine, reciprocal, and square root. These are used in 3D graphics as well.
#7
Shihab
AleksandarK said: This approach promises lower memory footprint execution and improved efficiency, as there is no need to copy data between separate CPU and GPU memory spaces.
We already have that with UMA.

I don't really see a potential for this idea. Traditional SoCs have already tried minimizing the physical gap between general and graphics processors with mixed results. Sure you get simpler memory management and lower latency, but you replace it with lower power envelope and much more limited real estate. Merging all aspects into one makes those limitations even worse, plus you add a much more complex scheduler and a most likely humongous instruction set.

A jack of all trades is a master of none.
#8
Vya Domus
ncrs said: NVIDIA for example has been using Special Function Units (SFU) to handle them, located alongside the more general CUDA cores inside Streaming Multiprocessors (SM), since at least G80. In one form or another they have been present in every graphics-related acceleration implementation.
This is a misunderstanding, those "units" don't do anything special, it's just a bunch of instructions that implement a subroutine wrapped into a single machine code instruction so that the programmer doesn't have to do that on its own. PTX has an instruction to compute "sin" of something for example but there is obviously no such thing as a sin instruction in hardware, it's just an abstraction for a bunch of other instructions.
#9
ncrs
Vya Domus said: This is a misunderstanding, those "units" don't do anything special, it's just a bunch of instructions that implement a subroutine wrapped into a single machine code instruction so that the programmer doesn't have to do that on its own. PTX has an instruction to compute "sin" of something for example but there is obviously no such thing as a sin instruction in hardware, it's just an abstraction for a bunch of other instructions.
The CUDA C++ Programming Guide specifically mentions hardware SFUs for every Compute Capability up to the latest 9.0. It's because they are still used for approximations of transcendentals, sometimes with additional general CUDA cores being involved. The implementations fluctuated from version to version, but those hardware units are still faster than general approaches.

Edit: NVIDIA published a paper on SFU's design in 2005, you can find it on IEEE Xplore if you have access or on Anna's Archive if you don't via DOI:10.1109/ARITH.2005.7
While it's an older paper it explains why a specialized hardware implementation was beneficial vs. a general approach.
#10
Vya Domus
ncrs said: The CUDA C++ Programming Guide specifically mentions hardware SFUs for every Compute Capability up to the latest 9.0. It's because they are still used for approximations of transcendentals, sometimes with additional general CUDA cores being involved.
You don't understand what I am saying, this is similar to integer multiplication, there is no integer multiplication in hardware, it's just a subroutine that adds integers repeatedly, most early processors didn't have a mul instruction because it wasn't seen as necessary only later did most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop, it's said that the operation is implemented in hardware but it's just running other micro ops under the hood. Same thing applies for matrix or tensor ops.
ncrs said: The implementations fluctuated from version to version, but those hardware units are still faster than general approaches.
There is actually no way to know this since Nvidia doesn't publish their ISA, so no one has handwritten a subroutine for trigonometric functions so that we'd know for a fact that it's slower. Again, there is no secret sauce, if you look into the output of an x86 compiler you'll see that they don't even use those trigonometric instructions because you can write faster versions based on lookup tables which make no sense to implement in hardware and I have no reason to believe it's any different on GPUs.
#11
ncrs
Vya Domus said: You don't understand what I am saying, this is similar to integer multiplication, there is no integer multiplication in hardware, it's just a subroutine that adds integers repeatedly, most early processors didn't have a mul instruction because it wasn't seen as necessary only later did most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop, it's said that the operation is implemented in hardware but it's just running other micro ops under the hood. Same thing applies for matrix or tensor ops.

Vya Domus said: There is actually no way to know this since Nvidia doesn't publish their ISA, so no one has handwritten a subroutine for trigonometric functions so that we'd know for a fact that it's slower. Again, there is no secret sauce, if you look into the output of an x86 compiler you'll see that they don't even use those trigonometric instructions because you can write faster versions based on lookup tables which make no sense to implement in hardware and I have no reason to believe it's any different on GPUs.
I understood what you meant.
Did you even look at the paper? It explains their implementation which is using lookup tables in ROM because doing it "by hand" every time is wasteful.
Anyway we know enough about NVIDIA SASS to know that instructions containing the MUFU. prefix are handled by the SFUs.
You can take a simple example:

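#include <stdio.h>

#define ITER 1024

// Chain ITER dependent rsqrtf calls on a runtime input.
__global__ void do_rsqrtf(float in) {
    for (int i = 0; i < ITER; i++)
        in = rsqrtf(in);
    // Not expected to fire for in = 1.0f; it only keeps the result (and the loop) alive.
    if (!in) printf("%f\n", in);
}

int main() {
    do_rsqrtf<<<256*ITER, 256>>>(1.0f);
    // Wait for the kernel to finish before the process exits.
    cudaDeviceSynchronize();
    return 0;
}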

Compile it with
nvcc -arch=sm_86 -use_fast_math
and you'll get:

PTX by cuobjdump -ptx:
[...]
rsqrt.approx.ftz.f32 %f4, %f67;
rsqrt.approx.ftz.f32 %f5, %f4;
rsqrt.approx.ftz.f32 %f6, %f5;
[...]

and SASS by cuobjdump -sass:
[...]
/*0070*/ MUFU.RSQ R3, R2 ; /* 0x0000000200037308 */
/* 0x001e300000001400 */
/*0080*/ MUFU.RSQ R3, R3 ; /* 0x0000000300037308 */
/* 0x001e300000001400 */
/*0090*/ MUFU.RSQ R4, R3 ; /* 0x0000000300047308 */
/* 0x001e300000001400 */
/*00a0*/ MUFU.RSQ R4, R4 ; /* 0x0000000400047308 */

[...]

So that's using SFUs directly. Remove the -use_fast_math and you'll get a hybrid version:

[...]
/*0080*/ @!P0 FMUL R5, R5, 16777216 ; /* 0x4b80000005058820 */
/* 0x000fc80000400000 */
/*0090*/ MUFU.RSQ R0, R5 ; /* 0x0000000500007308 */
/* 0x000e240000001400 */
/*00a0*/ @!P0 FMUL R0, R0, 4096 ; /* 0x4580000000008820 */
/* 0x001fca0000400000 */
/*00b0*/ FSETP.GEU.AND P0, PT, |R0|, 1.175494350822287508e-38, PT ; /* 0x008000000000780b */
/* 0x000fda0003f0e200 */
/*00c0*/ @!P0 FMUL R0, R0, 16777216 ; /* 0x4b80000000008820 */
/* 0x000fc80000400000 */
/*00d0*/ MUFU.RSQ R3, R0 ; /* 0x0000000000037308 */
/* 0x000e240000001400 */
/*00e0*/ @!P0 FMUL R3, R3, 4096 ; /* 0x4580000003038820 */
/* 0x001fca0000400000 */
/*00f0*/ FSETP.GEU.AND P0, PT, |R3|, 1.175494350822287508e-38, PT ; /* 0x008000000300780b */
[...]
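
The FSETP / FMUL / MUFU.RSQ / FMUL pattern in the hybrid listing can be read as range scaling around the hardware approximation: inputs below the smallest normal float (1.175494e-38) are multiplied by 2^24 before the reciprocal square root and the result is multiplied by 2^12 afterwards, since 1/sqrt(x * 2^24) = (1/sqrt(x)) * 2^-12. The host-side C++ sketch below only illustrates that identity. The function names are hypothetical, and the flush-to-zero behaviour of the stand-in is an assumption made for illustration, not a statement about how MUFU.RSQ is implemented.

```cpp
#include <cfloat>
#include <cmath>

// Hypothetical stand-in for the hardware approximation (MUFU.RSQ above).
// Assumption: subnormal inputs are flushed to zero, as in the .ftz PTX variant.
inline float approx_rsqrt_ftz(float x) {
    if (std::fabs(x) < FLT_MIN) x = 0.0f;   // flush subnormals to zero
    return 1.0f / std::sqrt(x);             // +inf for x == 0
}

// Wrapper mirroring the predicated FMUL instructions around MUFU.RSQ.
inline float rsqrt_with_subnormals(float x) {
    const bool tiny = std::fabs(x) < FLT_MIN;   // the FSETP comparison
    if (tiny) x *= 16777216.0f;                 // scale input up by 2^24
    float r = approx_rsqrt_ftz(x);
    if (tiny) r *= 4096.0f;                     // scale result by 2^12 to compensate
    return r;
}
```

Calling rsqrt_with_subnormals on a subnormal such as 1e-40f returns a finite value close to 1e20, whereas the stand-in alone returns infinity for the same input.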
#12
GuiltySpark
Vya Domus said: most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop
What the hell are you saying? There are HW units to do multiplications and I don't know ANY microprocessor (even little microcontrollers) which doesn't include them. And there are plenty of open-source microprocessors around, have a look at those if you don't believe me. Saying there are internal micro-operations to perform multiplications is crazy.
#13
JAKra
This is what AMD wanted to achieve with project "Fusion" shortly after buying ATi back in 2006:
images.anandtech.com/reviews/cpu/amd/roadahead/evolving2.jpg
I still remember the image above. :)

Step3:
"The final step in the evolution of Fusion is where the CPU and GPU are truly integrated, and the GPU is accessed by user mode instructions just like the CPU. You can expect to talk to the GPU via extensions to the x86 ISA, and the GPU will have its own register file (much like FP and integer units each have their own register files)."

Link to this article:
www.anandtech.com/show/2229/3
#14
Vya Domus
GuiltySpark said: Saying there are internal micro-operations to perform multiplications is crazy.
That's literally how it's done, you cannot simply multiply two integers in the same way you can add them in binary, multiply operations take several cycles on all processors out there for this very reason. Just look it up, there is nothing crazy about it, same applies to division.

The processor inside the C64 famously did not have a multiply instruction. You may find it odd but most processors today simply emulate instructions, it's not done for speed, back then they didn't do it because it simply took too much space.
ncrs said: Anyway we know enough about NVIDIA SASS to know that instructions containing the MUFU. prefix are handled by the SFUs.
This doesn't really say anything other than what is already known, SASS actually gets compiled into micro ops, what those micro ops are no one knows. Anyway all of this is still in line with what I am saying, these instructions are just emulated, whether you do it at micro op level or in assembly it should make no difference. Why do you think that if you write modern C code using transcendental functions the compiler won't output any special instructions ? Intel and AMD are just stupid ? They don't know how to make dedicated really fast hardware for transcendental ops ? No, it just doesn't matter.

GPU manufacturers are kind of forced to add these at the ISA level, because you can't get portable code and they write the compilers themselves and every time they make a new GPU they usually change the ISA as well, so they have to provide these primitives, not for speed reasons but for ease of use. You can't just plop in some library that does all this when you program on a GPU like you do on a CPU.
#15
ncrs
Vya Domus said: This doesn't really say anything other than what is already known, SASS actually gets compiled into micro ops, what those micro ops are no one knows. Anyway all of this is still in line with what I am saying, these instructions are just emulated, whether you do it at micro op level or in assembly it should make no difference.
And what do you think runs those microops? The same applies to your argument with IMUL - even when they are split into different operations internally they still run only on dedicated hardware optimized for those operations. Execution ports in x86 CPUs aren't uniform, that's why so much of the core is dedicated to scheduling in the first place.
Vya Domus said: Why do you think that if you write modern C code using transcendental functions the compiler won't output any special instructions ? Intel and AMD are just stupid ? They don't know how to make dedicated really fast hardware for transcendental ops ? No, it just doesn't matter.
"C code using transcendental functions" - you mean calling a math library that has a platform-dependent implementation? Some do use direct CPU instructions, some do not, and for many different reasons.
Vya Domus said: GPU manufacturers are kind of forced to add these at the ISA level, because you can't get portable code and they write the compilers themselves and every time they make a new GPU they usually change the ISA as well, so they have to provide these primitives, not for speed reasons but for ease of use. You can't just plop in some library that does all this when you program on a GPU like you do on a CPU.
CUDA itself is the library you mention in the last sentence. That's why it's so powerful and popular. You can go lower with PTX, which is portable by the way, or attempt SASS analysis for the lowest available level. Then there's the microarchitectural level, which is not available to mere mortals, but that's no different from x86.

Why does every NVIDIA documentation piece, every architectural diagram of SMs mention SFUs if they are simply "emulated"? Why bother with specifying the number of units per SM for each generation?
You can even test it yourself by occupying all CUDA cores and then launching operations directly on SFUs.
#16
Vya Domus
ncrs said: The same applies to your argument with IMUL - even when they are split into different operations internally they still run only on dedicated hardware optimized for those operations. Execution ports in x86 CPUs aren't uniform, that's why so much of the core is dedicated to scheduling in the first place.
I don't know what you mean, the micro ops are always executed in the same way, they are the primitives the processor schedules, every micro op corresponds to a piece of hardware, there is no slower or faster hardware block that a micro op can run on.
ncrs said: Some do use direct CPU instructions, some do not, and for many different reasons.
I'd be amazed if you can find even one modern example. It still doesn't explain why you'll basically never see instructions like FSIN in the output of something like clang.
ncrs said: You can go lower with PTX, which is portable by the way, or attempt SASS analysis for the lowest available level. Then there's the microarchitectural level, which is not available to mere mortals, but that's no different from x86.
PTX is portable (most of the time), SASS is usually not and the binary is definitely not portable. It is different than x86 because with that everything down to the binary can be portable, this is also why you'd want to avoid using emulated or obscure instructions, if you look at the output of every compiler out there they pretty much always use the same few instructions even if there is a more comprehensive alternative that can achieve the same thing, it just doesn't matter, there is nothing to be gained.
ncrs said: every architectural diagram of SMs mention SFUs if they are simply "emulated"?
Because it's not relevant, in the same way that it's not relevant to elaborate on how a matrix or tensor op is actually achieved, those are partially or completely emulated as well.

By the way the reason they mention the number of those "SFUs" is because GPU cores are usually really limited in the number or combinations of instructions they can emit per clock, so it might be useful to know how that might impact performance.
#17
ncrs
Vya Domus said: I don't know what you mean, the micro ops are always executed in the same way, they are the primitives the processor schedules, every micro op corresponds to a piece of hardware, there is no slower or faster hardware block that a micro op can run on.
That's exactly my point. Transcendentals are in the end running only on SFUs (in the single-precision with -use_fast_math case) or with SFUs as the source (in the "normal" case where rounding is done on CUDA cores) - those are specialized hardware units to approximate them.
Vya Domus said: I'd be amazed if you can find even one modern example. It still doesn't explain why you'll basically never see instructions like FSIN in the output of something like clang.
Modern example of what exactly? Different libraries using different CPU instructions for transcendentals? It's easy to look at libm implementations - there's plenty. The old x87 instructions aren't used because AMD64 basically deprecated them by making SSE2 mandatory.
Vya Domus said: PTX is portable (most of the time), SASS is usually not and the binary is definitely not portable. It is different than x86 because with that everything down to the binary can be portable, this is also why you'd want to avoid using emulated or obscure instructions, if you look at the output of every compiler out there they pretty much always use the same few instructions even if there is a more comprehensive alternative that can achieve the same thing, it just doesn't matter, there is nothing to be gained.
x86 assembly isn't portable. For example, you can't run every Rocket Lake binary on Alder Lake, because Alder Lake lacks AVX-512 despite being a newer design. It can be portable when you take architectural differences into consideration, but that's no different from PTX and targeting a CUDA CC.
Vya Domus said: Because it's not relevant, in the same way that it's not relevant to elaborate on how a matrix or tensor op is actually achieved, those are partially or completely emulated as well.
Tensor ops are running on tensor cores. Ray tracing operations run on dedicated hardware too. Obviously they can be emulated by the driver on CUDA cores, but that's a different use case.
If the previous linked NVIDIA paper wasn't enough for you then here's another one proving that in fact SFUs exist, and can be used to improve performance and significantly improve energy efficiency of certain calculations. The authors even were able to use SFUs for double-precision, which is something NVIDIA themselves doesn't support in CUDA.
#18
Vya Domus
ncrs said: The authors even were able to use SFUs for double-precision, which is something NVIDIA themselves doesn't support in CUDA.
Can you guess why ? Could it be that it's actually all just implemented generically and that's why it works for multiple data types ? Just a thought. I bet the reason the number of SFUs is so much lower than the amount of CUDA cores is because they block a lot of ports in the SM when they run.
ncrs said: x86 assembly isn't portable.
x86 is definitely portable, those are SIMD extensions, you can take something written in the 90s and it should technically run just fine on a modern system, the same cannot be said for pretty much everything GPU related, that's why most (all, really) software compiles shaders/kernels at runtime.
ncrs said: here's another one proving that in fact SFUs exist
You are stuck in Nvidia jargon that doesn't mean much, use your common sense, if for instance a processor has a left logical shift instruction do you think it's worth putting that into its own little square on a diagram ? AMD has instructions for transcendental functions as well in their GPUs, they never put that into architectural diagrams because it doesn't really matter, no other GPU makers that I know of do actually.

SFUs exist in the same sense CUDA cores exist, Nvidia has no concrete definition for what a "CUDA core" is. As far as I can tell total CUDA cores = number of FP32 FMAD units, it's not a core by any definition, it doesn't fetch and decode instructions by itself, yet they insist to call it a "core". Same with these SFUs, they do not fetch and decode instructions either, the SM does since that's what's actually generating the wavefronts, so what does that unit even do ? Anyway this has been going on for too long, if you think Nvidia has some special hardware block to compute sin and cos and whatnot, that's fine, whatever.
#19
Wirko
The x86 SSE and AVX units/instruction sets actually can't do much arithmetic besides addition and multiplication (and FMA of course). SSE and SSE2 also have division, square root and reciprocals on 128-bit vectors, and that's all. So, for everything else, there must be a "transcendental math unit", however slow and unoptimised it may be. Intel uses the same term.
AnotherReader said: Approximations are good enough for many purposes
Approximations must be good enough for nearly all purposes. (Let's forget for a moment that we share this world with accountants.)
#20
Vya Domus
Wirko said: So, for everything else, there must be a "transcendental math unit", however slow and unoptimised it may be. Intel uses the same term.
Those are really old x87 instructions, nothing modern is gonna generate code using that ISA, for all intents and purposes there is no transcendental math unit being used on a modern CPU.

It's slow because it's just doing a series expansion, a bunch of x - x^3/3! +x^5/5!... to calculate sin(x) for example, there is no special way to do that in hardware, it will just have to be a series of instructions to calculate that up to whatever term. You can try and use look up tables or use approximations that converge faster but it will still all come down to a bunch of adds and multiplies in series, it's unavoidable.
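
As an illustration of the two software approaches mentioned here, the following C++ sketch evaluates sin(x) once with a truncated series x - x^3/3! + x^5/5! - ... and once with a lookup table plus linear interpolation on [0, pi/2]. The function names, the table size of 256 entries, and the use of linear (rather than higher-order) interpolation are assumptions chosen for brevity. It is not a description of what any CPU or GPU does internally.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Truncated Taylor series: sums the first `terms` odd-power terms.
inline double sin_taylor(double x, int terms) {
    double term = x;   // current term, starts at x^1 / 1!
    double sum = 0.0;
    for (int k = 0; k < terms; ++k) {
        sum += term;
        // Next term: multiply by -x^2 / ((2k+2)(2k+3)).
        term *= -x * x / ((2.0 * k + 2.0) * (2.0 * k + 3.0));
    }
    return sum;
}

constexpr std::size_t kTableSize = 256;            // hypothetical table size
constexpr double kHalfPi = 1.57079632679489661923;

// Table of sin values at kTableSize + 1 evenly spaced points on [0, pi/2].
inline const std::array<double, kTableSize + 1>& sin_table() {
    static const std::array<double, kTableSize + 1> table = [] {
        std::array<double, kTableSize + 1> t{};
        for (std::size_t i = 0; i <= kTableSize; ++i)
            t[i] = std::sin(kHalfPi * static_cast<double>(i) / kTableSize);
        return t;
    }();
    return table;
}

// Table lookup with linear interpolation; valid for 0 <= x <= pi/2.
inline double sin_table_lerp(double x) {
    const double pos = x / kHalfPi * kTableSize;
    std::size_t i = static_cast<std::size_t>(pos);
    if (i >= kTableSize) i = kTableSize - 1;        // keep i + 1 in range
    const double frac = pos - static_cast<double>(i);
    const auto& t = sin_table();
    return t[i] + frac * (t[i + 1] - t[i]);
}
```

The series needs one multiply-add chain per term, while the table version needs a single lookup and one interpolation step at the cost of storing the table, which is the trade-off being argued about in this thread.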
#21
Wirko
Vya Domus said: Those are really old x87 instructions, nothing modern is gonna generate code using that ISA, for all intents and purposes there is no transcendental math unit being used on a modern CPU.

It's slow because it's just doing a series expansion, a bunch of x - x^3/3! +x^5/5!... to calculate sin(x) for example, there is no special way to do that in hardware, it will just have to be a series of instructions to calculate that up to whatever term. You can try and use look up tables or use approximations that converge faster but it will still all come down to a bunch of adds and multiplies in series, it's unavoidable.
Yes. x87 transcendental functions are slow and accurate, and there are no "fast and dirty" variants with ~32-bit accuracy for example, which would often suffice.

However, at least the scalar integer and FP multiplications have been optimised to no end - and they execute in *one clock cycle* on recent CPUs at least since Skylake and Zen 1. Vector integer and double-precision FP multiplications seem to be equally fast, but I don't understand enough about data formats there. Agner Fog collected all the data you can think of.
#22
Minus Infinity
Vya Domus said: You don't understand what I am saying, this is similar to integer multiplication, there is no integer multiplication in hardware, it's just a subroutine that adds integers repeatedly, most early processors didn't have a mul instruction because it wasn't seen as necessary only later did most processors start to have mul instructions but they're still doing the same thing, just a bunch of micro ops that add integers repeatedly in a loop, it's said that the operation is implemented in hardware but it's just running other micro ops under the hood. Same thing applies for matrix or tensor ops.
CPUs have a binary multiplier that does not merely add integers repeatedly, which would be grossly inefficient. Most techniques involve computing the set of partial products, which are then summed together using binary adders. CPU multipliers only take ~3 cycles vs. 1 cycle for addition, IIRC.
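
The difference between repeated addition and partial products can be sketched in C++ as follows. Both function names are hypothetical. The second function sums the partial products one after another in a loop, whereas a hardware multiplier would typically sum them in parallel with an adder tree, so this is a model of the arithmetic only, not of the circuit.

```cpp
#include <cstdint>

// Repeated addition: the number of additions grows with the value of b.
inline std::uint64_t mul_repeated_add(std::uint32_t a, std::uint32_t b) {
    std::uint64_t acc = 0;
    for (std::uint32_t i = 0; i < b; ++i) acc += a;
    return acc;
}

// Partial products: one shifted copy of a for every set bit of b,
// so at most 32 additions regardless of the value of b.
inline std::uint64_t mul_partial_products(std::uint32_t a, std::uint32_t b) {
    std::uint64_t acc = 0;
    for (int bit = 0; bit < 32; ++bit) {
        if ((b >> bit) & 1u)
            acc += static_cast<std::uint64_t>(a) << bit;   // a * 2^bit
    }
    return acc;
}
```

For b = 7890 the first function performs 7890 additions, while the second performs one addition per set bit of b, which is why the number of partial products is bounded by the operand width rather than the operand value.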
#23
ncrs
Vya Domus said: Can you guess why ? Could it be that it's actually all just implemented generically and that's why it works for multiple data types ? Just a thought. I bet the reason the number of SFUs is so much lower than the amount of CUDA cores is because they block a lot of ports in the SM when they run.
I don't need to guess why since it's spelled out in the paper which you refused to even look at, again. And no, it's not what you think it is.
Vya Domus said: x86 is definitely portable, those are SIMD extensions, you can take something written in the 90s and it should technically run just fine on a modern system, the same cannot be said for pretty much everything GPU related, that's why most (all, really) software compiles shaders/kernels at runtime.
No, it wouldn't just run since recently a few obscure x86 instructions were re-purposed for mitigation use like VERW. Most would run, but not all.
Vya Domus said: You are stuck in Nvidia jargon that doesn't mean much, use your common sense, if for instance a processor has a left logical shift instruction do you think it's worth putting that into its own little square on a diagram ? AMD has instructions for transcendental functions as well in their GPUs, they never put that into architectural diagrams because it doesn't really matter, no other GPU makers that I know of do actually.
You're just making stuff up as you go? From the AMD RDNA whitepaper:
"The dual compute unit includes new transcendental execution units to accelerate more complicated math operations that are used in both graphics and general computing. Each SIMD contains an 8-lane transcendental unit that can overlap execution with the main vector ALUs and will complete a wavefront in four clock cycles."
Here's a diagram of Intel Xe execution units containing units for "extended math (EM) instructions, such as exp, log, and rcp":

I guess we have an industry-wide conspiracy for introducing emulated virtual hardware units to handle transcendentals. All 3 PC vendors agreed to lie at the same time in the same way I guess :)
Vya Domus said: SFUs exist in the same sense CUDA cores exist, Nvidia has no concrete definition for what a "CUDA core" is. As far as I can tell total CUDA cores = number of FP32 FMAD units, it's not a core by any definition, it doesn't fetch and decode instructions by itself, yet they insist to call it a "core". Same with these SFUs, they do not fetch and decode instructions either, the SM does since that's what's actually generating the wavefronts, so what does that unit even do ? Anyway this has been going on for too long, if you think Nvidia has some special hardware block to compute sin and cos and whatnot, that's fine, whatever.
Yes, NVIDIA is actually lying in their architectural whitepapers, in every single one. They publish scientific papers on the implementation of "sin and cos and whatnot" and get them accepted by the IEEE just for fun.
Are you a troll? You provided no sources for your claims, just how you think things work.

Edit: Added Intel.
#24
GuiltySpark
Vya Domus said: That's literally how it's done, you cannot simply multiply two integers in the same way you can add them in binary, multiply operations take several cycles on all processors out there for this very reason. Just look it up, there is nothing crazy about it, same applies to division.
The fact that it takes more than one cycle is NOT because it is done that way. It is because of the internal "pipelining" of a given architecture to achieve higher frequency. Moreover, if you want, you can perform all those additions in HW in a single cycle; that's how it is done in all multiplier architectures, whether array-like, based on Booth encoding, or any other HW structure. No one nowadays does multiplication by recursively adding a single result in SW-like routines or microcode or whatever you are proposing, it would be too time-consuming!
#25
_Flare
Such a claim is disrespecting everyone who ever was or is in that business, as if they were plain stupid.
Investor scam warning!
Build it, show it running a 3D-game with at least 60 FPS, then trigger a press release again.