Monday, March 3rd 2025

NVIDIA GeForce RTX 50 Series Faces Compute Performance Issues Due to Dropped 32-bit Support
PassMark Software has identified the root cause behind unexpectedly low compute performance in NVIDIA's new GeForce RTX 5090, RTX 5080, and RTX 5070 Ti GPUs. The culprit: NVIDIA has silently discontinued support for 32-bit OpenCL and CUDA in its "Blackwell" architecture, causing compatibility issues with existing benchmarking tools and applications. The issue manifested when PassMark's DirectCompute benchmark returned the error code "CL_OUT_OF_RESOURCES (-5)" on RTX 5000 series cards. After investigation, developers confirmed that while the benchmark's primary application has been 64-bit for years, several compute sub-benchmarks still utilize 32-bit code that previously functioned correctly on RTX 4000 and earlier GPUs. This architectural change wasn't clearly documented by NVIDIA, whose developer website continues to display 32-bit code samples and documentation despite the removal of actual support.
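For developers hitting the same wall, below is a minimal, hypothetical C sketch of the kind of up-front check a host application can make before assuming GPU compute is available: report the build width of the process and the raw OpenCL status code by name, since a 32-bit build is exactly the case "Blackwell" drivers no longer serve.
[code]
#include <stdio.h>
#include <CL/cl.h>

int main( void )
{
    cl_uint num_platforms = 0;
    cl_int  err = clGetPlatformIDs( 0, NULL, &num_platforms );

    /* In a 32-bit build ( sizeof(void*) == 4 ) enumeration can already
       come back empty or fail on drivers that dropped 32-bit support. */
    printf( "Process width: %u-bit, platforms found: %u, status: %d\n",
            ( unsigned )( sizeof( void * ) * 8 ), num_platforms, err );

    if( err == CL_OUT_OF_RESOURCES )  /* -5, the code PassMark reported */
        printf( "CL_OUT_OF_RESOURCES: device resources unavailable to this process\n" );
    return 0;
}
[/code]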
The impact extends beyond benchmarking software. Applications built on legacy CUDA infrastructure, including technologies like PhysX, will experience significant performance degradation as computational tasks fall back to CPU processing rather than utilizing the GPU's parallel architecture. While this fallback mechanism allows older applications to run on the RTX 40 series and prior hardware, the RTX 5000 series handles these tasks exclusively through the CPU, resulting in substantially lower performance. PassMark is currently working to port the affected OpenCL code to 64-bit, allowing proper testing of the new GPUs' compute capabilities. However, they warn that many existing applications containing 32-bit OpenCL components may never function properly on RTX 5000 series cards without source code modifications. The benchmark developer also notes this change doesn't fully explain poor DirectX9 performance, suggesting additional architectural changes may affect legacy rendering pathways. PassMark updated its software today, but legacy benchmarks could still suffer. Below is an older benchmark run without the latest PassMark V11.1 build 1004 patches, showing just how much the newest generation suffers without proper software support.
Sources:
PassMark on X, via Tom's Hardware
74 Comments on NVIDIA GeForce RTX 50 Series Faces Compute Performance Issues Due to Dropped 32-bit Support
Until then, keep up the good fight.
It's also pretty much the only game in town on AMD, and they do have market share. So, not dead.
If you are asking me, I've toyed with it yes. Can't really discuss what I use it for (my job is quite sensitive these days).
Of course it could, and that is why any processing (benchmarking, in this case) needs to be done after a set of verifications.
In the OpenCL programming world that set of verifications needs to be completed during initialization. This is how it looks in my code:
[code]
...
iOk = OclGetDeviceInfo( clDeviceId[ uiCurrentDevice ], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof( CLuint ), ( CLvoid * )&ulPropValue, &uiRetValue );
OclPrintf2( OTU("\t\tCL_DEVICE_MAX_COMPUTE_UNITS : %12u\n"), ( CLuint )ulPropValue );
...
iOk = OclGetDeviceInfo( clDeviceId[ uiCurrentDevice ], CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                        sizeof( CLulong ), ( CLvoid * )&ulPropValue, &uiRetValue );
OclPrintf2( OTU("\t\tCL_DEVICE_MAX_MEM_ALLOC_SIZE: %12.0f bytes\n"), ( CLfloat )ulPropValue );
...
iOk = OclGetDeviceInfo( clDeviceId[ uiCurrentDevice ], CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof( CLulong ), ( CLvoid * )&ulPropValue, &uiRetValue );
OclPrintf2( OTU("\t\tCL_DEVICE_GLOBAL_MEM_SIZE   : %12.0f bytes\n"), ( CLfloat )ulPropValue );
...
iOk = OclGetDeviceInfo( clDeviceId[ uiCurrentDevice ], CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof( CLulong ), ( CLvoid * )&ulPropValue, &uiRetValue );
OclPrintf2( OTU("\t\tCL_DEVICE_LOCAL_MEM_SIZE    : %12.0f bytes\n"), ( CLfloat )ulPropValue );
...
[/code]
In my OpenCL code, as soon as these steps are completed, some run-time values are updated, and only after that does processing continue.
It is very important to pay attention to all memory-size-related values, because they are different for the 32-bit and 64-bit OpenCL drivers of the same OpenCL platform!
The 32-bit memory-related values are usually lower than the 64-bit values for the platform.
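For reference, the same checks look like this against the plain OpenCL C API (clGetDeviceInfo directly, no wrapper macros); CL_DEVICE_ADDRESS_BITS tells you whether the driver exposes a 32-bit or a 64-bit address space. The helper name here is just for illustration:
[code]
#include <stdio.h>
#include <CL/cl.h>

/* Print the values that differ between 32-bit and 64-bit OpenCL drivers. */
static void print_device_limits( cl_device_id dev )
{
    cl_uint  bits = 0;
    cl_ulong max_alloc = 0, global_mem = 0, local_mem = 0;

    clGetDeviceInfo( dev, CL_DEVICE_ADDRESS_BITS,       sizeof( bits ),       &bits,       NULL );
    clGetDeviceInfo( dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof( max_alloc ),  &max_alloc,  NULL );
    clGetDeviceInfo( dev, CL_DEVICE_GLOBAL_MEM_SIZE,    sizeof( global_mem ), &global_mem, NULL );
    clGetDeviceInfo( dev, CL_DEVICE_LOCAL_MEM_SIZE,     sizeof( local_mem ),  &local_mem,  NULL );

    printf( "CL_DEVICE_ADDRESS_BITS       : %u\n",   bits );
    printf( "CL_DEVICE_MAX_MEM_ALLOC_SIZE : %llu\n", ( unsigned long long )max_alloc );
    printf( "CL_DEVICE_GLOBAL_MEM_SIZE    : %llu\n", ( unsigned long long )global_mem );
    printf( "CL_DEVICE_LOCAL_MEM_SIZE     : %llu\n", ( unsigned long long )local_mem );
}
[/code]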
As I've already mentioned, OpenCL device initialization is a multi-step process, and memory is allocated only after all of these steps have completed successfully:
[code]
...
iOk = OclGetPlatformIDs( 0, RTnull, &uiNumOfPlatforms );
if( iOk != CL_SUCCESS )
    break;
if( uiNumOfPlatforms > _RTNUMBER_OF_PLATFORMS )
    break;
iOk = OclGetPlatformIDs( uiNumOfPlatforms, &clPlatformId[0], RTnull );
if( iOk != CL_SUCCESS )
    break;
iOk = OclGetPlatformInfo( clPlatformId[ iPlatformId ], CL_PLATFORM_NAME, 64, &g_szPlatformName[0], RTnull );
if( iOk != CL_SUCCESS )
    break;
OclPrintf2( OTU("\tPlatform Name : %s\n"), &g_szPlatformName[0] );
iOk = OclGetDeviceIDs( clPlatformId[ iPlatformId ], iDeviceType, 1, &clDeviceId, RTnull );
if( iOk != CL_SUCCESS )
{
    OclPrintf2( OTU("\tDevice of selected type is Not supported: %d\n"), iOk );
    break;
}
iOk = OclGetDeviceInfo( clDeviceId, CL_DEVICE_NAME, 64, &g_szDeviceName[0], RTnull );
if( iOk != CL_SUCCESS )
    break;
// Skip any leading spaces in the returned device name
RTint n = 0;
while( g_szDeviceName[n] == ' ' )
    n += 1;
OclPrintf2( OTU("\tDevice Name : %s\n"), &g_szDeviceName[n] );
clContext = OclCreateContext( RTnull, 1, &clDeviceId, RTnull, RTnull, &iOk );
if( iOk != CL_SUCCESS )
    break;
if( clContext == RTnull )
    break;
CLCommandQueueProperties clQueueProps = 0;
clQueueProps |= CL_QUEUE_PROFILING_ENABLE;
clCommandQueue = OclCreateCommandQueue( clContext, clDeviceId, clQueueProps, &iOk );
if( iOk != CL_SUCCESS )
    break;
if( clCommandQueue == RTnull )
    break;
clProgram = OclCreateProgramWithSource( clContext, 1, &szKernelFunction02I, RTnull, &iOk );
if( iOk != CL_SUCCESS )
    break;
if( clProgram == RTnull )
    break;
iOk = OclBuildProgram( clProgram, 1, &clDeviceId, RTnull, RTnull, RTnull );
if( iOk != CL_SUCCESS )
    break;
clKernel = OclCreateKernel( clProgram, "KernelMemSetI", &iOk );
if( iOk != CL_SUCCESS )
    break;
if( clKernel == RTnull )
    break;
if( iDataSetSize == 0 )
    break;
piDataSet1 = ( CLint * )CrtMalloc( iDataSetSize * sizeof( CLint ) );
if( piDataSet1 == RTnull )
    break;
// Zero-initialize the host-side data set ( note the [i] index )
for( i = 0; i < iDataSetSize; i += 1 )
    piDataSet1[i] = 0;
...
[/code]
I remember that the error CL_OUT_OF_RESOURCES ( -5 ) was always related to an attempt to allocate device memory exceeding the CL_DEVICE_MAX_MEM_ALLOC_SIZE, CL_DEVICE_GLOBAL_MEM_SIZE, or CL_DEVICE_LOCAL_MEM_SIZE values.
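A minimal sketch of the kind of guard I mean, using the plain OpenCL C API (the create_buffer_checked helper name and its error convention are just for illustration):
[code]
#include <stdio.h>
#include <CL/cl.h>

/* Hypothetical helper: create a buffer only after verifying the request
   against the device limits, refusing it up front instead of waiting for
   the driver to fail. Returning CL_OUT_OF_RESOURCES here mirrors the
   error code discussed in this thread. */
cl_mem create_buffer_checked( cl_context ctx, cl_device_id dev,
                              size_t bytes, cl_int *err )
{
    cl_ulong max_alloc = 0, global_mem = 0;

    clGetDeviceInfo( dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                     sizeof( max_alloc ), &max_alloc, NULL );
    clGetDeviceInfo( dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                     sizeof( global_mem ), &global_mem, NULL );

    if( ( cl_ulong )bytes > max_alloc || ( cl_ulong )bytes > global_mem )
    {
        fprintf( stderr, "Request of %zu bytes exceeds device limits "
                         "(max alloc %llu, global %llu)\n",
                 bytes, ( unsigned long long )max_alloc,
                 ( unsigned long long )global_mem );
        *err = CL_OUT_OF_RESOURCES;
        return NULL;
    }
    return clCreateBuffer( ctx, CL_MEM_READ_WRITE, bytes, NULL, err );
}
[/code]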
Not to mention people still play older games and use older software.
Really, it's Nvidia who thought this out poorly.
They cut it off based on hardware generation and seem not to have provided any warning that it was going to happen. The difference is that Windows still has the WoW64 subsystem; Nvidia needs to provide a translation layer, which you did mention at the end of your post.
Latest Nvidia certified driver
Latest Intel certified driver
What exactly DO you think people are developing with on AMD cards? Because if there's an alternative, maybe you could actually teach me something (it may be that, by virtue of me being on Gentoo Linux now, the open-source Mesa stack props it up, come to think of it).
But still, they recommend using HIP instead of OpenCL for ROCm.
Tech. Just when you think you've learned something cool, it's obsolete, lol.
The end of a code block is marked with [/code]; the start of the code block is the same tag, without the slash, for example [code].
I'm not sure that OpenCL is being used properly with all those break statements.
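(For what it's worth, those breaks look like the common do { ... } while(0) error-handling idiom in the C host code, not OpenCL C itself; a minimal sketch of the pattern, with stub steps standing in for the OclGetPlatformIDs()/OclCreateContext() calls above:)
[code]
#include <stdio.h>

/* Stub steps standing in for the real initialization calls. */
static int step_one( void ) { return 0; }
static int step_two( void ) { return 0; }

int main( void )
{
    int ok = -1;
    do
    {
        if( step_one() != 0 )  /* e.g. enumerate platforms */
            break;
        if( step_two() != 0 )  /* e.g. create the context  */
            break;
        ok = 0;                /* all steps succeeded      */
    } while( 0 );              /* every break lands here   */

    printf( ok == 0 ? "initialized\n" : "initialization failed\n" );
    return ok;
}
[/code]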
I hardly see any additional information in posting a huge wall of code without using the code blocks. Where is the additional information, without an explanation of what is being shown?
Finding the user- or AI-generated comments in such a wall of code is not easy, and grasping why it was posted in the first place is not easy even for me, with decent C knowledge. I think Nvidia is against broad support for CUDA elsewhere. I remember some software which could run CUDA on other hardware, and I think there were legal issues from NVIDIA; I think I read about it here in the past months.
Just as a starting point.
www.tomshardware.com/pc-components/gpus/amd-asks-developer-to-take-down-open-source-zluda-dev-vows-to-rebuild-his-project
When someone knows the details better, please fill in the details.
www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers
Note: A few minutes ago I read some more comments on my Android tablet. Those are now on the well-deserved ignore list. Just don't. Stick to the topic, please.
--
You buy an Nvidia graphics card and then you expect very old software to be supported? You can't be serious! A closed-source binary Windows blob.
>>...I'm not sure if that opencl language is proper used with all those /break statements....
Do Not go personal and do Not teach experienced Software Engineers how to implement some functionality.