PSSR, like everything Sony, is fully proprietary, poorly documented to the public, and has apparently been somewhat poorly received so far. I don't believe it relies on dedicated ML hardware, since the PS5 Pro's graphics are still based on RDNA 2, which lacks that capability. Unless there's a semi-custom solution, but I don't believe that's the case.
It's ML-based upscaling as well, but it doesn't rely on any dedicated extra hardware. RDNA 3.5 (which the PS5 Pro more or less uses) adds some instructions for processing lower-precision matrix multiplication (matmul); you can read more about it here:
chipsandcheese.com: "Integrated graphics have been a key part of AMD's strategy ever since they bought ATI."
With that extra instruction support, it should be able to run an upscaling CNN without much trouble and with no need for dedicated hardware (beyond what's already in the GPU itself).
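To make that concrete, here's a minimal sketch of the kind of small upscaling CNN a GPU could run per frame. It's written in PyTorch, and the architecture, layer sizes, and names are purely illustrative (PSSR/FSR4 are proprietary, so this is just the general idea). The point is that the convolutions boil down to large low-precision matrix multiplies, which is exactly the pattern those RDNA 3.5 instructions accelerate:

```python
import torch
import torch.nn as nn

# Illustrative ESPCN-style 2x upscaler. The real PSSR/FSR4 networks are
# proprietary, so everything here (layers, channel counts) is a guess.
class TinyUpscaler(nn.Module):
    def __init__(self, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            # Produce scale^2 * 3 channels, then rearrange them into pixels.
            nn.Conv2d(32, 3 * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # (B, 3*s^2, H, W) -> (B, 3, H*s, W*s)
        )

    def forward(self, x):
        return self.body(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
# fp16 is the low-precision path the matmul instructions target; fall back
# to fp32 on CPU where half-precision conv support is spotty.
dtype = torch.float16 if device == "cuda" else torch.float32

model = TinyUpscaler().to(device, dtype).eval()
frame = torch.randn(1, 3, 540, 960, device=device, dtype=dtype)  # a 960x540 frame
with torch.no_grad():
    upscaled = model(frame)
print(upscaled.shape)  # -> torch.Size([1, 3, 1080, 1920])
```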
Got you... But what I'm wondering is how AMD/Sony can (allegedly) do this in FSR4 without some "supercomputer" doing the work for them to upscale the image with minimal artifacts?
Sony's PSSR did do something similar to what Nvidia has done with DLSS: they trained a model with tons of compute over a long period of time, and that model is then used by the actual consoles. They just didn't announce it the way Nvidia did now. And if Nvidia had never given this detail away, you wouldn't be making this complaint.
FSR4, if it's truly ML-based, will likewise require lots of compute time beforehand to create a model that can then perform this task on your local GPU.
Let me try to give you a better example: you know that feature in your phone's gallery that can recognize people or places?
That's a machine learning model running on your phone and tagging those images behind the scenes.
That "model" (think of a "model" as a binary or dll that contains the "runtime" of the AI stuff) has been trained by google/samsung/apple in their servers for long hours with tons of examples saying "this picture is a dog", "this is a car", "this is a beach", "this person X is different from person Y", etc etc. This part is the "training" part, which is really compute intensive and takes really long time. As an example, the GPT model behind ChatGPT took around 5~6 months to train.
The resulting model is then shipped to your phone, where it applies what it has learned to your cat pictures and tells you it's a cat. This part is called "inference", and it's often really fast. Think of how DLSS, even in its first version, was able to upscale a frame from a lower resolution to a higher one at high FPS (so each frame was upscaled in less than 10 ms!). In the same way, think of how your phone tags a pic as a "dog" really quickly, or how ChatGPT gives you answers reasonably fast, even though the training part for all of those tasks took weeks, months, or even years.
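And the inference side, the part that actually runs on your phone (or, for DLSS, on your GPU every frame), is just one cheap forward pass through the already-trained weights. Continuing the toy example above (it assumes the "tagger.pt" file saved by the training sketch):

```python
import time
import torch
import torch.nn as nn

# Same toy architecture as in the training sketch above.
NUM_CLASSES = 4
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, NUM_CLASSES),
)
model.load_state_dict(torch.load("tagger.pt"))  # weights from the slow training phase
model.eval()

photo = torch.randn(1, 3, 64, 64)  # one new "cat picture"
with torch.no_grad():              # no gradients: inference only
    start = time.perf_counter()
    scores = model(photo)
    elapsed_ms = (time.perf_counter() - start) * 1000

tag = scores.argmax(dim=1).item()
print(f"predicted class {tag} in {elapsed_ms:.2f} ms")  # typically a few ms
```

Months of training, milliseconds of inference: that asymmetry is the whole trick behind DLSS, PSSR, and (presumably) FSR4.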