Thing is, if Nvidia, Intel, or AMD could focus all of the horsepower of their chips on basically one OS with a single API and whatnot, or not worry about instruction sets for certain tech, etc., they could or probably would be significantly more powerful, like Apple, even using x86. They develop their own things, be it FreeSync or G-Sync, TressFX, their versions of HDR, ray tracing, DLSS, etc. Plus support on the GPU side for DirectX 10, 11, 12 and Vulkan, etc.
This is rather inaccurate. While it's absolutely true that Apple has an inherent advantage in the API designer and the hardware maker being the same company, ultimately Nvidia and AMD don't have meaningfully more APIs to deal with or a much higher degree of complexity - they just have a much less direct and likely more rigid structure through which to influence the design of APIs to fit their hardware. But it's not like designing DX12U GPUs has been a challenge for either company. And backwards compatibility works exactly the same way - if Apple iterates Metal with new features that aren't supported by older hardware, developers will then need to account for that, just like they need to account for differences between DX10, 11 and 12 hardware.
Also, if macOS were based around "a single API" as you say ... that wouldn't help much. The API would still need the same breadth of features and functionality as whatever combination of APIs competitors deal with. Integrating those features into a single API doesn't necessarily simplify things all that much. What matters is how well things are designed to interface with each other and how easy they are to learn and program for. There is also nothing stopping AMD from designing a Metal implementation of TressFX, for example - APIs stack on top of each other.
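To make the "APIs stack on top of each other" point concrete, here's a minimal, purely conceptual sketch in Python - every class and method name here is hypothetical, not from any real SDK - of how a middleware effect library can target whichever low-level graphics API a platform provides by writing a thin backend for it:

```python
# Conceptual sketch only: names are hypothetical, not a real graphics SDK.
from abc import ABC, abstractmethod


class GraphicsBackend(ABC):
    """Thin wrapper over a platform's low-level graphics API."""

    @abstractmethod
    def create_compute_pipeline(self, shader_source: str) -> str: ...

    @abstractmethod
    def dispatch(self, pipeline: str, workgroups: int) -> None: ...


class MetalBackend(GraphicsBackend):
    def create_compute_pipeline(self, shader_source: str) -> str:
        # A real implementation would compile MSL and build a compute
        # pipeline state object; here we just return a label.
        return f"metal-pipeline({hash(shader_source) & 0xFFFF:04x})"

    def dispatch(self, pipeline: str, workgroups: int) -> None:
        print(f"[Metal] dispatching {workgroups} workgroups on {pipeline}")


class VulkanBackend(GraphicsBackend):
    def create_compute_pipeline(self, shader_source: str) -> str:
        return f"vulkan-pipeline({hash(shader_source) & 0xFFFF:04x})"

    def dispatch(self, pipeline: str, workgroups: int) -> None:
        print(f"[Vulkan] dispatching {workgroups} workgroups on {pipeline}")


class HairSimulation:
    """Middleware (think TressFX-style library) that only talks to the
    abstract backend - porting it to Metal means writing a backend,
    not rewriting the library."""

    SHADER = "// strand physics compute kernel (placeholder)"

    def __init__(self, backend: GraphicsBackend):
        self.backend = backend
        self.pipeline = backend.create_compute_pipeline(self.SHADER)

    def simulate_frame(self, strand_count: int) -> None:
        self.backend.dispatch(self.pipeline, workgroups=strand_count // 64)


if __name__ == "__main__":
    HairSimulation(MetalBackend()).simulate_frame(4096)
    HairSimulation(VulkanBackend()).simulate_frame(4096)
```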
Then for some reason the SSD storage in new Macs doesn't seem to have controllers or anything on it. Seems like Apple is just shortcutting things on the storage side of things, which seems kind of important if you care about your data.
Yeah, I'm well aware of this. If you've paid attention, you'll also have noticed that I have at no point extolled Apple's storage as anything special - in part because of this. Though saying "their SSDs don't seem to have controllers on them" is a misunderstanding - Apple has a very different storage hardware architecture than PCs, with the SSD controller integrated into the T2 system controller (which also handles security, encryption, etc.). In other words, the SSD controller doesn't live on the SSD. This is a
really weird choice IMO, and while I see its benefits (lower cost, less duplicate hardware in the system), I think it overall harms the system's performance.
APIs and different things matter. It is like when Doom came out and supported Vulkan. The FPS jumped up drastically, especially for AMD. That is why context matters. Hardware is made to run certain APIs, etc., and Apple shows their hardware in the best light. If you are going to do that you have to take synthetic benchmarks out and then run the hardware you are testing them against in their best-case scenarios.
Like here for instance is an Nvidia graphic:
[Attachment: Nvidia comparison graph]
Yes, it's obvious that APIs matter a lot, though your framing of this is a bit odd. It's not like AMD's hardware is inherently designed to be faster in Vulkan than DX12; it's just a quirk of their architecture compared to Nvidia's. It wouldn't make sense for them to make such a design choice consciously, as Vulkan is
far rarer than DX11/12, and isn't used at all in consoles, which is a massive focus for AMD GPU architectures. Still, it is indeed important to keep in mind that different architectures interface differently with different APIs and thus perform differently - and that any given piece of software, especially across platforms, can use many different APIs to do "the same" work. That Nvidia graph is indeed a good example, as part of the background behind that is that most of those apps are not Metal native - Blender Cycles, for example,
got its Metal render backend 11 days ago, while it has had support for highly accelerated workloads on Nvidia, first with CUDA and then the OptiX renderer, for several years.
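As a concrete illustration of how the same application ends up exercising completely different GPU stacks depending on platform, this is roughly how you point Cycles at a given backend from Blender's Python API (a sketch for Blender 3.1+, where the Metal backend landed; exact property names can vary between versions):

```python
# Run inside Blender's Python console, or as a script where bpy is available.
import sys
import bpy

prefs = bpy.context.preferences.addons["cycles"].preferences

# Pick the backend: Metal on Apple Silicon, OptiX (or CUDA) on Nvidia,
# HIP on recent AMD cards.
prefs.compute_device_type = "METAL" if sys.platform == "darwin" else "OPTIX"

prefs.get_devices()  # populate the device list for the chosen backend
for dev in prefs.devices:
    dev.use = dev.type != "CPU"  # enable the GPU devices
    print(dev.type, dev.name, "enabled" if dev.use else "disabled")

bpy.context.scene.cycles.device = "GPU"  # render this scene on the GPU
```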
All of this just illustrates the difficulty of doing reasonable cross-platform benchmarks, and the very real discussion of whether one should focus on real-world applications with all their specific quirks (the Nvidia graph above was true at the time, after all, with those applications), or whether the goal is to demonstrate the actual capabilities of the hardware on as level a playing field as possible. Platform-exclusive software makes this even more complicated, obviously.
Well, that is what R&D is going to have to work on. ARM was made to be low power - originally like under 5 watts. So Apple is pushing the bounds of ARM by going higher than ARM was originally intended. ARM was about low power and low heat. Now they are creating chips that push both of those well past what ARM was created to do. I'm sure as Apple adds instruction sets and whatnot, we won't really be able to say they are even using ARM-based chips as they morph into something else, but still, Apple is taking ARM into a world it hasn't been used in before, which means it is always going to be taking the first step into the next push of what ARM is doing or can do. Which means most of the R&D will be on their engineers, and 100 percent of the cost of it on them.
This is rather inaccurate. While early ARM designs were indeed focused only on low power, there has been a concerted effort for high-performance ARM chips for more than half a decade, including server and datacenter chips from a variety of vendors. Apple is by no means alone in this - though their R&D resources are clearly second to none. Also, AFAIK, Apple can't add instruction sets -
their chips are based on the ARMv8.4-A instruction set. And while I'm reasonably sure there are architectural reasons for the M1 not scaling past ~3.2 GHz, it's impossible from an end-user point of view to separate those from other hardware properties: the specific production node (Apple uses low-power, high-density mobile-oriented nodes for all its chips, while competitors like AMD and Intel use high-clocking nodes instead), and the fact that their chips are designed with extremely wide execution pipelines, far beyond any other ARM (or x86) design, which makes clocking them high potentially problematic. You can't unilaterally peg the performance ceiling of the M1 family on it being ARM-based.
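As a back-of-the-envelope illustration of the "wide-and-slow vs. narrow-and-fast" trade-off (the IPC and clock figures below are purely illustrative, not measured numbers):

```python
# Purely illustrative numbers: per-core throughput scales roughly with
# (instructions per clock) x (clock speed), ignoring memory effects.
def relative_throughput(ipc: float, clock_ghz: float) -> float:
    return ipc * clock_ghz

wide_and_slow = relative_throughput(ipc=7.0, clock_ghz=3.2)    # M1-style: very wide core, modest clocks
narrow_and_fast = relative_throughput(ipc=4.5, clock_ghz=5.0)  # typical x86-style: narrower core, high clocks

print(f"wide & slow  : {wide_and_slow:.1f}")
print(f"narrow & fast: {narrow_and_fast:.1f}")
# Similar end results, reached from opposite directions - and pushing the
# wide design to 5 GHz (or the narrow one to 8-wide decode) is exactly
# where the architectural and process trade-offs bite.
```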
As for the R&D and attributed costs:
Apple has been developing their own core and interconnect designs since the A6 in 2012. Their ARM licence is for fully custom designed cores, meaning they do not base their designs on ARM designs at all, and have not done so for a decade now.
The size of the chip is not what people think it is. For example, one stick of 16GB DDR5 has more transistors than the entire Apple M1 Ultra chip. You can't compare an SoC with cache and other things to a GPU or CPU transistor-count wise. We don't have accurate transistor counts, GPU versus GPU, in the SoC.
This is problematic on many, many levels. First off: all modern CPUs have cache. GPUs also have cache, though in very varying amounts. Bringing that up as a difference between an SoC and a CPU is ... meaningless. (As is the distinction between SoC and CPU today - all modern CPUs from major vendors are SoCs, just with differing featuresets.) As for one stick of DDR5 having more transistors, that may be true, but that stick also has 8 or more dice on board, and those consist nearly exclusively of a single type of high-density transistor made on a bespoke, specific-purpose node. Logic transistors are far more complex and take up much more space than memory transistors, and logic nodes are inherently less dense than memory and cache nodes (illustrated by how the cache die on the Ryzen 7 5800X3D fits 64MB of cache into the same area as the 32MB of cache on the CCD). If comparing transistor counts, the relevant comparisons must be of reasonably similar silicon, i.e. complex and advanced logic chips.
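For the curious, the back-of-the-envelope math behind the DDR5 comparison looks something like this (assuming the standard one-transistor-per-bit DRAM cell, and Apple's stated 114 billion figure for the M1 Ultra):

```python
# Back-of-the-envelope: a 16 GB DDR5 stick vs. the M1 Ultra by raw transistor count.
GiB = 2**30

ddr5_capacity_bits = 16 * GiB * 8        # 16 GB module = ~137 billion bits
dram_transistors = ddr5_capacity_bits    # 1T1C cell: one transistor per bit
                                         # (array only; ignores periphery/ECC logic)

m1_ultra_transistors = 114e9             # Apple's stated figure for the M1 Ultra

print(f"DDR5 stick (array only): ~{dram_transistors / 1e9:.0f} billion transistors")
print(f"M1 Ultra               : ~{m1_ultra_transistors / 1e9:.0f} billion transistors")
# The counts are indeed in the same ballpark - but one is a sea of identical,
# ultra-dense memory cells on a DRAM node, the other is complex logic, which
# is why the comparison says little about chip "size" or complexity.
```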
And no, we don't have feature-by-feature transistor counts. But that doesn't ultimately matter, as we can still combine the transistor counts of competing solutions for a reasonable approximate comparison - and Apple is
way above anything else even remotely comparable. This illustrates that they are achieving the combination of performance and efficiency they are by going all-in on a "wide-and-slow" design at every level (as Anandtech notes in their M1 coverage, the execution pipeline of the M1 architecture is also unprecedentedly wide for a consumer chip). This is clearly a conscious choice on their part, and one that can likely in part be attributed to their vertical integration - they don't need to consider per-chip pricing or profit margins (which for Intel and AMD tend to be ~40%), but can have a more holistic view, integrating R&D costs and production costs into the costs of the final hardware. It is far less feasible for a competitor to produce anything comparable just because of the sheer impossibility of getting any OEM to buy the chip to make a product with it.
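To put rough numbers on "combining the transistor counts of competing solutions" (all figures are the approximate publicly reported counts, so treat them strictly as ballpark):

```python
# Approximate publicly reported transistor counts, in billions - ballpark only.
pc_platform = {
    "Ryzen 9 5950X (2x Zen 3 CCD + IOD)": 2 * 4.15 + 2.09,
    "GeForce RTX 3090 (GA102)": 28.3,
}
m1_ultra = 114.0  # Apple's stated figure (includes GPU, NPU, media engines, huge caches)

pc_total = sum(pc_platform.values())
print(f"PC CPU + GPU total: ~{pc_total:.1f}B")
print(f"M1 Ultra          : ~{m1_ultra:.0f}B")
print(f"Ratio             : ~{m1_ultra / pc_total:.1f}x")
# Even before counting the PC's separate memory controllers and I/O, the
# M1 Ultra carries roughly 3x the logic budget of the combined CPU + GPU -
# the "wide and slow, spend transistors for efficiency" approach in numbers.
```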
The continued intellectual dishonesty in this thread is amazing. Any comparison to the M1 Ultra requires TWO PC chips, not one, from INTEL, AMD, and/or NVIDIA, not to mention the missing RAM chips.
I mean, I mostly agree with you, but ... no, you don't need that if you're talking about performance or efficiency. There's no logical or technical requirement that, because the M1 Ultra has two (joined) chips, any valid comparison must also involve two chips. What matters is how it performs in the real world. Heck, if that were the case, you'd need four M1 Ultras to compare to a 3rd-gen EPYC, as those have
eight CCDs onboard. What matters is comparing the relative performance and combined featuresets of the systems in question - i.e. their capabilities. How their makers arrived at those capabilities is an interesting part of the background for such a discussion (and a separate, interesting discussion can be had as to the reasoning and circumstances behind those choices and their pros and cons), but you can't mandate that any comparison
must match some arbitrary design feature of any one system (as the 8-CCD EPYC example illustrates).
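Put differently, the metrics actually being compared are agnostic to die count - something like this, with entirely made-up numbers, is the shape of such a comparison:

```python
# Entirely hypothetical numbers - the point is the metric, not the values.
def perf_per_watt(score: float, watts: float) -> float:
    return score / watts

systems = {
    "2-die SoC (M1 Ultra-style)":          {"score": 100.0, "watts": 110.0, "dies": 2},
    "1-chip CPU + 1-chip GPU workstation": {"score": 115.0, "watts": 450.0, "dies": 2},
    "8-CCD server CPU":                    {"score": 140.0, "watts": 280.0, "dies": 9},
}

for name, s in systems.items():
    ppw = perf_per_watt(s["score"], s["watts"])
    print(f"{name:38s} perf/W = {ppw:.2f}  (die count {s['dies']} is irrelevant to the metric)")
```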
To compare the PERFORMANCE, SIZE and COST of the M1 Ultra you need:
1) A CPU from INTEL or AMD
2) A GPU from AMD or NVIDIA
3) 128GB of RAM
So now go back to that photo and add in all of the missing components, a GPU and RAM. Love to see the total COST when that's done. Throw in a motherboard to socket the RAM too.
You're not wrong, but I think you're taking this argument in the wrong direction. First off, this has been acknowledged and discussed earlier in this thread, in detail. Secondly, it's valid to discuss the tradeoffs and choices made by different firms designing different chips - even if I think the "and [they're] still ending up with way bigger chips too" angle from the post you quoted is a bad take in desperate need of perspective and nuance.