Friday, June 1st 2018
An ARM to Rule Them All: ARM 76 To Challenge x86 Chips in the Laptop Space?
ARM has announced its next high-performance computing solution, the A76 design, which brings another large performance increase to the fledgling architecture. Having been touted for some time as a true contender to the aging x86 architecture, ARM has had a way of extracting impressive performance increases with each iteration of its computing designs, on the order of 20% to 40% on an almost annual basis. Compare that to the poster child of x86 computing, Intel, and its passivity-fueled 5 to 10% yearly performance increases, and the projections aren't that hard to grasp: at some point in time, ARM cores will surpass x86 in performance, at least in the mobility space.
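To make the compounding argument concrete, here is a minimal Python sketch of the projection. The yearly growth rates are the ranges quoted above; the starting gap of 2x in favor of x86 is purely an illustrative assumption, not a measured figure, and `years_to_overtake` is a hypothetical helper name.

```python
def years_to_overtake(x86_lead, arm_growth, x86_growth, max_years=100):
    """Count the years until ARM performance reaches x86 performance.

    x86_lead   -- assumed starting ratio of x86 to ARM performance (hypothetical)
    arm_growth -- ARM yearly improvement, e.g. 0.20 for 20%
    x86_growth -- x86 yearly improvement, e.g. 0.10 for 10%
    """
    arm, x86 = 1.0, x86_lead
    for year in range(1, max_years + 1):
        arm *= 1.0 + arm_growth
        x86 *= 1.0 + x86_growth
        if arm >= x86:
            return year
    return None  # no crossover within max_years


if __name__ == "__main__":
    # Least favorable case for ARM within the quoted ranges: 20% vs 10% per year.
    print(years_to_overtake(2.0, 0.20, 0.10))
    # Most favorable case for ARM within the quoted ranges: 40% vs 5% per year.
    print(years_to_overtake(2.0, 0.40, 0.05))
```

Under the assumed 2x starting gap, the crossover lands at 8 years in the least favorable case and 3 years in the most favorable one. A different starting gap shifts both numbers, but the crossover still happens as long as ARM's yearly rate stays above x86's.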
The new ARM A76 design, to be manufactured on the 7 nm process, brings about a 35% increase in performance compared to last year's A75. This comes with 40% better power efficiency (partly from the 10 nm to 7 nm transition, the rest from architectural efficiency and performance improvements), despite the increase to maximum 3.0 GHz clocks. ARM also says the new A76 will deliver 4x the Machine Learning performance of its previous A75 design.
Adding to those CPU performance improvements is ARM's Mali-G76 GPU solution, which also packs a 30% increase in performance density (meaning 30% more performance for the same silicon footprint), accompanied by 30% better energy efficiency and 2.7x the Machine Learning performance for GPU-accelerated workloads. The new GPU architecture supports up to three execution engines per shader core, features a dual texture mapper, offers a configurable 2 to 4 slices of L2 cache, and supports up to 20 "cores" in devices for process and workload distribution.
This combination of CPU (with the ARM A76) and GPU (with the Mali-G76) performance improvements means that ARM is now within spitting distance of x86 solutions in the mobile space. This, together with the future performance projections should ARM keep up its pace of development and performance improvement, may be one of the reasons why Microsoft invested the way it did in adding ARM support to its Windows operating system in recent times. ARM solutions that employ Microsoft's OS do provide better battery life than their x86 counterparts, and the latest A76 improvements, which are seemingly more significant than any x86 performance and efficiency increases in recent times, may well push x86 towards higher levels of required performance, leaving the entry productivity and content consumption scenarios to ARM-powered devices and architectures.
Sources:
AnandTech, Tom's Hardware
74 Comments on An ARM to Rule Them All: ARM 76 To Challenge x86 Chips in the Laptop Space?
I hate to break it to all you guys, but just because Windows is closed source doesn't mean it can't be rebuilt pretty easily for other machines. It can be. You just need... the source code! Which presumably, Microsoft has.
This isn't some *nix magic marvel miracle. It's called recompiling. Linux solved the flexibility equation by making their box open. Microsoft's is opaque and needs the men who have special privileges to recompile the coveted MS source code to do the work. That's all.
Firefox, for example, was compiled for 64-bit Linux as early as 2011. The first 64-bit build for Windows was in 2015. And that was just a user-space application, not an entire OS ;)
Not really. I was thinking more like Ubuntu (or something) machines. Set one up for your parents, put the Firefox icon on the desktop and you've covered a lot of their use cases ;)
32-bit -> 64-bit is really a different animal though, because the memory model and all structures related to it change. It is kind of tough to jump that way.
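A rough sketch of why the 32-bit to 64-bit jump touches data structures: the same C struct gets different field offsets and a different total size depending on the data model. The sketch below assumes natural alignment (each field aligned to its own size) and a made-up `NODE` struct; real compilers and ABIs can differ in the details. ILP32 is the usual 32-bit model, LP64 is what 64-bit Linux uses, and LLP64 is what 64-bit Windows uses.

```python
# Sizes in bytes of a few C types under three common data models.
DATA_MODELS = {
    "ILP32": {"char": 1, "int": 4, "long": 4, "pointer": 4},
    "LP64": {"char": 1, "int": 4, "long": 8, "pointer": 8},
    "LLP64": {"char": 1, "int": 4, "long": 4, "pointer": 8},
}


def layout(fields, model):
    """Return (field offsets, total struct size) for a list of (name, ctype).

    Assumes natural alignment: every field is aligned to its own size and
    the struct is padded to a multiple of its largest field alignment.
    """
    sizes = DATA_MODELS[model]
    offset, max_align, offsets = 0, 1, {}
    for name, ctype in fields:
        size = sizes[ctype]
        offset = (offset + size - 1) // size * size  # pad up to alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, size)
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total


# Hypothetical struct: struct node { int id; void *next; long length; char flag; };
NODE = [("id", "int"), ("next", "pointer"), ("length", "long"), ("flag", "char")]

if __name__ == "__main__":
    for model in DATA_MODELS:
        print(model, *layout(NODE, model))
```

With these assumptions the struct is 16 bytes under ILP32, 32 bytes under LP64 and 24 bytes under LLP64. Code that hard-codes any of those sizes or offsets, or stores a pointer in an `int`, breaks when recompiled for a different model.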
But you know what? Wild thought: let's wait and see the chips first. Because if ARM can't keep the perf/W advantage, we're beating a dead horse.
There are simple reasons for this:
1. What happens when there is no network to connect to? No network and you effectively have a brick that looks pretty.
2. What happens when an update to a cloud app fails in some way? You can't roll it back to a working version because it's in the cloud and not installed local to the device.
3. Not everyone trusts the cloud for many and various reasons.
There are more reasons why computing in the cloud is only useful to a select group of users, but I digress.
Commenting on an idea hinted at earlier, I think it would be interesting to see performance numbers for Windows 10 with the x86 runtimes. And various distros of Linux/BSD running on this new SOC? There are some devs that are going to have fun with this.
And has been. Back in the day, Windows NT was compiled to run on MIPS and ARM, with good (not great) support for legacy software. WinNT ran smooth as silk on SGI machines of the day. At the time it was mind-blowing to see Windows running on non-PC hardware.
ARM is evolving fast and heavy. What I would like to see though is what that evolution looks like in terms of transistors or die size.
By now, ARM and the whole mobile space are being driven to smaller and smaller processes. TSMC 10nm SOCs are quite widespread and 7nm is coming soon-ish. x86 and desktop are actually a bit behind.
Apple's A11 (6-core, should be one of the quickest mobile SOCs at the moment, if not the quickest) has about twice the transistors of a Skylake i7. Thanks to the 10nm process, its die is around 25% smaller.
Kirin 970, Exynos 9810 and Snapdragon 845 have at least the performance of upper-class desktop/laptop CPUs from 10 years ago, such as the C2D E7400 / T9600 / X2 5800+. I know that x86 and ARM architectures are two different things, but the smartphone chips have become very powerful lately. Their GPUs have reached the level of mid-range AMD and nVidia GPUs from 7 years ago, such as the GT 545 / HD 6570.
Considering it's an SOC, it's rather amazing how such a small chip without big air fans or water cooling can reach such performance with only a fraction of the power consumption.
- Kirin 970 is 8-core: 4*A73 @ 2.36GHz + 4*A53 @ 1.8GHz
- Exynos 9810 is 8-core: 4xMongoose3 @ 2.9GHz + 4xA55 @ 1.9GHz
- Snapdragon 845 is 8-core: 4*A75(custom) @ 2.8GHz + 4*A55(custom) @ 1.8GHz
As far as transistor count goes, these are all larger than current-generation Ryzen and probably also larger than 8-core Xeons. Very much up-to-date stuff.
- C2D E7400/T9600 are 2-core @ 2.8GHz on a 45nm process from 2008 (10 years ago)
- X2 5800+ is 2-core @ 3.0GHz on a 65nm process from 2006 (12 years ago)
Granted, both the old CPUs are slower than the 3 mobile SOCs but not as much as you would expect. Technology-wise, the gulf is huge. Over a decade of both process and architecture improvements.
Despite appearances, desktop has not been standing still and has been evolving (although more slowly than mobile).
C2D has 228 million transistors. A64 X2 has 221 million.
Kirin 970 has 5.5 billion. Exynos 9810 and Snapdragon 845 should not be that much smaller. At least in the range of A11 at 4.3 billion.
What mobile SOCs have that desktop chips do not is primarily the (proper) GPU and the modem.
Desktop GPUs:
- HD6570 - 480:24:8 @ 650MHz doing 624 GFLOPS
- GT545 - 144:24:16 @ 870/1740MHz doing 500 GFLOPS
Mobile GPUs:
- Kirin 970: Mali-G72 MP12 - 384:12:12 @ 746MHz doing 573 GFLOPs
- Exynos 9810: Mali-G72 MP18 - 576:18:18 @ 572MHz doing 659 GFLOPs
- Snapdragon 845: Adreno 630 - 256:24:16 @ 710MHz doing 727 GFLOPs
For comparison, let's add Intel's current main iGPU:
- HD/UHD 630 (GT2) - 192:16:8 @1200MHz doing 460 GFLOPs
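The GFLOPS figures in the two lists above can be sanity-checked with the usual back-of-the-envelope formula: shader count times FLOPs per shader per clock times clock speed. The Python sketch below assumes the first number of each triplet is the shader (ALU lane) count, that each lane does 2 FLOPs per clock (one fused multiply-add), and that the GT545 figure uses its 1740MHz shader clock. These are assumptions for illustration, not vendor-confirmed specs, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(shaders, clock_mhz, flops_per_clock=2):
    """Theoretical peak single-precision GFLOPS.

    flops_per_clock=2 assumes one fused multiply-add per lane per clock.
    """
    return shaders * flops_per_clock * clock_mhz / 1000.0


# (name, shader count, clock in MHz, GFLOPS figure quoted in the post)
GPUS = [
    ("HD6570", 480, 650, 624),
    ("GT545", 144, 1740, 500),
    ("Mali-G72 MP12", 384, 746, 573),
    ("Mali-G72 MP18", 576, 572, 659),
    ("Adreno 630", 256, 710, 727),
    ("HD/UHD 630 (GT2)", 192, 1200, 460),
]

if __name__ == "__main__":
    for name, shaders, clock_mhz, quoted in GPUS:
        computed = peak_gflops(shaders, clock_mhz)
        print(f"{name:18s} computed {computed:7.1f}  quoted {quoted}")
```

Under these assumptions five of the six quoted figures come out within about 1 GFLOPS of the formula. The Adreno 630 is the exception: 256 lanes at 710MHz gives roughly 364 GFLOPS, and the quoted 727 only matches if twice the lane count (or 4 FLOPs per lane per clock) is assumed. Which of the two numbers in the post is off is not something this arithmetic can settle.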
Edit:
As far as transistor counts go:
HD6570 is Turks, with 716 million transistors.
GT545 is a cut-down GF116 (25% of the chip disabled), with 1170 million transistors.
Yup. Intel and AMD architectures simply do not scale that low. Similarly, ARM or A11 does not scale much higher.
DSPs are large, modems should not be too large. There is quite a lot of extra stuff on the chip for sure. AI is a marketing term, probably for the GPU.
Looking at the specs above, all these GPUs (as well as Apple's in the A11) are roughly a billion transistors, if not less. The DSP is probably in the same range. I would expect the CPU cores to be roughly a billion transistors, maybe a bit more for 8-core SOCs.
Transistor counts are a tricky thing. Not that SOC manufacturers are eager to disclose details, but A11 is 4.3 billion, Kirin 970 is reportedly 5.5 billion, Snapdragon 845 is larger than 835 (3 billion), and Exynos 9810 is unknown. Transistors and die size are related, but not directly or linearly. Google gave this image for the relative die sizes of Snapdragon 845, A11, and Exynos 9810 (Kirin 970 is 96.72 mm²).
Edit2:
Apple had some images in their A11 presentation highlighting separate parts of the SOC.
Based on images from the presentation as well as the die picture linked above (where not all parts of the GPU and CPU are highlighted), and assuming transistor density is roughly equal across the A11 SOC (which it likely is), this is what the transistor counts of each area should be:
- CPU cores ~1 billion
- GPU ~0.85 billion
- ISP ~0.75 billion
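For a sense of scale, the estimates above can be turned into shares of the A11's 4.3 billion transistors. This is only a sketch of the arithmetic using the rough numbers from the post. Under the post's equal-density assumption the same shares would also be die-area shares. `block_shares` is a hypothetical helper name.

```python
A11_TOTAL = 4.3e9  # total transistor count quoted in the post

# Rough per-block estimates quoted in the post.
A11_BLOCKS = {
    "CPU cores": 1.0e9,
    "GPU": 0.85e9,
    "ISP": 0.75e9,
}


def block_shares(blocks, total):
    """Return each block's fraction of the total, plus the unaccounted rest."""
    shares = {name: count / total for name, count in blocks.items()}
    shares["everything else"] = (total - sum(blocks.values())) / total
    return shares


if __name__ == "__main__":
    for name, share in block_shares(A11_BLOCKS, A11_TOTAL).items():
        print(f"{name:16s} {share * 100:5.1f}%")
```

With these inputs the CPU cores come to about 23% of the chip, the GPU about 20% and the ISP about 17%, which leaves roughly 40% for caches, modem-related logic and everything else that was not highlighted.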
Look at LibreOffice, the de facto productivity suite on Linux. It only builds on x86 and x64; ARM builds require a shit ton of patches that are not part of the upstream source. It can't build on POWER properly without another shit ton of patches.
Or even better, look at the Chromium source code. Fully open, but it only builds for x64. Again, ARM builds require patches not carried by upstream. And it is not buildable at all on any other architecture such as POWER or MIPS.
For bonus points, I'm not sure why you think LibreOffice depends on the closed version of Java. OpenJDK has pretty much the same features (and has been compiled for platforms I wouldn't think were capable of running it).
Anyhow, when Windows NT was being developed, RISC was viewed as the new cool thing and x86 as "bloated" and "slow." As such, hedging their bets on both platforms, Microsoft built NT (upon which all modern Windows versions are based) with portability as a top priority. That legacy remains to this day.
But don't take my word for it. Google is your friend.
Sorry for the history lesson. But you are absolutely right about open vs closed source apps.
Yeah, good points on OO as well, have a thanks. I thought it depended on the closed source Java JIT compiler, but it seems I am wrong.