Monday, October 18th 2021

Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

Apple today announced M1 Pro and M1 Max, the next breakthrough chips for the Mac. Scaling up M1's transformational architecture, M1 Pro offers amazing performance with industry-leading power efficiency, while M1 Max takes these capabilities to new heights. The CPU in M1 Pro and M1 Max delivers up to 70 percent faster CPU performance than M1, so tasks like compiling projects in Xcode are faster than ever. The GPU in M1 Pro is up to 2x faster than M1, while M1 Max is up to an astonishing 4x faster than M1, allowing pro users to fly through the most demanding graphics workflows.

M1 Pro and M1 Max introduce a system-on-a-chip (SoC) architecture to pro systems for the first time. The chips feature fast unified memory, industry-leading performance per watt, and incredible power efficiency, along with increased memory bandwidth and capacity. M1 Pro offers up to 200 GB/s of memory bandwidth with support for up to 32 GB of unified memory. M1 Max delivers up to 400 GB/s of memory bandwidth—2x that of M1 Pro and nearly 6x that of M1—and support for up to 64 GB of unified memory. While the latest PC laptops top out at 16 GB of graphics memory, this huge amount of memory enables graphics-intensive workflows previously unimaginable on a notebook. The efficient architecture of M1 Pro and M1 Max means they deliver the same level of performance whether MacBook Pro is plugged in or using the battery. M1 Pro and M1 Max also feature enhanced media engines with dedicated ProRes accelerators specifically for pro video processing. M1 Pro and M1 Max are by far the most powerful chips Apple has ever built.

"M1 has transformed our most popular systems with incredible performance, custom technologies, and industry-leading power efficiency. No one has ever applied a system-on-a-chip design to a pro system until today with M1 Pro and M1 Max," said Johny Srouji, Apple's senior vice president of Hardware Technologies. "With massive gains in CPU and GPU performance, up to six times the memory bandwidth, a new media engine with ProRes accelerators, and other advanced technologies, M1 Pro and M1 Max take Apple silicon even further, and are unlike anything else in a pro notebook."

M1 Pro: A Whole New Level of Performance and Capability
Utilizing the industry-leading 5-nanometer process technology, M1 Pro packs in 33.7 billion transistors, more than 2x the amount in M1. A new 10-core CPU, including eight high-performance cores and two high-efficiency cores, is up to 70 percent faster than M1, resulting in unbelievable pro CPU performance. Compared with the latest 8-core PC laptop chip, M1 Pro delivers up to 1.7x more CPU performance at the same power level and achieves the PC chip's peak performance using up to 70 percent less power. Even the most demanding tasks, like high-resolution photo editing, are handled with ease by M1 Pro.
M1 Pro has an up-to-16-core GPU that is up to 2x faster than M1 and up to 7x faster than the integrated graphics on the latest 8-core PC laptop chip. Compared to a powerful discrete GPU for PC notebooks, M1 Pro delivers more performance while using up to 70 percent less power. And M1 Pro can be configured with up to 32 GB of fast unified memory, with up to 200 GB/s of memory bandwidth, enabling creatives like 3D artists and game developers to do more on the go than ever before.
M1 Max: The World's Most Powerful Chip for a Pro Notebook
M1 Max features the same powerful 10-core CPU as M1 Pro and adds a massive 32-core GPU for up to 4x faster graphics performance than M1. With 57 billion transistors—70 percent more than M1 Pro and 3.5x more than M1—M1 Max is the largest chip Apple has ever built. In addition, the GPU delivers performance comparable to a high-end GPU in a compact pro PC laptop while consuming up to 40 percent less power, and performance similar to that of the highest-end GPU in the largest PC laptops while using up to 100 watts less power. This means less heat is generated, fans run quietly and less often, and battery life is amazing in the new MacBook Pro. M1 Max transforms graphics-intensive workflows, including up to 13x faster complex timeline rendering in Final Cut Pro compared to the previous-generation 13-inch MacBook Pro.
M1 Max also offers a higher-bandwidth on-chip fabric, and doubles the memory interface compared with M1 Pro for up to 400 GB/s, or nearly 6x the memory bandwidth of M1. This allows M1 Max to be configured with up to 64 GB of fast unified memory. With its unparalleled performance, M1 Max is the most powerful chip ever built for a pro notebook.

Fast, Efficient Media Engine, Now with ProRes
M1 Pro and M1 Max include an Apple-designed media engine that accelerates video processing while maximizing battery life. M1 Pro also includes dedicated acceleration for the ProRes professional video codec, allowing playback of multiple streams of high-quality 4K and 8K ProRes video while using very little power. M1 Max goes even further, delivering up to 2x faster video encoding than M1 Pro, and features two ProRes accelerators. With M1 Max, the new MacBook Pro can transcode ProRes video in Compressor up to a remarkable 10x faster compared with the previous-generation 16-inch MacBook Pro.
Advanced Technologies for a Complete Pro System
Both M1 Pro and M1 Max are loaded with advanced custom technologies that help push pro workflows to the next level:
  • A 16-core Neural Engine for on-device machine learning acceleration and improved camera performance.
  • A new display engine drives multiple external displays.
  • Additional integrated Thunderbolt 4 controllers provide even more I/O bandwidth.
  • Apple's custom image signal processor, along with the Neural Engine, uses computational video to enhance image quality for sharper video and more natural-looking skin tones on the built-in camera.
  • Best-in-class security, including Apple's latest Secure Enclave, hardware-verified secure boot, and runtime anti-exploitation technologies.
A Huge Step in the Transition to Apple Silicon
The Mac is now one year into its two-year transition to Apple silicon, and M1 Pro and M1 Max represent another huge step forward. These are the most powerful and capable chips Apple has ever created, and together with M1, they form a family of chips that lead the industry in performance, custom technologies, and power efficiency.
macOS and Apps Unleash the Capabilities of M1 Pro and M1 Max
macOS Monterey is engineered to unleash the power of M1 Pro and M1 Max, delivering breakthrough performance, phenomenal pro capabilities, and incredible battery life. By designing Monterey for Apple silicon, the Mac wakes instantly from sleep, and the entire system is fast and incredibly responsive. Developer technologies like Metal let apps take full advantage of the new chips, and optimizations in Core ML utilize the powerful Neural Engine so machine learning models can run even faster. Pro app workload data is used to help optimize how macOS assigns multi-threaded tasks to the CPU cores for maximum performance, and advanced power management features intelligently allocate tasks between the performance and efficiency cores for both incredible speed and battery life.

The combination of macOS with M1, M1 Pro, or M1 Max also delivers industry-leading security protections, including hardware-verified secure boot, runtime anti-exploitation technologies, and fast, in-line encryption for files. All of Apple's Mac apps are optimized for—and run natively on—Apple silicon, and there are over 10,000 Universal apps and plug-ins available. Existing Mac apps that have not yet been updated to Universal will run seamlessly with Apple's Rosetta 2 technology, and users can also run iPhone and iPad apps directly on the Mac, opening a huge new universe of possibilities.
Apple's Commitment to the Environment
Today, Apple is carbon neutral for global corporate operations, and by 2030, plans to have net-zero climate impact across the entire business, which includes manufacturing supply chains and all product life cycles. This also means that every chip Apple creates, from design to manufacturing, will be 100 percent carbon neutral.
156 Comments on Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

#151
Aquinus
Resident Wat-man
TheoneandonlyMrK: I agree, IPC should not be a term used in discussions of modern processors the way people use it, but there are benchmark programs out there that can measure average IPC with a modicum of logic to the end result, like Pi
Comparable to another chip on the same ISA only.
It's not something that translates to a useful performance metric of a chip or core anymore.
Consider it another way. The maximum theoretical throughput of a given CPU is typically known. What benchmarks help us determine, indirectly, is how much time the pipeline spends stalled or, worse, recovering from a branch misprediction. Those characteristics are what make CPUs different; otherwise the number of threads, cores, and clock speeds would tell you what you want to know. What benchmarks tell us is how much we can get done at whatever speed we're running with whatever resources we have.

Now, what in the world does this have to do with cache? Well, if the majority of pipeline stalls are due to memory access (and not branch misprediction), then larger caches are likely to limit those stalls, which means the CPU spends more time doing work and less time waiting on memory. This gets worse as the working set of data grows, which depends entirely on the application.
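You can see the working-set effect even in a crude sketch. This is purely illustrative (the array sizes are my guesses at roughly L1-, L2-, and DRAM-sized working sets, and Python interpreter overhead blunts the effect a C microbenchmark would show clearly), but the trend is the point: time per element rises once the sweep no longer fits in cache.

```python
import array
import time

def ns_per_element(n, passes=3):
    """Average time to touch one element while summing a working set
    of n 8-byte floats. Larger n eventually spills out of cache."""
    data = array.array("d", [0.0] * n)
    start = time.perf_counter()
    for _ in range(passes):
        sum(data)  # sequential sweep over the whole working set
    elapsed = time.perf_counter() - start
    return elapsed / (n * passes) * 1e9

# Working sets sized to land roughly in L1, L2, and DRAM territory.
for n in (4_096, 262_144, 4_194_304):
    print(f"{n * 8 // 1024:>6} KiB: {ns_per_element(n):.2f} ns/element")
```

Same instruction stream, same core, same clock; only the stall time changes.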

After reading the last several pages, I feel like people are basically saying, "It's all handwavy and vague, so trust none of it," which is amusing in a sad sort of way.
Posted on Reply
#152
Valantar
Aquinus: Consider it another way. The maximum theoretical throughput of a given CPU is typically known. What benchmarks help us determine, indirectly, is how much time the pipeline spends stalled or, worse, recovering from a branch misprediction. Those characteristics are what make CPUs different; otherwise the number of threads, cores, and clock speeds would tell you what you want to know. What benchmarks tell us is how much we can get done at whatever speed we're running with whatever resources we have.

Now, what in the world does this have to do with cache? Well, if the majority of pipeline stalls are due to memory access (and not branch misprediction), then larger caches are likely to limit those stalls, which means the CPU spends more time doing work and less time waiting on memory. This gets worse as the working set of data grows, which depends entirely on the application.

After reading the last several pages, I feel like people are basically saying, "It's all handwavy and vague, so trust none of it," which is amusing in a sad sort of way.
That's kind of how I see the term as well. Sure, many people use it as a generic term for "performance per clock", but even then that's reasonably accurate. (Please don't get me started on the people who think "IPC" means "performance".) Theoretical maximum IPC isn't relevant in real-world use cases, as processors, systems and applications are far too complex for this to be a 1:1 relation, so the point of calculating "IPC" for comparison is to see how well the system as a whole is able to run a specific application/collection of applications while accounting for clock speed. Is it strictly a measure of instructions? Obviously not - CPU-level instructions aren't generally visible to users, after all, and if you're using real-world applications for testing then you can't really know that unless you have access to the source code either. So it's all speculative to some degree. Which is perfectly fine.

This means that the only practically applicable and useful way of defining IPC in real-world use is clock-normalized performance in known application tests. These must of course be reasonably well written, and should ideally be representative of overall usage of the system. That last point is where it gets iffy, as this is extremely variable, and why for example SPEC is a poor representation of gaming workloads - the tests are just too different. Does that make SPEC testing any less useful? Not whatsoever. It just means you need a modicum of knowledge of how to interpret the results. Which is always the case anyhow.

This also obviously means there's an argument for having more (collections of) tests, as no benchmark collection will ever be wholly representative, and neither will any single score/average/geomean/whatever calculated from a collection of tests. But again, this is obvious, and is not a problem. Measured IPC should always be understood as having "averaged across a selection of tested applications" tacked on at the end. And that's perfectly fine. This is neither hand-wavy, vague, nor problematic, but a necessary consequence of PCs and their uses being complex.
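That "averaged across a selection of tested applications" step is usually a geometric mean of per-test ratios, which is why reviewers report a geomean rather than an arithmetic average. A quick sketch, with made-up ratios:

```python
import math

def geomean(ratios):
    """Geometric mean of per-test performance ratios. Unlike the
    arithmetic mean, a 2x win and a 0.5x loss cancel out exactly."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical clock-normalized ratios vs. a baseline chip (1.0 = parity).
suite = [1.32, 0.95, 1.10, 1.45, 0.88]
print(f"suite geomean: {geomean(suite):.2f}x")
```

The geomean is one summary number for the whole suite; the per-test spread (0.88x to 1.45x here) is exactly the variability the paragraph above is describing.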
Posted on Reply
#153
ValenOne
Valantar: ReBAR doesn't have anything to do with this - it allows the CPU to write to the entire VRAM rather than smaller chunks, but the CPU still can't work off of VRAM - it needs copying to system RAM for the CPU to work on it. You're right that shared memory has its downsides, but with many times the bandwidth of any x86 CPU (and equal to many dGPUs) I doubt that will be a problem, especially considering Apple's penchant for massive caches.
1. It comes down to hardware configuration optimal for the use case.

On the PC, DirectX 12 has options to minimize copies of the game's resources.

www.slideshare.net/DICEStudio/framegraph-extensible-rendering-architecture-in-frostbite



2. On the subject of massive caches: Arm's code density is inferior to that of x86 and x86-64.
Posted on Reply
#154
Valantar
rvalencia: 1. It comes down to hardware configuration optimal for the use case.
Well, yes. That is kind of obvious, no? "The hardware best suited to the task is best suited to the task" is hardly a revelation.
rvalencia: On the PC, DirectX 12 has options to minimize copies of the game's resources.

www.slideshare.net/DICEStudio/framegraph-extensible-rendering-architecture-in-frostbite

That presentation has nothing to do with unified memory architectures, so I don't see why you bring it up. All it does is present the advantages in Frostbite of an aliasing memory layout, reducing the memory footprint of transient data. While this no doubt reduces copying, it has no bearing on whether or not these are unified memory layouts.
rvalencia: 2. On the subject of massive caches: Arm's code density is inferior to that of x86 and x86-64.
And? Are you saying it's sufficiently inferior to make up for a 3-6x cache size disadvantage? 'Cause Anandtech's benchmarks show otherwise.
Posted on Reply