# AMD "Zen 2" IPC 29 Percent Higher than "Zen"



## btarunr (Nov 12, 2018)

AMD reportedly put out IPC (instructions per clock) performance guidance for its upcoming "Zen 2" micro-architecture in a version of its Next Horizon investor presentation, and the numbers are staggering. The next-generation CPU architecture provides a massive 29 percent IPC uplift over the original "Zen" architecture. The stopgap "Zen+" architecture, which was not developed for the enterprise segment, brought 3-5 percent IPC uplifts over "Zen" on the back of faster on-die caches and improved Precision Boost algorithms. "Zen 2" is being developed for the 7 nm silicon fabrication process, and on the "Rome" MCM it is laid out as 8-core chiplets that are not subdivided into smaller CCXs (i.e., 8 cores per CCX). 

According to Expreview, AMD conducted a DKERN + RSA test of the integer and floating-point units to arrive at a performance index of 4.53, compared to 3.5 for first-generation Zen: a 29.4 percent IPC uplift (loosely interchangeable with single-core performance). "Zen 2" goes a step beyond "Zen+," with its designers turning their attention to the components that contribute most toward IPC: the core's front-end and the number-crunching machinery, the FPU. The front-ends of the "Zen" and "Zen+" cores are believed to be refinements of previous-generation architectures such as "Excavator." "Zen 2" gets a brand-new front-end that is better optimized to distribute and collect workloads between the various on-die components of the core. The number-crunching machinery is bolstered by 256-bit FPUs and generally wider execution pipelines and windows. Together, these yield the IPC uplift. "Zen 2" will get its first commercial outing with AMD's 2nd-generation EPYC "Rome" 64-core enterprise processors.
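As a sanity check, the 29.4 percent figure falls straight out of the two reported index values (a trivial sketch; the only inputs are the numbers quoted above):

```python
zen1_index = 3.5    # reported DKERN + RSA performance index, "Zen"
zen2_index = 4.53   # reported index, "Zen 2"

uplift = (zen2_index / zen1_index - 1) * 100
print(f"IPC uplift: {uplift:.1f}%")  # → IPC uplift: 29.4%
```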



 



*Update Nov 14*: AMD has issued the following statement regarding these claims.


> As we demonstrated at our Next Horizon event last week, our next-generation AMD EPYC server processor based on the new 'Zen 2' core delivers significant performance improvements as a result of both architectural advances and 7nm process technology. Some news media interpreted a 'Zen 2' comment in the press release footnotes to be a specific IPC uplift claim. The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications. We will provide additional details on 'Zen 2' IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.



*View at TechPowerUp Main Site*


----------



## Prima.Vera (Nov 12, 2018)

Bulldozer, Excavator, ... no thank you. No more hyping until the community benches are out.


----------



## londiste (Nov 12, 2018)

Well, that definitely is not IPC, at least not until we know the clocks. The headline is bullshit.

For the rest of it, what exact tests are those? Zen2 apparently gets proper AVX which will indeed boost certain workloads considerably.


----------



## Vayra86 (Nov 12, 2018)

Small disclaimer: *potentially* 29% higher than Zen, if nothing else gets in the way - which it always does.


----------



## R0H1T (Nov 12, 2018)

In the case of AVX-heavy benches, they will give similar real-world throughput, i.e. 29% or more. They pretty much doubled their AVX throughput in one go; the average (across many other applications) could be half or a third of this.


londiste said:


> Well, that definitely is not IPC, at least not until we know the clocks. The headline is bullshit.
> 
> For the rest of it, what exact tests are those? Zen2 apparently gets proper AVX which will indeed boost certain workloads considerably.


AMD's probably given their best case performance numbers, why do you need to know the clocks when they've said the IPC is higher based on a *performance index*? Do you suppose they'll do an Intel here?


----------



## btarunr (Nov 12, 2018)

AMD's "59% higher" claims for Zen1 over Excavator invited the same ridicule. 

Lisa Su is very careful about the guidance she puts out.


----------



## kastriot (Nov 12, 2018)

Is this 29% based on the same clock speeds, Zen 1 vs. Zen 2, or on a boosted Zen 2 core clock (like 4.5-4.8 GHz)?


----------



## londiste (Nov 12, 2018)

R0H1T said:


> AMD's probably given their best case performance numbers, why do you need to know the clocks when they've said the IPC is higher based on a *performance index*? Do you suppose they'll do an Intel here?


IPC = Instructions Per *Clock*.

Edit:
I was wrong, AMD does say these tests measure IPC.
http://ir.amd.com/news-releases/new...performance-datacenter-computing-next-horizon


> Estimated increase in instructions per cycle (IPC) is based on AMD internal testing for “Zen 2” across microbenchmarks, measured at 4.53 IPC for DKERN +RSA compared to prior “Zen 1” generation CPU (measured at 3.5 IPC for DKERN + RSA) using combined floating point and integer benchmarks.



Didn't Zen have hardware acceleration for RSA?


----------



## R0H1T (Nov 12, 2018)

londiste said:


> IPC = Instructions Per *Clock*.
> I mean, we sure use the term incorrectly already but the clock part there is still crucial. I suppose the Performance Index comes from test results. Tests are run at some clock speed which are much more likely to be higher than Zen/Zen+ results, especially as AMD themselves makes no note of IPC.


Yes, but we don't even know what the performance index indicates in this case; for instance, do you know whether the tests were carried out at fixed clocks? But when AMD says (officially?) that the IPC gain is about 30%, they can't be lying about it; IPC is a specific term, and AFAIK Intel and AMD know exactly what it means. The point being: take this application/result as a best-case scenario given what we already know about Zen 2, like better AVX. Deriving anything more from the headline-grabbing number is pointless.


----------



## randomUser (Nov 12, 2018)

Simple math.

If Zen1 IPC is 1.00
Zen2 IPC is 29% higher than Zen1, so it will be 1.29

This means, that:
Zen1 will handle 1 instruction per 1 clock cycle 
Zen2 will handle 1.29 instructions per 1 clock cycle.

If your task requires 1000 instructions to be completed, then:
Zen1 will finish this task in 1000 clock cycles;
Zen2 will finish this task in 775 clock cycles.
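The arithmetic above can be sketched in a few lines of Python (illustrative only; the IPC values are the post's normalized figures, not measured numbers):

```python
def cycles_needed(instructions: int, ipc: float) -> float:
    """Clock cycles needed to retire a given number of instructions."""
    return instructions / ipc

zen1 = cycles_needed(1000, 1.00)  # 1000.0 cycles
zen2 = cycles_needed(1000, 1.29)  # ≈ 775.2 cycles
print(zen1, round(zen2))  # 1000.0 775
```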


----------



## TheGuruStud (Nov 12, 2018)

So 15% real world seems very doable. Oh, intel, luz. Better luck next time with your 15% in 8 yrs lol


----------



## Lionheart (Nov 12, 2018)

29% seems like a pipe dream but hey, I welcome it with open arms, I suspect 15% which is still a decent bump IMO


----------



## dj-electric (Nov 12, 2018)

If Zen 2's gaming performance is similar per-core to Coffee Lake across the board, I'd have to slap my face a few times.
That would be waking up to a new reality, one that last existed over 12 years ago. Point some guns at me, I have skepticism about that.


----------



## MDDB (Nov 12, 2018)

"(...) is part of the 8-core chiplets that aren't subdivided into CCX (8 cores per CCX). "

Is this confirmed, that the CCXs are 8 cores now? I don't think I've seen it stated explicitly anywhere; is there a source?


----------



## dj-electric (Nov 12, 2018)

MDDB said:


> "(...) is part of the 8-core chiplets that aren't subdivided into CCX (8 cores per CCX). "
> 
> Is this confirmed, that the CCXs are 8 cores now? I don't think I've seen it stated explicitly anywhere; is there a source?



It is clear as day from the design of new EPYC. It includes 8 chiplets of 8 cores each next to the IO controller to complete 64 cores.
The chiplets themselves are quite small, and 2 of them could very possibly fit into a dual-chiplet AM4 CPU with 16 cores.


----------



## bubbleawsome (Nov 12, 2018)

I like this news quite a bit. One of the quicker 6 core chips from this could be the replacement for my 4670k.


----------



## R0H1T (Nov 12, 2018)

dj-electric said:


> It is clear as day from the design of new EPYC. It includes 8 chiplets of 8 cores each next to the IO controller to complete 64 cores.
> The chiplets themselves are quite small, and 2 of them could very possibly fit into a dual-chiplet AM4 CPU with 16 cores.


It could still be 4 cores per CCX, from *AT* ~


Spoiler: *(AnandTech topology diagram; image not preserved)*
The biggest downside from this being the insane number of *IF* links to make Rome


----------



## bug (Nov 12, 2018)

Prima.Vera said:


> Bulldozer, Excavator, ... no thank you. No more hyping until the community benches are out.


You're right to point out that, historically, numbers put out in advance didn't do AMD any favors. However, in this case we already know there was work left to do, mainly around the memory controller; some at AMD confirmed as much around the Zen launch. So we knew there was (at least theoretical) untapped potential in Zen. Of course, the proof is still in the pudding, but unlike with Bulldozer and Excavator (which everyone knew were built on shaky ground), I believe AMD is at least worth the benefit of the doubt this time around. Plus, even if on average the improvement isn't 29% but 20%, it would still be enough to gain a solid lead on Intel.


----------



## dj-electric (Nov 12, 2018)

R0H1T said:


> It could still be 4 cores per CCX, from *AT* ~


Could very well be, but I'm not too sure how economically efficient it would be to separate them, since the die is a much smaller one.
If I had to bet, my guess is that they will always appear in full physical form, and of course AMD is going to take the liberty of shutting down cores, letting us also enjoy 10-12 core parts on AM4.

With Zen gen 1 the dies were huge compared to these.


----------



## EntropyZ (Nov 12, 2018)

I hope for AMD's sake they aren't getting overconfident. I'll wait until reviews come out to show how the improvements translate into performance gains in gaming and workstation workloads. They are surely keeping up their momentum to steamroll Intel; they are winning some battles, but they haven't won the war.


----------



## Aquinus (Nov 12, 2018)

R0H1T said:


> The biggest downside from this being the insane number of *IF* links to make Rome


The biggest benefit of moving I/O off to a separate die is that it makes the CCXs smaller (if you don't make them bigger), because all of that logic isn't in the CCX anymore and is instead located in the centralized I/O hub. Smaller dies mean better yields, and better yields mean an opportunity to add more cores.

Personally, my concern is with latency, but I'm not sure whether that's an unfounded issue or not. It's likely the case that it's more beneficial to move the I/O components. It's also possible that the I/O hub might not need to be done on the same process as the CCXs, which might further improve yields if the larger die is done on a more mature process.

I'm interested to see how Rome turns out because if it turns out well, it means that AMD is keeping up the pace that started with the first Zen chips which is necessary to keep Intel on the offensive. If AMD can effectively double the number of cores without too much more cost, then Intel is going to remain on the defensive.

Intel: We can make mainstream 8c/16t CPUs too.
AMD: Hold my beer.


----------



## Assimilator (Nov 12, 2018)

Aquinus said:


> Intel: We can make mainstream 8c/16t CPUs too.
> AMD: Hold my beer.



TBH I wouldn't call the 9900K "mainstream" due to its heat, price and availability. It's pretty clearly showing the limit of the Core uarch on 14nm, and I suspect that its successor will only show up once 10nm is fixed.


----------



## TheGuruStud (Nov 12, 2018)

bubbleawsome said:


> I like this news quite a bit. One of the quicker 6 core chips from this could be the replacement for my 4670k.



It seems to me a waste of materials to make anything less than an 8-core.


----------



## nemesis.ie (Nov 12, 2018)

@Aquinus It was confirmed at the NH event that the I/O chip is on 14nm.

My guess is that it could be from GF which keeps GF in the game.

@TheGuruStud I would think they will make all the chiplets 8c, but should still be able to cut them down for market segmentation and using ones with faulty parts. I'm sure that's what @bubbleawsome meant, buying a 6-core CPU that could be 1 x 8 core with 2 faulty cores or, if space allows on the AM4 package, potentially 2 x 8 cores with 10 faulty cores between them (the latter being less likely, those would more likely go to TR or Epyc parts depending on the clock speeds but it could be done).


----------



## bug (Nov 12, 2018)

Assimilator said:


> TBH I wouldn't call the 9900K "mainstream" due to its heat, price and availability. It's pretty clearly showing the limit of the Core uarch on 14nm, and I suspect that its successor will only show up once 10nm is fixed.


95W+ parts or scarcity are not new to the mainstream market. 
Even the price is not that out of this world, but at $500 it won't gain 10% market share, so yeah, not that mainstream after all.


----------



## windwhirl (Nov 12, 2018)

I think I'll keep my hopes for IPC improvement at 10-15 percent. Nearly 30% improvement is a bit too much to ask, although if it happens, well, that'd be nice.


----------



## WikiFM (Nov 12, 2018)

dj-electric said:


> It is clear as day from the design of new EPYC. It includes 8 chiplets of 8 cores each next to the IO controller to complete 64 cores.
> The chiplets themselves are quite small, and 2 of them could very possibly fit into a dual-chiplet AM4 CPU with 16 cores.



It is clear that chiplets have 8 cores, but 8 cores per CCX hasn't been confirmed yet.



R0H1T said:


> It could still be 4 cores per CCX, from *AT* ~
> 
> 
> Spoiler
> ...



Very pretty topology, where does it come from?



bug said:


> You're right to point out that, historically, numbers put out in advance didn't do AMD any favors. However, in this case we already know there was work left to do, mainly around the memory controller; some at AMD confirmed as much around the Zen launch. So we knew there was (at least theoretical) untapped potential in Zen. Of course, the proof is still in the pudding, but unlike with Bulldozer and Excavator (which everyone knew were built on shaky ground), I believe AMD is at least worth the benefit of the doubt this time around. Plus, even if on average the improvement isn't 29% but 20%, it would still be enough to gain a solid lead on Intel.



Could 20% be enough to have a lead on Intel? I thought Zen was still way behind Intel in single threaded performance or IPC.



Aquinus said:


> The biggest benefit of moving I/O off to a separate die is that it makes the CCXs smaller (if you don't make them bigger), because all of that logic isn't in the CCX anymore and is instead located in the centralized I/O hub. Smaller dies mean better yields, and better yields mean an opportunity to add more cores.
> 
> Personally, my concern is with latency, but I'm not sure whether that's an unfounded issue or not. It's likely the case that it's more beneficial to move the I/O components. It's also possible that the I/O hub might not need to be done on the same process as the CCXs, which might further improve yields if the larger die is done on a more mature process.



So what gives better yields, then: smaller dies at 7 nm, or a huge one at 14 nm? Yes, the I/O die is done on GloFo's 14 nm.


----------



## Vayra86 (Nov 12, 2018)

WikiFM said:


> It is clear that chiplets have 8 cores, but 8 cores per CCX hasn't been confirmed yet.
> 
> 
> 
> ...



15-20% is what they need to catch Intel clock-for-clock. Zen was way behind on *clocks*, not on IPC. But combine the two and you have a gap, yes. I do believe Zen 2 will comfortably close that gap, if it can clock to 4.5 ~ 4.6, Intel has nothing left to offer.


----------



## WikiFM (Nov 12, 2018)

Vayra86 said:


> 20% will put them on the level of Coffee Lake, give or take some insignificant workload specific gaps. Way behind on IPC? Not at all. Zen was way behind on *clocks*.



So CFL is clock to clock similar to Zen in IPC? Or in addition to higher IPC they clocked much faster? Anyway if Zen 2 can catch CFL, Intel should cancel Cannon Lake and launch Ice Lake next year to keep having the leadership. Intel should have published some preliminary data about IPC gains of Ice Lake by now.


----------



## Vayra86 (Nov 12, 2018)

WikiFM said:


> So CFL is clock to clock similar to Zen in IPC? Or in addition to higher IPC they clocked much faster? Anyway if Zen 2 can catch CFL, Intel should cancel Cannon Lake and launch Ice Lake next year to keep having the leadership. Intel should have published some preliminary data about IPC gains of Ice Lake by now.



Excuse my ninja edits.

CFL is ahead of Zen (1) and Zen 2 will probably close that gap, yes. Hopefully not just IPC but also clocks.

Intel should do a lot of things, but the reality is they have nothing on the table unless they can move to a smaller node.


----------



## Octopuss (Nov 12, 2018)

I don't care if it's only 10% above Zen+. I already considered buying the +, so this will only be better.


----------



## Caqde (Nov 12, 2018)

WikiFM said:


> Could 20% be enough to have a lead on Intel? I thought Zen was still way behind Intel in single threaded performance or IPC.



They trade blows in the IPC department, with AMD in the worst case 15% behind and in the best case 8% ahead. So depending on how things go with Zen 2, it could at least be level with Intel, and in most cases ahead, in IPC. A 20% average IPC increase would mean that, clock for clock, AMD would always be faster than any Coffee Lake chip out there. But if this 29% increase is true, then Intel has problems: even in the worst case, at 85% of the performance, a 29% boost means AMD is ~9.7% faster clock for clock (20% would mean 2% faster).

For the source of this info ->
https://www.techspot.com/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/
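The worst-case arithmetic in this post can be checked with a short sketch (the 85% baseline and the uplift figures are the assumptions stated above, not measured data):

```python
def relative_perf(baseline_ratio: float, uplift: float) -> float:
    """Zen 2 performance relative to Intel, clock for clock (1.0 = parity)."""
    return baseline_ratio * (1 + uplift)

big_uplift   = relative_perf(0.85, 0.29)  # ≈ 1.097, i.e. ~9.7% faster
small_uplift = relative_perf(0.85, 0.20)  # ≈ 1.02,  i.e. ~2% faster
print(big_uplift, small_uplift)
```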


----------



## bug (Nov 12, 2018)

WikiFM said:


> Could 20% be enough to have a lead on Intel? I thought Zen was still way behind Intel in single threaded performance or IPC.



Neah, Zen's IPC is neck and neck with Intel's. Intel wins in single-thread performance because, having fewer cores, they can push higher frequencies. But since they can't push 20% higher frequencies, 20% better IPC (even if maintaining the same clocks) will be enough to push AMD ahead.

(And yes, I'm aware there are specific scenarios where the IPC gap can be noticeable, but I'm talking about the average use case here.)


----------



## WikiFM (Nov 12, 2018)

Vayra86 said:


> Excuse my ninja edits.
> 
> CFL is ahead of Zen (1) and Zen 2 will probably close that gap, yes. Hopefully not just IPC but also clocks.
> 
> Intel should do a lot of things, but the reality is they have nothing on the table unless they can move to a smaller node.



Intel should have (re)designed the Ice Lake arch for 14+(++/+++) nm; it would be on the market by now. But they are so stubborn that the next arch won't come until 10 nm. With that in mind, will the next arch after Ice Lake come on 7 nm by 2025?


----------



## beautyless (Nov 12, 2018)

I want an AMD 8-core that is as fast as the 9900K and priced at $350.


----------



## Gungar (Nov 12, 2018)

windwhirl said:


> I think I'll keep my hopes for IPC improvement at 10-15 percent. Nearly 30% improvement is a bit too much to ask, although if it happens, well, that'd be nice.



Don't worry, a 10-15 percent IPC increase is already a pipe dream. And I'm not talking about application-specific performance-bump bullshit.


----------



## qcmadness (Nov 12, 2018)

A 29% IPC uplift claim is too much if the previous claim of "no significant bottleneck" in Zen is true.


----------



## Fabio (Nov 12, 2018)

It will be quite an achievement if AMD ends up on par with Intel, IPC-wise. x86 is a more than mature architecture; any improvement can only be a small one. Yes, improving latencies etc. can be important in some scenarios, but 29% more IPC is madness. Sure, Zen delivered +40%, but there we had Excavator as a reference...


----------



## WikiFM (Nov 12, 2018)

Caqde said:


> They trade blows in the IPC department, with AMD in the worst case 15% behind and in the best case 8% ahead. So depending on how things go with Zen 2, it could at least be level with Intel, and in most cases ahead, in IPC. A 20% average IPC increase would mean that, clock for clock, AMD would always be faster than any Coffee Lake chip out there. But if this 29% increase is true, then Intel has problems: even in the worst case, at 85% of the performance, a 29% boost means AMD is ~9.7% faster clock for clock (20% would mean 2% faster).
> 
> For the source of this info ->
> https://www.techspot.com/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/



Just read the review; very nice, but my conclusions are different from yours. The only win for the Ryzen 2600X was the PCMark Gaming Score, hahaha, that 8%. The Ryzen 2600X is 5% slower on average in productivity and apps, and 12% slower in gaming, against the 8700K, both at 4 GHz.



bug said:


> Neah, Zen's IPC is neck and neck with Intel's. Intel wins in single-thread performance because, having fewer cores, they can push higher frequencies. But since they can't push 20% higher frequencies, 20% better IPC (even if maintaining the same clocks) will be enough to push AMD ahead.
> 
> (And yes, I'm aware there are specific scenarios where the IPC gap can be noticeable, but I'm talking about the average use case here.)



Check that review https://www.techspot.com/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/page2.html you should find out that Ryzen 2600X is still behind Intel 8700K.


----------



## BorgOvermind (Nov 12, 2018)

Rome: a 2x FP performance increase per core, and an FP increase per socket. That is significant even if it doesn't fully translate into real-world benchmarks.

Intel at one point led by two manufacturing steps. 
Now Intel has nothing to answer this with, and is behind in every aspect except marketing dirty tricks (oh... 'deals').


----------



## bug (Nov 12, 2018)

WikiFM said:


> Check that review https://www.techspot.com/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/page2.html you should find out that Ryzen 2600X is still behind Intel 8700K.



You see "still behind", I see  "neck and neck".


----------



## Valantar (Nov 12, 2018)

R0H1T said:


> It could still be 4 cores per CCX, from *AT* ~
> 
> 
> Spoiler
> ...


While you're right that we don't know yet that the CCXes have grown to 8 cores (though IMO this seems likely given that every other Zen2 rumor has been spot on), that drawing is ... nonsense. First off, it proposes using IF to communicate between CCXes on the same die, which even Zen1 didn't do. The sketch directly contradicts what AMD said about their design, and doesn't at all account for the I/O die and its role in inter-chiplet communication. The layout sketched out there is incredibly complicated, and wouldn't even make sense for a theoretical Zen1-based 8-die layout. Remember, IF uses PCIe links, and even in Zen1 the PCIe links were common across two CCXes. The CCXes do thus not have separate IF links, but share a common connection (through the L3 cache, IIRC) to the PCIe/IF complex. Making these separate would be a _giant _step backwards in terms of design and efficiency. Remember, the uncore part of even a 2-die Threadripper consumes ~60W. And that's with two internal links, 64 lanes of PCIe and a quad-channel memory controller. The layout in the sketch above would likely consume >200W for IF alone.

Now, let's look at that sketch. In it, any given CCX is one hop away from 3-4 other CCXes, 2 hops from 3-5 CCXes, and 3 hops away from the remaining 7-10 CCXes. In comparison, with EPYC (non-Rome) and TR, all cores are 1 hop away from each other (though the inter-CCX hop is shorter/faster than the die-to-die IF hop). Even if this is "reduced latency IF" as they call it, that would be _ridiculous_. And again: what role does the I/O die play in this? The IF layout in that sketch makes no use of it whatsoever, other than linking the memory controller and PCIe lanes to eight seemingly random CCXes. This would make NUMA management an impossible flustercuck on the software side, and substrate manufacturing (seriously, there are _six IF links _in between each chiplet there! The chiplets are <100mm2! This is a PCB, not an interposer! You can't get that kind of trace density in a PCB.) impossible on the hardware side. Then there's the issue of this design requiring each CCX to have 4 IF links, but 1/4 of the CCXes only gets to use 3 links, wasting die area.

On the other hand, let's look at the layout that makes sense logically, hardware- and software-wise, and adds up with what AMD has said about EPYC: each chiplet has a single IF interface, and it connects to the I/O die. Only that, nothing more. The I/O die has a ring bus or similar interconnect that encompasses the 8 necessary IF links for the chiplets, an additional 8 for PCIe/external IF, and the memory controllers. This reduces the number of IF links running through the substrate from 30 in your sketch (6 per chiplet pair + 6 between them) to 8. It is blatantly obvious that the I/O die has been made specifically to make this possible. This would make every single core 1 hop (through the I/O die, but ultimately still 1 hop) away from any other core, while cutting the number of IF links to roughly a quarter. Why else would they design that _massive_ die?

Red lines. The I/O die handles low-latency shuffling of data between IF links, while also giving each chiplet "direct" access to DRAM and PCIe. All over the same single connection per chiplet. The I/O die is (at least at this time) a black box, so we don't know whether it uses some sort of ring bus, mesh topology, or large L4 cache (or some other solution) to connect these various components. But we do know that a layout like this is the only one that would actually work. (And yes, I know that my lines don't add up in terms of where the IF link is physically located on the chiplets. This is an illustration, not a technical drawing.)
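The hop-counting argument can be illustrated with a toy graph search. Purely illustrative: neither topology below is AMD's actual fabric, and each physical link counts as one hop here, so chiplet-to-chiplet through the I/O die comes out as 2 rather than the post's "1 hop through the I/O die":

```python
from collections import deque

def max_hops(links: dict[str, list[str]]) -> int:
    """Worst-case shortest-path length between any two nodes (plain BFS)."""
    worst = 0
    for start in links:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in links[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

chiplets = [f"c{i}" for i in range(8)]
# Star: every chiplet talks only to a central I/O die.
star = {"io": chiplets, **{c: ["io"] for c in chiplets}}
# Ring: each chiplet linked to its two neighbours, no I/O die.
ring = {c: [chiplets[(i - 1) % 8], chiplets[(i + 1) % 8]]
        for i, c in enumerate(chiplets)}

print(max_hops(star))  # 2 (chiplet → I/O die → chiplet)
print(max_hops(ring))  # 4 (opposite side of the ring)
```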






More on-topic, we need to remember that IPC is workload dependent. There might be a 29% increase in IPC in certain workloads, but generally, when we talk about IPC it is _average_ IPC across a wide selection of workloads. This also applies when running test suites like SPEC or GeekBench, as they run a wide variety of tests stressing various parts of the core. What AMD has "presented" (it was in a footnote, it's not like they're using this for marketing) is from two specific workloads. This means that a) this can very likely be true, particularly if the workloads are FP-heavy, and b) this is very likely not representative of total average IPC across most end-user-relevant test suites. In other words, this can be both true (in the specific scenarios in question) and misleading (if read as "average IPC over a broad range of workloads").
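That last point — a headline microbenchmark number versus an average across workloads — can be illustrated with made-up numbers (none of these uplifts besides the footnoted 29% are real data):

```python
# Hypothetical workload mix; every value except dkern_rsa is invented.
workload_uplifts = {
    "dkern_rsa": 0.29,  # the footnoted microbenchmark
    "compile":   0.10,  # assumed for illustration
    "gaming":    0.08,  # assumed for illustration
    "office":    0.05,  # assumed for illustration
}

avg = sum(workload_uplifts.values()) / len(workload_uplifts)
print(f"headline: +29%, average across this mix: {avg:+.0%}")  # → +13%
```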


----------



## btarunr (Nov 12, 2018)

dj-electric said:


> The chiplets themselves are quite small, and 2 of them could very possibly fit into a dual-chiplet AM4 CPU with 16 cores.



There are two ways AMD could build a 16-core AM4 processor: 
1. Two 8-core chiplets with a smaller I/O die that has 2-channel memory, 32-lane PCIe gen 4.0 (with external redrivers), and the same I/O as current AM4 dies such as ZP or PiR.
2. A monolithic die with two 8-core CCXs and a fully integrated chipset, like ZP or PiR. Such a die wouldn't be any bigger than today's PiR.
I think option two is more feasible for low-margin AM4 products.


----------



## bug (Nov 12, 2018)

btarunr said:


> There are two ways AMD could build a 16-core AM4 processor:
> 
> Two 8-core chiplets with a smaller I/O die that has 2-channel memory, 32-lane PCIe gen 4.0 (with external redrivers), and the same I/O as current AM4 dies such as ZP or PiR.
> A monolithic die with two 8-core CCX's, and fully integrated chipset like ZP or PiR. Such a die wouldn't be any bigger than today's PiR.
> I think option two is more feasible for low-margin AM4 products.


At the same time, for low-margins 8 core is more than enough 
But let's wait and see.


----------



## btarunr (Nov 12, 2018)

bug said:


> At the same time, for low-margins 8 core is more than enough
> But let's wait and see.



AMD wants to moar-koar the sh** out of Intel's R&D budget, forcing Intel to spend its money on moar-koaring to keep up, because the software ecosystem is finally waking up to moar-koar. At the same time, AMD is mindful that when Intel gets its 10 nm off the ground, it will introduce its first major IPC uplifts since 2015, or perhaps even since Nehalem. So AMD needs double-digit percentage IPC increments in addition to 100% core-count increases across the board, while keeping the energy-efficiency edge from 7 nm.

It's somewhat like the USA-PRC military equation. For every dollar that China spends on developing a new military technology, the US probably spends $5 to keep its edge (thanks to lubricating K-street, the hill, MIC, higher costs, etc.).


----------



## Smartcom5 (Nov 12, 2018)

bug said:


> Neah, Zen's IPC is neck and neck with Intel's. Intel wins in single-thread performance because, having fewer cores, they can push higher frequencies. But since they can't push 20% higher frequencies, 20% better IPC (even if maintaining the same clocks) will be enough to push AMD ahead.
> 
> (And yes, I'm aware there are specific scenarios where the IPC gap can be noticeable, but I'm talking about the average use case here.)


Excuse me sir, but you _misspelled_ *IPS*! _When will people finally learn the difference, ffs?!_

There's the _IPC_, and then there's _IPS_.
*IPC* or I/c → *I*nstructions *p*er (Clock-) *C*ycle
*IPS* or I/s → *I*nstructions *p*er *S*econd

The latter one, thus IPS, is often used synonymously with actual _single-thread performance_ – where AMD no longer lags behind Intel, and surely _not to the extent_ it did back when Bulldozer was the pinnacle of the ridge.

*Rule of thumb:*
IPC _does not scale_ with frequency but is rather _fix·ed_ (within margins, depends on context and kind of [code-] instructions¹, you got the idea).
IPS is the IPC put into a time relation, pretty much per the formula → `IPS = IPC × f` (where f is the clock frequency), simply put.
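A minimal sketch of that rule of thumb (illustrative numbers only; the clocks are hypothetical):

```python
def ips(ipc: float, clock_hz: float) -> float:
    """Instructions per second = instructions per cycle × cycles per second."""
    return ipc * clock_hz

# Same IPC, different clocks → different single-thread throughput (IPS).
print(ips(1.0, 4.0e9))  # hypothetical 4.0 GHz part
print(ips(1.0, 4.5e9))  # hypothetical 4.5 GHz part
```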

So your definition of IPC quoted above would rather be called „*I*nstructions *p*er *C*lock at the *W*all“ like IPC@W.
So please, stop using _right_ terms and definitions for _wrong_ contexts, learn the difference between those two and get your shit together please! 



¹ The value IPC is (depending on kind) absolute² and fixed, yes.
However, it _is_ crucially dependent on the _type and kind of instructions_ and can vary rather _starkly_ with different kinds of instructions – since, per definition, the IPC figure only reflects how many instructions can be processed _on average per (clock-) cycle_.

On synthetic code like instructions with low logical depth or level and algorithmic complexity, which are suited to be processed rather shortly, the resulting value is obviously pretty high – whereas on instructions with a rather high complexity and long length, the IPC-value can only reach rather low figures. In this particular matter, even the contrary can be the case, so that it needs _more than one or even a multitude of cycles_ to process a single given complex instruction. In this regard we're speaking of the reciprocal multiplicative, thus the inverse (-value).
… which is also standardised as being defined as (Clock-) *C*ycles *p*er *I*nstruction or C/I, short → CPI.
² In terms of _non-varying_, as opposed to _relative_.

_Read_:
Wikipedia • Instructions per cycle
Wikipedia • Instructions per second
Wikipedia • Cycles per instruction


Smartcom


----------



## intelzen (Nov 12, 2018)

btarunr said:


> when Intel gets its 10 nm off the ground, it will introduce its first major IPC uplifts since 2015, or perhaps even since Nehalem


Since Sandy Bridge (2011) there have been no more than ~5% IPC gains per generation from Intel, and in the last 2 "generations" = 0% IPC gains... let's hope it arrives in early 2020.


----------



## Valantar (Nov 12, 2018)

btarunr said:


> AMD wants to moar-koar the sh** out of Intel's R&D budget, so they spend their money on moar-koaring to keep up, because software ecosystem is finally waking up to moar-koar. At the same time, it's mindful that when Intel gets its 10 nm off the ground, it will introduce its first major IPC uplifts since 2015, or perhaps even since Nehalem. So it needs double-digit percentage IPC increments in addition to 100% core-count increases across the board, while keeping the energy-efficiency edge from 7 nm.
> 
> It's somewhat like the USA-PRC military equation. For every dollar that China spends on developing a new military technology, the US probably spends $5 to keep its edge (thanks to lubricating K-street, the hill, MIC, higher costs, etc.).


While you have a point, wouldn't that also mean using partially disabled 16-core dice for even =/< 8-core chips (including the low end), given that this would then be the only chip with the required I/O? This sounds too inflexible to make sense for the wide range of SKUs needed for this market. Even if they push high-end MSDT to 16 cores, the majority of sales volume will be in the 4-6 core range (unless these chips are _crazy _cheap), with 8 cores likely being the enthusiast sweet spot. That would require _a lot_ of partially disabled silicon. As such, doesn't it sound more likely to keep the chiplets across the range (possibly excluding mobile)? This might be slightly more expensive in assembly, but on the other hand disabling >/= 50% of your die for 80-90% of your sales doesn't exactly make economic sense either. I'd bet the former would be cheaper than the latter, as you'd get more than 2x the usable dice out of a wafer this way.


----------



## Gasaraki (Nov 12, 2018)

Prima.Vera said:


> Bulldozer, Excavator, ... no thank you. No more hyping until the community benches are out.



Remember when Ryzen first came out? That shit was hyped through the roof.



TheGuruStud said:


> So 15% real world seems very doable. Oh, intel, luz. Better luck next time with your 15% in 8 yrs lol



So HOW long did AMD take to get "here" (Zen+)? They are still not ahead. We shall see with Zen 2.


----------



## Vayra86 (Nov 12, 2018)

WikiFM said:


> Intel should have (re)designed Ice Lake arch on 14+(++,+++) nm. It would be in the market by now, but they are so stubborn that the next arch will come till 10 nm. With that in mind next arch after Ice Lake would come in 7 nm by 2025?



Should have... but would they be able to? A new node enables a new design, I think, and the compromises needed to do it on 14 nm would kill the advantage anyway. 14 nm is clearly pushed to the limit, and even past it for some parts if you look at their stock temps (9th gen, hi).



Smartcom5 said:


> Excuse me sir, but you _misspelled_ *IPS*! _When people will finally learn the difference ffs?! _




Eh... IPS in my mind is In Plane Switching for displays.

He spelled it fine, you didn't read it right.


----------



## bug (Nov 12, 2018)

Smartcom5 said:


> Excuse me sir, but you _misspelled_ *IPS*! _When people will finally learn the difference ffs?!_
> 
> There's the _IPC_, and then there's _IPS_.
> *IPC* or I/c → *I*nstructions *p*er (Clock-) *C*ycle
> ...


No, I meant just what I said/wrote


----------



## HD64G (Nov 12, 2018)

Valantar said:


> While you're right that we don't know yet that the CCXes have grown to 8 cores (though IMO this seems likely given that every other Zen2 rumor has been spot on), that drawing is ... nonsense. First off, it proposes using IF to communicate between CCXes on the same die, which even Zen1 didn't do. The sketch directly contradicts what AMD said about their design, and doesn't at all account for the I/O die and its role in inter-chiplet communication. The layout sketched out there is incredibly complicated, and wouldn't even make sense for a theoretical Zen1-based 8-die layout. Remember, IF uses PCIe links, and even in Zen1 the PCIe links were common across two CCXes. The CCXes do thus not have separate IF links, but share a common connection (through the L3 cache, IIRC) to the PCIe/IF complex. Making these separate would be a _giant _step backwards in terms of design and efficiency. Remember, the uncore part of even a 2-die Threadripper consumes ~60W. And that's with two internal links, 64 lanes of PCIe and a quad-channel memory controller. The layout in the sketch above would likely consume >200W for IF alone.
> 
> Now, let's look at that sketch. In it, any given CCX is one hop away from 3-4 other CCXes, 2 hops from 3-5 CCXes, and 3 hops away from the remaining 7-10 CCXes. In comparison, with EPYC (non-Rome) and TR, all cores are 1 hop away from each other (though the inter-CCX hop is shorter/faster than the die-to-die IF hop). Even if this is "reduced latency IF" as they call it, that would be _ridiculous_. And again: what role does the I/O die play in this? The IF layout in that sketch makes no use of it whatsoever, other than linking the memory controller and PCIe lanes to eight seemingly random CCXes. This would make NUMA management an impossible flustercuck on the software side, and substrate manufacturing (seriously, there are _six IF links _in between each chiplet there! The chiplets are <100mm2! This is a PCB, not an interposer! You can't get that kind of trace density in a PCB.) impossible on the hardware side. Then there's the issue of this design requiring each CCX to have 4 IF links, but 1/4 of the CCXes only gets to use 3 links, wasting die area.
> 
> ...


Agreed. Interesting graph, but I also think it has mistakes. AMD put this central die in the middle of the chiplets to allow all of them to be as close as possible to it. And they put the memory controller there to cancel the need for those chiplets to communicate with each other at all. The CPU will use as many cores as needed by the software and use the I/O chip to do the rest. And that is why, imho, this arch is brilliant and the only way to increase core count without increasing latency to the moon. We are watching a true revolution in computing here. My 5 cents.


----------



## bug (Nov 12, 2018)

Vayra86 said:


> Eh... IPS in my mind is In Plane Switching for displays.
> 
> He spelled it fine, you didn't read it right.



Happens to me too from time to time. Especially when I read or post in a hurry.


----------



## Markosz (Nov 12, 2018)

Oh, investor meeting... then let's take half of what they said


----------



## Valantar (Nov 12, 2018)

Smartcom5 said:


> Excuse me sir, but you _misspelled_ *IPS*! _When people will finally learn the difference ffs?!_





Vayra86 said:


> Eh... IPS in my mind is In Plane Switching for displays.
> 
> He spelled it fine, you didn't read it right.


Agreed. There's nothing wrong with saying "Intel has a clock speed advantage, but AMD might beat them in actual performance through increasing IPC." There's nothing in that saying that clock speed affects IPC, only that clock speed is a factor in actual performance. Which it is. What @Smartcom5 is calling "IPS" is just actual performance (which occurs in the real world, and thus includes time as a factor, and thus also clock speed) and not the intentional abstraction that IPC is. This seems like a fundamental misunderstanding of why we use the term IPC in the first place (to separate performance from the previous misunderstood oversimplification that was "faster clocks=more performance").


----------



## Dante Uchiha (Nov 12, 2018)

btarunr said:


> There are two ways AMD could built a 16-core AM4 processor:
> 
> Two 8-core chiplets with a smaller I/O die that has 2-channel memory, 32-lane PCIe gen 4.0 (with external redrivers), and the same I/O as current AM4 dies such as ZP or PiR.
> A monolithic die with two 8-core CCX's, and fully integrated chipset like ZP or PiR. Such a die wouldn't be any bigger than today's PiR.
> I think option two is more feasible for low-margin AM4 products.



That's not realistic. 16c is not feasible for consumers:

- 16c with high clocks would have a high TDP; current motherboards would have problems supporting it.
- A 16c part would have to cost double the current price of the 2700X, and even then AMD would make a lower profit per CPU sold.
- An 8c CPU is more than enough for gaming, even for future releases.

Would you buy a 3700X @ 16c at ~US$599? Or would a 3700X with "just 8c", low latency, optimized for gaming at ~US$349-399 be better?


----------



## R0H1T (Nov 12, 2018)

We're not getting 32 PCIe 4.0 lanes on AM4; I'd be (really) shocked if that were the case.

*Valantar *with the entire I/O & MC off the die, it opens up a world of possibilities for Zen. Having said that, I'll go back again to the point I made in other threads. The 8-core CCX makes sense for servers & perhaps HEDT; however, when it comes to APUs (mainly *notebooks*) I don't see a market for 8 cores there. I also don't see AMD selling an APU with 6/4 cores disabled, even in high-end desktops/notebooks.

The point I'm making is that either AMD makes two CCXes, one with 8 cores & the other with 4, or they'll probably go with the same 4-core CCX. The image I posted is probably misconstrued; I also don't know for certain if the link shown inside the die is *IF* or just a logical connection (via L3?) between the 2 CCXes.


----------



## Valantar (Nov 12, 2018)

HD64G said:


> Interesting graph but I think it has mistakes. AMD put this central die in the middle of the chiplets to allow all of them to be as close as possible to it. And they put the memory controller there to cancel the need for those chiplets to communicate at all. The CPU will use as many cores as needed by the software and use the I/O chip to do the rest. And that is why imho this arch is brilliant and the only way to increase core count without increasing latency to the moon. We are watching a true revolution in computing here. My 5 cents.


You're phrasing this as if you're arguing against me, yet what you're saying is exactly what I'm saying. Sounds like you're replying to the wrong post. The image I co-opted came from the quoted post, I just sketched in how I believe they'll lay this out.


----------



## TheinsanegamerN (Nov 12, 2018)

If AMD managed a 15% IPC increase over OG zen, I would be amazed. I was expecting around 10%. 

There is no way they will hit 20-29%. That is just wishful thinking on AMD's part, most likely in specific scenarios. 

Of course, I'd love to be proved wrong here.


----------



## Valantar (Nov 12, 2018)

R0H1T said:


> We're not getting 32 PCIe 4.0 lanes on AM4, I'd be (really) shocked if that were the case.
> 
> *Valantar *with the entire I/O & MC off the die it opens up a world of possibilities with Zen, having said that I'll go back again to the point I made in other threads. The 8 core CCX makes sense for servers & perhaps HEDT, however when it comes to APU (mainly *notebooks*) I don't see a market for 8 cores there. I also don't see AMD selling an APU with 6/4/2 cores disabled, even if it is high end desktop/notebooks.
> 
> The point I'm making is that either AMD makes two CCX, one with 8 cores & the other with 4, or they'll probably go with the same 4 core CCX. The image I posted is probably misconstrued, I also don't know for certain if the link shown inside the die is *IF* or just a logical connection between 2 CCX.


I partially agree with that - it's very likely they'll put out a low-power 4-ish-core chiplet for mobile. After all, the mobile market is bigger than the desktop market, so it makes more sense for this to get bespoke silicon. What I disagree with is the need for the 8-core to be based off the same CCX as the 4-core. If they can make an 8-core CCX, equalising latencies between cores on the same die, don't you think they'd do so? I do, as that IMO qualifies as "low-hanging fruit" in terms of increasing performance from Zen/Zen+. This would have performance benefits for every single SKU outside of the mobile market. And, generally, it makes sense to assume that scaling down core count per CCX is no problem, so having an 8-core version is no hindrance to also having a 4-core version.

How I envision AMD's Zen2 roadmap:

Ryzen Mobile:
15-25W: 4-core chiplet + small I/O die (<16 lanes PCIe, DC memory, 1-2 IF links), either integrated GPU on the chiplet or separate iGPU chiplet
35-65W: 8-core chiplet + small I/O die (<16 lanes PCIe, DC memory, 1-2 IF links), separate iGPU chiplet or no iGPU (unlikely, iGPU useful for power savings)

Ryzen Desktop:
Low-end: 4-core chiplet + medium I/O die (< 32 lanes PCIe, DC memory, 2 IF links), possible iGPU (either on-chiplet or separate)
Mid-range: 8-core chiplet + medium I/O die (< 32 lanes PCIe, DC memory, 2 IF links), possible iGPU on specialized SKUs
High-end: 2x 8-core chiplet + medium I/O die (< 32 lanes PCIe, DC memory, 2 IF links)

Threadripper:
(possible "entry TR3": 2x 8-core chiplet + large I/O die (64 lanes PCIe, QC memory, 4 IF links), though this would partially compete with high-end Ryzen just with more RAM B/W and PCIe and likely only have a single 16-core SKU, making it unlikely to exist)
Main: 4x 8-core chiplet + large I/O die (64 lanes PCIe, QC memory, 4 IF links)

EPYC:
Small: 4x 8-core chiplet + XL I/O die (128 lanes PCIe, 8C memory, 8 IF links)
Large: 8x 8-core chiplet + XL I/O die (128 lanes PCIe, 8C memory, 8 IF links)

Uncertainty:
-Mobile might go with an on-chiplet iGPU and only one IF link on the I/O die, but this would mean no iGPU on >4-core mobile SKUs (unless they make a third chiplet design), while Intel already has 6-cores with iGPUs. As such, I'm leaning towards 2 IF links and a separate iGPU chiplet for ease of scaling, even if the I/O die will be slightly bigger and IF power draw will increase.

Laying out the roadmap like this has a few benefits:
-Only two chiplet designs across all markets.
-Scaling happens through I/O dice, which are made on an older process, are much simpler than CPUs, and should thus be both quick and cheap to make various versions of.
-A separate iGPU chiplet connected through IF makes mobile SKUs easier to design, and the GPU die might be used in dGPUs also.
-Separate iGPU chiplets allow for multiple iGPU sizes - allowing more performance on the high end, or less power draw on the low end.
-Allows for up to 8-core chips with iGPUs in both mobile and desktop.

Of course, this is all pulled straight out of my rear end. Still, one is allowed to dream, no?



TheinsanegamerN said:


> If AMD managed a 15% IPC increase over OG zen, I would be amazed. I was expecting around 10%.
> 
> There is no way they will hit 20-29%. That is just wishful thinking on AMD's part, most likely in specific scenarios.
> 
> Of course, I'd love to be proved wrong here.


Well, they claim to have _measured _a 29.4% increase. That's not wishful _thinking_ at least. But as I pointed out in a previous post:


Valantar said:


> We need to remember that IPC is workload dependent. There might be a 29% increase in IPC in certain workloads, but generally, when we talk about IPC it is _average_ IPC across a wide selection of workloads. This also applies when running test suites like SPEC or GeekBench, as they run a wide variety of tests stressing various parts of the core. What AMD has "presented" (it was in a footnote, it's not like they're using this for marketing) is from two specific workloads. This means that a) this can very likely be true, particularly if the workloads are FP-heavy, and b) this is very likely not representative of total average IPC across most end-user-relevant test suites. In other words, this can be both true (in the specific scenarios in question) and misleading (if read as "average IPC over a broad range of workloads").


----------



## TheinsanegamerN (Nov 12, 2018)

Valantar said:


> Well, they claim to have _measured _a 29.4% increase. That's not wishful _thinking_ at least. But as I pointed out in a previous post:


AMD also "claimed" to have dramatically faster CPUs with Bulldozer, and "claimed" Vega would be dramatically faster than it ended up being. AMD here "claims" to have measured a 29.4% increase in IPC. But that might have been in a workload that uses AVX, is heavily threaded, or is somehow built to take full advantage of Ryzen.

I'll wait for third party benchmarks. AMD has made way too many *technically true claims over the years.

*Technically true in one specific workload; overall the performance boost was less than half what AMD claimed, but it was true in one workload, so technically they didn't lie.


----------



## Vya Domus (Nov 12, 2018)

randomUser said:


> If you your task requires 1000 instructions to be completed, then:
> Zen1 will finish this task in 1000 clock cycles;
> Zen2 will finish this task in 775 clock cycles.



That's not how this works; not all instructions see the same improvement.
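That point can be illustrated with a toy model (every fraction and CPI figure below is invented for illustration, not AMD's data): aggregate IPC is instructions divided by cycles over the whole instruction mix, so doubling the speed of only one instruction class lifts the aggregate far less than 2x.

```python
# Toy model: aggregate IPC over an instruction mix. All fractions and
# CPI figures below are invented for illustration - not AMD's numbers.
mix = {
    # class:   (fraction of instructions, CPI before, CPI after)
    "integer": (0.60, 1.0, 1.0),   # unchanged
    "float":   (0.30, 2.0, 1.0),   # hypothetical 2x faster FP unit
    "memory":  (0.10, 4.0, 4.0),   # unchanged
}

def aggregate_ipc(mix: dict, after: bool = False) -> float:
    """IPC for the whole mix: 1 / weighted-average CPI."""
    avg_cpi = sum(frac * (new if after else old)
                  for frac, old, new in mix.values())
    return 1.0 / avg_cpi

before = aggregate_ipc(mix)              # 1 / 1.6 ≈ 0.625
boosted = aggregate_ipc(mix, after=True) # 1 / 1.3 ≈ 0.769
print(boosted / before - 1.0)            # ≈ 0.23: a 2x FP gain → only ~23 % overall
```

So a task "requiring 1000 instructions" finishes in fewer cycles only in proportion to how much of it consists of the improved instruction classes.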



TheinsanegamerN said:


> I'll wait for third party benchmarks. AMD, *Intel, Nvidia have* made way too many *technically true claims over the years.



Fixed it.


----------



## GlacierNine (Nov 12, 2018)

Valantar said:


> I partially agree with that - it's very likely they'll put out a low-power 4-ish-core chiplet for mobile. After all, the mobile market is bigger than the desktop market, so it makes more sense for this to get bespoke silicon. What I disagree with is the need for the 8-core to be based off the same CCX as the 4-core. If they can make an 8-core CCX, equalising latencies between cores on the same die, don't you think they'd do so? I do, as that IMO qualifies as "low-hanging fruit" in terms of increasing performance from Zen/Zen+. This would have performance benefits for every single SKU outside of the mobile market. And, generally, it makes sense to assume that scaling down core count per CCX is no problem, so having an 8-core version is no hindrance to also having a 4-core version.



I disagree, for one very simple reason - Tooling up production for 2 different physical products/dies would likely be more expensive than the material savings in not using as much silicon per product. This stuff is not cheap to do, and in CPU manufacture, volume savings are almost always much more dramatic than design/material savings.

Serving Mainstream, HEDT, and Server customers from a single die integrated into multiple packages, is one of the main reasons AMD are in such good shape right now - Intel has to produce their Mainstream, LCC, HCC, and XCC dies and then bin and disable cores on all 4 of them for each market segment. AMD only has to produce and bin one die, to throw onto a variety of packages at *every level* of their product stack.

It's not even worth producing a second die unless the move would bring in not only more profit, but enough extra profit to completely cover the cost of tooling up for that. Bear in mind here that I mean something very specific:

If AMD spends 1bn to produce a second die, and rakes in 1.5bn extra profit over last year, that doesn't necessarily mean tooling up for the extra die was worth it. What if their profits still would have gone up by 1bn anyway, using a single die in production? If that were the case, tooling up just cost AMD a cool $1,000,000,000 in order to make $500,000,000. Sure, they might have gained a bit more marketshare, but not only did it lose them money, it also ended up making their product design procedures more complex and caused additional overheads right the way up through every level of the company, keeping track of the two independent pieces of silicon. It also probably means having further stratification in motherboards and chipsets, whereas right now AMD are very flexible in what they can do to bring these packages to older chipsets or avoid bringing in new ones.

Edit: Not to mention that using a single, much higher-capability die has other benefits - like being able to provide customers with a *much* longer support period for upgrades - something that has already won them sales, with their "AM4 until 2020" approach bringing in consumers who are sick of Intel's socket and chipset-hopping. 

Or simply being able to unlock CCXs on new products as and when the market demands that - After all, why would you intentionally design a product that reduces your ability to respond to competition, when your competition is Intel, who you *know* are scrambling to use their higher R&D budget to smack you down again before you get too far ahead?


----------



## dirtyferret (Nov 12, 2018)

I could "potentially" be making 29% more money next year if the company owner doesn't get in the way.


----------



## B-Real (Nov 12, 2018)

Prima.Vera said:


> Bulldozer, Excavator, ... no thank you. No more hyping until the community benches are out.


You will see this after they are released.  Even if there is only a ~10% increase over Zen+, they will be on par with Intel in FHD games tested with a 2080 Ti.


----------



## GlacierNine (Nov 12, 2018)

bug said:


> 95W+ or scarcity are not new to the mainstream market
> Even the price is not that out of this world, but at $500 it won't gain 10% market share, so yeah, not that mainstream after all.


"95W+" is a bit misleading. Nobody should be looking at the 9900K and pretending it's simply a return to the hotter chips of yore. The fact is, it's actually a dramatically hotter chip than almost anything that has come before it, and the only reason we're able to tame it is because the coolers we use these days are so much more capable. At the time we were dealing with Intel Prescott chips, one of the best coolers you could buy was the Zalman CNPS9500. Noctua were only just about to release the *first* NH-U12. The undisputed king of the hill for air cooling was the Tuniq Tower 120, soon to be displaced by the original Thermalright Ultra 120.

The NH-D15 didn't exist. There were no AIOs of any kind, and that's why back then, we all struggled to cool Prescott Cores and first Gen i7s.

For example, The i7 975 was a 130W part. The fastest Pentium 4 chips were officially 115W. Intel's Datasheets of that time don't specify how TDP was calculated, but if we assume that they were doing what they do now, which is quote TDP at base clocks under a "close to worst case" workload, then we're probably in good shape.

The i7-975 then, had a 3,333MHz base clock, a 3.467 All-Core boost, and a 3.6GHz single core boost. Not a lot of boost happening here, only an extra 133MHz on all cores. You'd expect no real increase in temperatures under your cooler from such a mild overclock, unless you were OC'ing something like an old P3, so we can probably assume that means the Intel TDP from then, if measured according to today's standards, was probably pretty close to "correct" - You could expect your i7 975 to stick pretty close to that 130W TDP figure in a real world scenario. And this was legitimately a hard to cool chip! Even the best air coolers sometimes struggled.

Compare that to the 9900K, which is breaking 150W power consumption all over the internet, and you suddenly realise - The only reason these chips are surviving in the wild is because:

1 - Intel's current arch will maintain its maximum clocks way up into the 90+ Celsius range
2 - People are putting them under NH-D15s - and even then we're seeing temperature numbers that, back in the P4 days, would have been considered "uncomfortable" and "dangerous".

The 9900K is, as far as I can tell, simply the most power hungry and hard to cool processor that Intel has ever released on a mainstream platform. It runs at the *ragged edge* of acceptability. You can't just brush this sort of thing off with "The market has seen 95W chips before". That's not what the 9900K actually is. It's something much, much more obscene.


----------



## Smartcom5 (Nov 12, 2018)

bug said:


> No, I meant just what I said/wrote


Gosh, I'm really sorry, was my bad! 



Picked the wrong quote, was meant to quote @WikiFM … 


WikiFM said:


> … I thought Zen was still way behind Intel in single threaded performance or *IPC*.





Smartcom


----------



## bug (Nov 12, 2018)

GlacierNine said:


> "95W+" is a bit misleading. Nobody should be looking at the 9900K and pretending it's simply a return to the hotter chips of yore. The fact is, it's actually a dramatically hotter chip than almost anything that has come before it, and the only reason we're able to tame it is because the coolers we use these days are so much more capable. At the time we were dealing with Intel Prescott chips, one of the best coolers you could buy was the Zalman CNPS9500. Noctua were only just about to release the *first* NH-U12. The undisputed king of the hill for air cooling was the Tuniq Tower 120, soon to be displaced by the original Thermalright Ultra 120.



That is completely wrong. The 9900K is a 95W chip and will work within a 95W power envelope. It has the potential to work faster when unconstrained, but it will work with a 95W heat sink. Old Pentium Ds were 130W chips, and back then Intel's guidance was only for average power draw, not maximum (kind of like what those 95W mean today, though not exactly the same).
That said, there's no denying that what Intel has now is a redesign trying to fit more tricks into a process node which should be long behind us. Thus, it's an architecture stretched past its intended lifetime.


----------



## Smartcom5 (Nov 12, 2018)

Valantar said:


> What @Smartcom5 is calling "IPS" is just actual performance (which occurs in the real world, and thus includes time as a factor, and thus also clock speed) and not the intentional abstraction that IPC is.


I'm sorry, but I'm _not just_ 'calling' it as such – I just pointed out how things actually _are_ standardised. IPC, IPS and CPI in fact _are_ known and common figures, hence the wiki links. But as you can see, the whole thing isn't nearly as trivial as it might look.

That's why actual performance is usually measured by default using an _absolute and fixed unit_ like FLOPS (*F*loating-*P*oint *O*perations *P*er *S*econd) or MIPS (*M*illion *I*nstructions *p*er *S*econd) – i.e. the throughput of instructions per second _while_ processing an _equally pre-defined_ kind of instruction (in the FLOPS case, _floating-point operations_).


Smartcom


----------



## HD64G (Nov 12, 2018)

Valantar said:


> You're phrasing this as if you're arguing against me, yet what you're saying is exactly what I'm saying. Sounds like you're replying to the wrong post. The image I co-opted came from the quoted post, I just sketched in how I believe they'll lay this out.


My mistake indeed and I edited my post to correct the misunderstanding. Cheers!


----------



## GlacierNine (Nov 12, 2018)

bug said:


> That is completely wrong. 9900k is a 95W chip and will work within a 95W power envelope. It has potential to work faster when unconstrained, but it will work with a 95W heat sink. Old Pentium Ds were 130W chips and back then, Intel's guidance was only for average power draw, not maximal (kind of like those 95W mean today, though not exactly the same).
> That said, there's no denying what Intel has now is redesign trying to fit more tricks into the current process node which should be long behind us. Thus, it's an architecture stretched past its intended lifetime.


Oh please, stop the apologism. The 9900K will work within a 95W power envelope, yes. At 3.6GHz base clock, with occasional jumps to higher speeds where the cooling solution's "thermal capacitance" can be leveraged.

But these chips and this silicon aren't designed to be 3.6GHz parts in daily use. They are ~4.7GHz parts that Intel reduced the base clocks on, in order to be able to claim a 95W TDP. If you had the choice between running a 7700K and a 9900K at base clocks, the 7700K would actually get you the better gaming performance in most games. Would you say that's Intel's intention? To create a market where a CPU 2 generations old, with half the cores, outperforms their current flagship in exactly the task Intel advertise the 9900K to perform? 

Or would you say that actually, Intel has transitioned from using boost clock as "This is extra performance if you can cool it", to using boost clock as the figure expected to sell the CPU, and therefore the figure most users expect to see in use?

You can clearly see this in the progression of the flagships, each generation.

6700K - 4.0GHz Base, 4 Cores, 95W TDP
7700K - 4.2GHz Base, 4 Cores, 95W TDP
8700K - 3.7GHz Base, 6 Cores, 95W TDP
9900K - 3.6GHz Base, 8 Cores, 95W TDP.

Oh well would you look at that - As soon as Intel started adding cores, they dropped the base clocks dramatically in order to keep their "95W TDP at base clocks" claim technically true. But look at the all core boost clocks:

4.0GHz, 4.4GHz, 4.3GHz, 4.7GHz

They dipped by 100MHz on the 8700K, to prevent a problem similar to the 7700K, which was known to spike in temperature even under adequate cooling, only to come back up on the 9900K, but this time with Solder TIM to prevent that from happening.

Single core is the same story - 4.2, 4.5, 4.7, 5.0. A constant increase in clockspeed each generation.

Like I said - Boost is no longer a boost. Boost has become the expected performance standard of Intel chips. Once you judge the chips on that basis, the 9900K reveals itself to be a power hungry monster that makes the hottest Prescott P4 chips look mild in comparison.


----------



## bug (Nov 12, 2018)

GlacierNine said:


> The 9900K will work within a 95W power envelope, yes. At 3.6GHz base clock, with occasional jumps to higher speeds where the cooling solution's "thermal capacitance" can be leveraged.
> Oh please, stop the apologism. These chips aren't 95W, 3.6GHz parts that Intel have magically made capable of overclocking themselves by 1.1GHz on all cores. They are ~4.7GHz parts that Intel reduced the base clocks on, in order to be able to claim a 95W TDP. If you could go back in time and cast a magic spell that
> 
> You can clearly see this in the progression of the flagships, each generation.
> ...


I'm not sure where you and I disagree. All these CPUs will work at 95W at their designated baseline clocks. With beefier heat sinks you can extract more performance. Nothing has changed, except the boost algorithms that have become smarter. Would you prefer a hard 95W limitation instead or what's your beef here?


----------



## efikkan (Nov 12, 2018)

It seems to me like this article is based on a bad translation referring to a 29% performance uplift (partially due to increased FPU width). For starters, to estimate IPC the clock speed would have to be completely fixed (no boost). Secondly, in reality performance is not quite as simple as clock times "IPC", due to memory latency becoming a larger bottleneck at higher clocks.

A 29% IPC uplift would certainly be welcome, but keep in mind this is about twice the accumulated improvement from Sandy Bridge -> Skylake. I wonder how this thread would turn out if someone claimed Ice Lake would offer 29% IPC gains?
Let's not have another Vega Victory Dance™. We need to calm down this extreme hype and be realistic. Zen 2 is an evolved Zen; it will probably bring tweaks and small improvements across the design, but it will not be a major improvement over Zen.
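The memory-latency point can be sketched with a toy runtime model (every parameter below is invented, purely for illustration): only the core-bound part of the runtime scales with clock, while memory-latency stalls stay fixed in wall-clock time, so a 25 % clock bump yields well under 25 % more performance.

```python
# Toy model: why performance != clock x "IPC" once memory latency bites.
# All parameters are invented, purely for illustration.

def runtime_s(clock_hz: float,
              core_cycles: float = 8e9,     # cycles of actual compute work
              mem_stalls: float = 2e7,      # accesses that miss the caches
              mem_latency_s: float = 80e-9  # DRAM latency, fixed in seconds
              ) -> float:
    """Core-bound time scales with the clock; memory stalls do not."""
    return core_cycles / clock_hz + mem_stalls * mem_latency_s

base = runtime_s(4.0e9)     # 2.0 s compute + 1.6 s memory = 3.6 s
boosted = runtime_s(5.0e9)  # 25 % higher clock: 1.6 s + 1.6 s = 3.2 s
print(base / boosted)       # ≈ 1.125 - only ~12.5 % faster, not 25 %
```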


----------



## londiste (Nov 12, 2018)

efikkan said:


> It seems to me like this article is based on a bad translation referring to a 29% performance uplift (partially due to increased FPU width).


It actually isn't, the bad translation part I mean. This is from AMD themselves (see note 1):
http://ir.amd.com/news-releases/new...performance-datacenter-computing-next-horizon


----------



## R0H1T (Nov 12, 2018)

GlacierNine said:


> I disagree, for one very simple reason - *Tooling up production for 2 different physical products/dies would likely be more expensive than the material savings in not using as much silicon per product*. This stuff is not cheap to do, and in CPU manufacture, volume savings are almost always much more dramatic than design/material savings.
> 
> Serving Mainstream, HEDT, and Server customers from a single die integrated into multiple packages, is one of the main reasons AMD are in such good shape right now - Intel has to produce their Mainstream, LCC, HCC, and XCC dies and then bin and disable cores on all 4 of them for each market segment. AMD only has to produce and bin one die, to throw onto a variety of packages at *every level* of their product stack.
> 
> ...


The market (retail?) you're talking about is also huge, in fact bigger than *enterprise* even for Intel.
*If* the (extra) power savings materialize for ULP & ULV products then it makes sense to deploy a 4 core CCX over there, however an 8 core CCX will have better latencies & probably higher clocks as well.




bug said:


> I'm not sure where you and I disagree. All these CPUs will work at 95W at their designated baseline clocks. With beefier heat sinks you can extract more performance. Nothing has changed, except the boost algorithms that have become smarter. Would you prefer a hard 95W limitation instead or what's your beef here?


Fake 95W TDP?


----------



## GlacierNine (Nov 12, 2018)

bug said:


> I'm not sure where you and I disagree. All these CPUs will work at 95W at their designated baseline clocks. With beefier heat sinks you can extract more performance. Nothing has changed, except the boost algorithms that have become smarter. Would you prefer a hard 95W limitation instead or what's your beef here?


Smarter boost algorithms have absolutely nothing to do with this. Intel already had SpeedStep to take care of dynamically downclocking the CPU to lower power states during low-intensity workloads. In fact, they've had it since 2005, so long that their trademark on SpeedStep lapsed in 2012.

My 6700K has no trouble downclocking to save power when performance isn't needed. The 3rd-gen i5 I'm typing this on has no trouble with that either. Even the Pentium 4 660 had it, as a cursory Google shows. In fact, adoption of the power-saving tech was originally held back not by the platforms, but by a lack of operating system support for the feature. "Smarter" power-saving algorithms should have nothing to do with "Turbo Boost" technology.

We disagree in that you think it is reasonable for Intel to consider a 9900K "working according to spec" at 3.6GHz and "overclocked" at 4.7GHz, when these products are clearly designed to run at higher clocks, are expected to by consumers, and *will run* at higher clocks; it's just that this is only achievable at a *much* higher TDP than Intel claims the CPU actually has.

They can't have their cake and eat it - Either the 9900K is "The world's fastest gaming CPU (At 150W TDP)", or it is a 95W part (but isn't anywhere close to being the fastest gaming CPU at that TDP).

Intel should not be allowed to advertise this product as both of these mutually exclusive things.


----------



## GoldenX (Nov 12, 2018)

Buying AM4 was the best choice.


----------



## bug (Nov 12, 2018)

GlacierNine said:


> We disagree in that you think it is reasonable for Intel to consider a 9900K as "working according to spec" at 3.6GHz and "Overclocked" at 4.7GHz, when clearly these products are actually designed to run at higher clocks, and are expected to by consumers, and *will run* at higher clocks, it's just that it is only achievable at a *much* higher TDP than intel claims their CPU actually has.



Clearly? Are you sure about that?



GlacierNine said:


> They can't have their cake and eat it - Either the 9900K is "The world's fastest gaming CPU (At 150W TDP)", or it is a 95W part (but isn't anywhere close to being the fastest gaming CPU at that TDP).



There's no either/or here. It's both/and.
It used to be easy to say this CPU is better than that CPU when CPUs had a single core. It's become more complicated ever since.


----------



## Turmania (Nov 12, 2018)

I want at least 5 GHz on all cores from the new 2700X equivalent, plus a 15-20% IPC gain. Then I'm sold. Am I asking for too much? I do not think so, considering Intel has not played its die-shrink hand yet but will do so late in 2019; they will probably reach 5.5 GHz plus. I want to go all-AMD with CPU and GPU and use a FreeSync monitor, but AMD has to show me something for me to part with my money.


----------



## GlacierNine (Nov 12, 2018)

bug said:


> Clearly? Are you sure about that?
> 
> 
> 
> ...


1 - Yes, yes I am sure about that.

2 - Intel wants to claim it can be the fastest gaming CPU while being 95W TDP. That's simply not true. It's 95W, or it's fast. One or the other.

It's not a 95W part at the same time as being the fastest gaming CPU.
It's not the fastest gaming CPU at 95W.

Why are you so insistent on defending their clear attempt to advertise a dichotomous product in a misleading way? What do you get out of refusing to admit that Intel's CPU draws as much power as it actually does?


----------



## qcmadness (Nov 12, 2018)

bug said:


> Clearly? Are you sure about that?
> 
> 
> 
> ...


https://www.anandtech.com/show/13544/why-intel-processors-draw-more-power-than-expected-tdp-turbo


----------



## Daven (Nov 12, 2018)

qcmadness said:


> https://www.anandtech.com/show/13544/why-intel-processors-draw-more-power-than-expected-tdp-turbo


Please read that Anandtech story if you would like to understand TDP ratings on Intel chips and how things actually work.

EDIT: It was posted a few days ago so perfect timing for this thread.


----------



## WikiFM (Nov 12, 2018)

Vayra86 said:


> Should have... would they be able to? A new node enables a new design I think and the compromises to do it on 14nm would kill the advantage anyway. 14nm is clearly pushed to the limit, and even over it for some parts if you look at their stock temps, (9th gen hi).
> 
> Eh... IPS in my mind is In Plane Switching for displays.
> He spelled it fine, you didn't read it right



AMD increased IPC with each iteration of Bulldozer, all of them 32/28 nm. So why couldn't Intel increase IPC in 14 nm?
About the temps, by increasing IPC Intel could reduce clocks and still have higher performance with lower temps.



Smartcom5 said:


> Gosh, I'm really sorry, was my bad! View attachment 110394
> Picked the wrong quote, was meant to quote @WikiFM …
> 
> Smartcom



I said single threaded or IPC not because I think they are the same, but because Intel beats AMD in both.


----------



## R0H1T (Nov 12, 2018)

WikiFM said:


> AMD increased IPC with each iteration of Bulldozer, all of them 28 nm. So why couldn't Intel increase IPC in 14 nm?
> About the temps, by increasing IPC Intel could reduce clocks and still have higher performance with lower temps.
> 
> 
> ...


You mean once on 32nm & twice on 28nm? As for Intel ~ remember *Tick Tock*, which is now Tick Tock Tick Tick Tick

Lisa Su said that work on Zen 2 began 4 years back; sometime after that they would've realized that their vision could only be fulfilled on 7nm. Likewise for Intel: they've been working on Ice Lake for 4~6 years, & even if one assumes it could theoretically be backported to 14nm+++, that simply wouldn't work without major compromises to the final design. Just an FYI, retail chips are rumored to feature AVX512, which is simply not possible on this node. That *IPC gain* includes a hefty one-time benefit from AVX512.


----------



## sideside (Nov 12, 2018)

randomUser said:


> This means, that:
> Zen1 will handle 1 instruction per 1 clock ...



Yeah, but as usual whoever wrote this has no idea what IPC is; virtually everyone uses the term incorrectly.


----------



## Vayra86 (Nov 12, 2018)

WikiFM said:


> AMD increased IPC with each iteration of Bulldozer, all of them 28 nm. So why couldn't Intel increase IPC in 14 nm?
> About the temps, by increasing IPC Intel could reduce clocks and still have higher performance with lower temps.
> 
> 
> ...



Because architecture isn't bound to a node but to a design. A viable architecture can scale as the nodes get smaller - Core and Zen represent such architectures.

The real story is told by the IPC advances within the Core architecture, and those are so massive that they are responsible for the lead AMD is still working to catch up on. Intel's main issue with Core is that they pulled out all the stops already with Sandy Bridge, and it was so strong it remains relevant to this day. This is also why I, among others, say that Core has reached its peak and needs radical changes to remain relevant. It's the same with GCN. Everything shows that it has been stretched to the max.

The story with Bulldozer is different, and it's remarkably similar to how they approach GPUs up to today: as an iterative work in progress. You basically buy something that isn't fully optimized, and then you get to say 'ooh, aah' every time AMD releases an update because performance increased. Unfortunately, when the competition does go all the way, you end up with the inferior product, where optimization always trails reality.


----------



## Smartcom5 (Nov 12, 2018)

Valantar said:


> How I envision AMD's Zen2 roadmap:
> […]
> Of course, this is all pulled straight out of my rear end. Still, one is allowed to dream, no?


Can I beat that please? Since I dreamed about Intel being at least _somewhat_ competent all of a sudden too! 



Intel could – and I want that understood as my forecast for their _oh so_ awesome and totally revealing event on December 11th (hint: it won't be …) – help themselves quite a bit if they just blatantly copy AMD's Fineglue™. And I strongly suspect that they _will_ do exactly this:

How _I_ envision Intel's near-future road-map:

*Prediction:*
For the consumer market, Intel copies AMD's (UMI's?) HBM-like MCP approach and starts to manufacture CPUs glued together (hurr, hurr) from complete and rather unaltered common CPU dies using QPI UPI, like two dual-core and quad-core dies on a single chip – pretty much what they're about to do now with Cascade Lake in the server space. So a rather common approach at MCP level: just whole dies combined unaltered at PCB level.

After that, Intel in their second coming copies AMD at the technology level (once again) in the direction of a 'clustered CPU' and starts with a modular assembly using chiplets of different manufactured node sizes too, connected (optimally, _hopefully_) via their EMIB. That way they would be able to manufacture tiny core chiplets reduced to e.g. only 4 cores per chiplet (or just 2 cores, or even a single one). Such a pure-core chiplet or core complex would be so tiny that Intel could fab such chiplets hopefully even on their broken _totally working_ on track™ 10nm node.

That way, they wouldn't have to give up their ardently loved black hole called _iGPU_ too (or how I like to call it: »_Adventures of The Mighty Blue Electron facing Competition: The Road towards Iris-Graphics_«), while still bringing it in as a dedicated modular chiplet on e.g. 22nm 14nm.

So, tiny dedicated and independent core complexes for the CPU-core part – let's call them CCi for now (*C*ore *C*omplex *I*ndependency) – while bringing in the rest of it on 14nm or even 22nm (given that their 28 nm process stopgap isn't still running yet …). All that as a modular cluster-CPU glued together at actual die level as chiplets.

But seriously, ... that way, *a)* Intel could save their own ass over the time-span while they literally have nothing left (to lose) until they come up with their _a_ hopefully newly revamped architecture, *b)* use and thus save their disastrous 10nm-fiasco (without the obvious need to just write it off [since for anything more than a dual-core that node's yields are evidently out for the count]) and they *c)* even would come down from their insanely expensive monolithic-die approach while saving huge manufacturing and processing costs and thus *d)* increase profits.

Sounds quite like a plan, doesn't it?! Actually like a real _epic_ masterplan I must say! I wonder why no-one else hasn't come up with such brilliancy yet?! … _oh, wait!_

Well, one can dream, can't one? 


Anyway, I'm thrilled!


Smartcom


----------



## WikiFM (Nov 12, 2018)

R0H1T said:


> You mean once on 32nm & twice on 28nm? As for Intel ~ remember* Tick Tock*, which is now Tick Tock Tick Tick Tick
> 
> Lisa Su said that work on Zen2 began 4 years back, sometime after that they would've realized that their vision could only be fulfilled on 7nm. Like wise for Intel they've been working on Icelake for 4~6 years & even if one assumes it could theoretically be backported to 14nm+++ that simply wouldn't work without major compromises to the final design. Just an FYI retail chips are rumored to feature AVX512, which is simply not possible on this node. That* IPC gain* includes a hefty one time benefit from AVX512.



You are right, 32 nm too (Vishera and Piledriver), plus 28 nm (Steamroller and Excavator): 4 iterations.
Skylake-X has AVX512 in 14 nm, so mainstream AVX512 in 14 nm can be possible.


----------



## bug (Nov 12, 2018)

qcmadness said:


> https://www.anandtech.com/show/13544/why-intel-processors-draw-more-power-than-expected-tdp-turbo


I have read that very article. It says the CPU is built to run at 95W by default, but that the TDP can be adjusted to squeeze more juice out of it.
Maybe people have been unaware till now, but this is a trick that has been employed for a while by both Intel and AMD. The only thing that changed is that Intel decided to put fewer numbers on the box. The numbers were apparently easily accessible to the people who wrote that article, so it's not like Intel keeps them secret.


GlacierNine said:


> 1 - Yes, yes I am sure about that.



My apology, I didn't know you had the power to decide what's reasonable and what's not reasonable around here.



GlacierNine said:


> 2 - Intel wants to claim it can be the fastest gaming CPU while being 95W TDP. That's simply not true. It's 95W, or it's fast. One or the other.
> 
> It's not a 95W part at the same time as being the fastest gaming CPU.
> It's not the fastest gaming CPU at 95W.
> ...


I didn't realize you know what Intel wants to claim either.

To me this is extremely simple: people are stupid; put more than one number on the box and they get confused. Intel realized that and decided not to put several TDPs on the box anymore.
For those genuinely curious about the platform and how to properly tweak it, all the info is right here: https://www.intel.com/content/www/u...core/8th-gen-core-family-datasheet-vol-1.html (search for PL2)


----------



## Smartcom5 (Nov 12, 2018)

Isn't Cannon Lake's infamous i3 8121U their first CPU within the mainstream space which features AVX-512 already?

*Edit:* @bug That TDP classification of just 95W is still just deceptive …
Though that was without question the whole intention from the get-go when they started rating it that way, based on base clocks.
It was a (working) approach to make their chips look more energy-efficient while the actual efficiency of those chips didn't really change at all.


Smartcom


----------



## Daven (Nov 12, 2018)

bug said:


> I have read that very article. It says the CPU is built to run at 95W by default, but that TDP can be adjusted to squeeze more juice out of it.
> Maybe people have been unaware till now, but this a trick that has been employed for a while by both Intel and AMD. The only thing that changed is Intel decided to put fewer numbers on the box. The numbers were apparently easily accessible to the people that wrote that article, so it's not like Intel keeps them secret.
> 
> 
> ...


From the Anandtech article:
"Over the last decade, while the use of the term TDP has not changed much, the way that its processors use a power budget has. The recent advent of six-core and eight-core consumer processors going north of 4.0 GHz means that we are seeing processors, with a heavy workload, go beyond that TDP value. In the past, we would see quad-core processors have a rating of 95W but only use 50W, even at full load with turbo applied. As we add on the cores, without changing the TDP on the box, something has to give. "

There has been a change from what was before.


----------



## R0H1T (Nov 12, 2018)

WikiFM said:


> You are right 32 nm too(Vishera and Piledriver), plus 28 nm(Steamroller and Excavator), 4 iterations.
> Skylake-X has AVX512 in 14 nm, so mainstream AVX512 in 14 nm can be possible.


SKL-X is huge; the cheapest variants cost what, 8~10x the cost of the cheapest mainstream chip, not to mention the area dedicated to AVX is also huge.
So no, ICL, if it has AVX512, is not possible on any variant of 14nm.


----------



## bug (Nov 12, 2018)

Mark Little said:


> From the Anandtech article:
> "Over the last decade, while the use of the term TDP has not changed much, the way that its processors use a power budget has. The recent advent of six-core and eight-core consumer processors going north of 4.0 GHz means that we are seeing processors, with a heavy workload, go beyond that TDP value. In the past, we would see quad-core processors have a rating of 95W but only use 50W, even at full load with turbo applied. As we add on the cores, without changing the TDP on the box, something has to give. "
> 
> There has been a change from what was before.


Yes, I believe I have acknowledged that earlier. Depending on which manufacturer had the upper hand in power draw, they were quick to use the absolute power draw as TDP. And boast about how they are using the "right" metric. As soon as they lost that crown, they moved back to TDP meaning average power draw.

Keep in mind cTDP is not new. It has been with us since 2012 (actually introduced by AMD, not Intel), but we've been used to seeing it used the other way around, in laptops.


----------



## bubbleawsome (Nov 12, 2018)

TheGuruStud said:


> It appears to be a waste of materials to make anything less than 8 core to me.


Yes, an 8 core with 2 disabled is what I meant. It'll probably be a great budget chip.


----------



## Gasaraki (Nov 12, 2018)

GlacierNine said:


> Oh please, stop the apologism. The 9900K will work within a 95W power envelope, yes. At 3.6GHz base clock, with occasional jumps to higher speeds where the cooling solution's "thermal capacitance" can be leveraged.
> 
> But these chips and this silicon aren't designed to be 3.6GHz parts in daily use. They are ~4.7GHz parts that Intel reduced the base clocks on, in order to be able to claim a 95W TDP. If you had the choice between running a 7700K and a 9900K at base clocks, the 7700K would actually get you the better gaming performance in most games. Would you say that's Intel's intention? To create a market where a CPU 2 generations old, with half the cores, outperforms their current flagship in exactly the task Intel advertise the 9900K to perform?
> 
> ...



This is just so wrong.


----------



## Valantar (Nov 12, 2018)

GlacierNine said:


> I disagree, for one very simple reason - Tooling up production for 2 different physical products/dies would likely be more expensive than the material savings in not using as much silicon per product. This stuff is not cheap to do, and in CPU manufacture, volume savings are almost always much more dramatic than design/material savings.
> 
> Serving Mainstream, HEDT, and Server customers from a single die integrated into multiple packages, is one of the main reasons AMD are in such good shape right now - Intel has to produce their Mainstream, LCC, HCC, and XCC dies and then bin and disable cores on all 4 of them for each market segment. AMD only has to produce and bin one die, to throw onto a variety of packages at *every level* of their product stack.
> 
> ...


You're not wrong in terms of a single design being far cheaper; the only issue is that AMD has had two different dice for Ryzen since the launch of Raven Ridge. What I'm proposing is nothing more than adapting 2nd-gen APU dice to fit within the broader "MCM with I/O die" paradigm. There's going to be a low-end die for mobile no matter what - the sales volumes and the need for power-draw optimization in that market are too high for them to use disabled 8-core dice for the 15W mobile parts. They didn't for Zen or Zen+, and they won't for Zen 2.

Using disabled 8-core dice for low-end 2c4t 15W mobile parts with near zero margin for $400 laptops will not fly, no matter what. If yields on 7nm are bad enough for this to be a viable solution, that's a significant problem, and if yields are good, they'd need to disable working silicon to sell as <$100 mobile parts with next-to-no margin. In the millions, as that's the numbers those markets operate in. That billion dollars would suddenly become inconsequential as they'd be wasting higher-grade dice to sell as low-end crap for no profit. It doesn't take much of this for a smaller, better suited design to become the cheaper solution.


----------



## Daven (Nov 12, 2018)

bug said:


> Yes, I believe I have acknowledged that earlier. Depending on which manufacturer had the upper hand in power draw, they were quick to use the absolute power draw as TDP. And boast about how they are using the "right" metric. As soon as they lost that crown, they moved back to TDP meaning average power draw.
> 
> Keep in mind cTDP is not new. It has been with us since 2012 (actually introduced by AMD, not Intel), but we've been used to seeing it used the other way around, in laptops.


So I guess what the previous poster and probably others are complaining about is that Intel didn't change the TDP spec on the box. They are finding themselves in a hard spot but pretending that everything is fine, magically doubling the core count while going higher on turbo at the exact same power level using virtually the same process node. However this is not the whole truth and different (key word here) than the past. As Anandtech states:
"So where do we go from here? I'd argue that Intel needs to put two power numbers on the box: 

TDP (Peak) for PL2 
TDP (Sustained) for PL1
This way Intel and other can rationalise a high peak power consumption (mostly), as well as the base frequency response that is guaranteed."
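The PL1/PL2 behaviour the article describes can be sketched as a toy simulation; the time constant and step below are illustrative constants, not Intel's actual firmware values:

```python
# Toy model of the PL1/PL2 scheme (illustrative constants, not Intel's
# firmware): the chip may draw PL2 watts while an exponentially weighted
# moving average of power remains below PL1.
PL1, PL2, TAU, DT = 95.0, 150.0, 8.0, 0.1  # watts, watts, seconds, seconds

def simulate(seconds):
    avg, t, trace = 0.0, 0.0, []
    while t < seconds:
        draw = PL2 if avg < PL1 else PL1    # boost while budget remains
        avg += (draw - avg) * (DT / TAU)    # EWMA decays toward `draw`
        trace.append(draw)
        t += DT
    return trace

trace = simulate(30)
initial_boost = 0.0
for p in trace:                 # length of the first uninterrupted PL2 burst
    if p != PL2:
        break
    initial_boost += DT

print(f"held {PL2:.0f} W for ~{initial_boost:.1f} s before throttling toward {PL1:.0f} W")
```

With these made-up constants the full-PL2 burst lasts roughly the time constant, which is the behaviour reviewers observe as "turbo for a few seconds, then settle".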



Gasaraki said:


> This is just so wrong.


Read the Anandtech article and then come back and see how you think about what's going on with Intel TDP numbers.
https://www.anandtech.com/show/13544/why-intel-processors-draw-more-power-than-expected-tdp-turbo


----------



## bug (Nov 12, 2018)

Mark Little said:


> So I guess what the previous poster and probably others are complaining about is that Intel didn't change the TDP spec on the box. They are finding themselves in a hard spot but pretending that everything is fine, magically doubling the core count while going higher on turbo at the exact same power level using virtually the same process node. However this is not the whole truth and different (key word here) than the past. As Anandtech states:
> "So where do we go from here? I'd argue that Intel needs to put two power numbers on the box:
> 
> TDP (Peak) for PL2
> ...


This is really not complicated. More cores will draw more power; there's no bending the laws of physics. However, if you lower the base clock, you will draw less current (power does not scale linearly with frequency), thus your heat sink will run cooler. When the heat sink starts cooler, it can accommodate higher frequencies for a while, until it heats up.
Again, I see no trickery at work. Just a company finding a way to squeeze more cores onto a production node they were planning to leave behind at least two years ago. Both Nvidia and AMD had to do something similar when TSMC's 20nm node fell through for high-power chips and everybody got stuck with 28nm for a couple more years than originally planned.
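The non-linear scaling is easy to illustrate with the usual dynamic-power approximation P ≈ C·V²·f; the voltage/frequency pairs below are made up for illustration, not measured 9900K values:

```python
# Dynamic power grows as C * V^2 * f, and higher clocks need higher
# voltage -- the voltage/frequency pairs here are hypothetical.
def dynamic_power(c_eff, volts, freq_ghz):
    return c_eff * volts**2 * freq_ghz

base  = dynamic_power(10.0, volts=1.00, freq_ghz=3.6)  # base clock
boost = dynamic_power(10.0, volts=1.30, freq_ghz=4.7)  # all-core boost

# ~31% more clock costs roughly 2.2x the power in this toy model, which
# is why a lower base clock can honestly fit a 95 W base-frequency TDP.
print(f"clock x{4.7 / 3.6:.2f}, power x{boost / base:.2f}")
```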



Smartcom5 said:


> Isn't Cannon Lake's infamous i3 8121U their first CPU within the mainstream space which features AVX-512 already?
> 
> *Edit:* @bug That TDP-classification of just 95W still is just deceptive …
> Though that was without question the whole intention from the get-go when they started rating it that way, based on base-clocks.
> ...


Well, I don't see it as deceptive, because 95W is all a board manufacturer has to support.
But when you start using words like "without question" to make your point, you're kind of preventing us further discussing this. Have a nice day.


----------



## WikiFM (Nov 12, 2018)

R0H1T said:


> SKL-X is huge, the cheapest variants cost what 8~10x the cost of the cheapest mainstream chip, not to mention the area dedicated towards AVX is also huge.
> So no ICL, if it has AVX512, is not possible on any variant of 14nm.





Smartcom5 said:


> Isn't Cannon Lake's infamous i3 8121U their first CPU within the mainstream space which features AVX-512 already?
> 
> Smartcom



So the cheap 8121U has AVX512, and so does the $359 14 nm 7800X, so mainstream AVX512 is a reality that could have been widespread by now.


----------



## looncraz (Nov 12, 2018)

Let me provide some perspective for the 29% IPC gain.

The test used, AFAICT, was a concurrent discrete kernel workload - running on an unknown dataset of an unknown size - and an RSA cryptography workload - unknown version, complexity, optimizations, etc...

The IPC values give us some clue about how this was run.

First, dkern() frequently runs almost entirely within the FPU, and performance is more a factor of branch prediction and getting results from the FPU back into an ALU branch pipeline. It's actually a pretty decent generic test for the front end and FPU - not coincidentally, the only two things AMD really talked about in regards to core improvements.

RSA is a heavy integer and floating point load. It does pow() (exponentiation) and gcd() (greatest common divisor) operations, integer comparisons, type casts, and all manner of operations that usually hammer the ALUs and FPUs in turn (rather than concurrently). It uses different ALUs than dkern() and can mostly benefit from the same types of improvements - as well as from the CPU recognizing, for example, the gcd() code pattern and optimizing it on the fly across multiple ALUs concurrently.

Together, this CPU was being hammered during testing by two workloads that do quite well with instruction level parallelism (ILP) - the magic behind IPC with x86.

We can't read anything more from these results other than Zen 2 is ~30% faster when doing mixed integer and floating point workloads.

However, that particular scenario is actually very common. For games specifically, we should see a large jump: mixed integer, branch, and floating point workloads with significant cross-communication are exactly what the cores see under heavy gaming loads. Intel has won here because they have a unified scheduler, making it easier to get FPU results back to dependent instructions that will execute on an ALU (which might even be the same port on Intel...). It looks like AMD has aimed for superiority on this front.
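As a rough illustration (this is NOT the actual DKERN + RSA microbenchmark, just a toy loop in the same spirit), a workload that keeps independent integer and FP dependency chains in flight gives a wide superscalar core plenty of ILP to exploit:

```python
import math

# Toy mixed workload: two independent accumulator chains, one hammering
# the integer units (gcd, comparisons), one hammering the FPU (root math).
# Because the chains don't depend on each other, a wide core can overlap
# them -- the kind of instruction-level parallelism discussed above.
def mixed_workload(n):
    acc_int, acc_fp = 0, 0.0
    for i in range(2, n):
        acc_int += math.gcd(i, n)         # integer pipeline
        acc_fp += math.sqrt(i % 7 + 1.0)  # FP pipeline, independent chain
    return acc_int, acc_fp

ints, fps = mixed_workload(10_000)
print(ints, round(fps, 2))
```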


----------



## Blueberries (Nov 12, 2018)

Zen 2 won't have anywhere near 29% more IPC, I'd have to be smoking some funny stuff to believe that again.


----------



## looncraz (Nov 12, 2018)

Blueberries said:


> Zen 2 won't have anywhere near 29% more IPC, I'd have to be smoking some funny stuff to believe that again.



On average? Absolutely not - that would be insane.  But for areas where Zen is currently weak?  Quite possible.

Even the scenario AMD showed had to have two concurrent loads to see 29%.  But that's another side of CPU performance many don't think about - loaded performance.

This should translate well into SMT scaling.


----------



## CheapMeat (Nov 13, 2018)

Most of you guys are writing as if you're buying EPYC. I bet 99% of you are NOT. The speculation here is all based on what has been presented for EPYC, especially the way the package is laid out. There's no confirmation on what Ryzen will be exactly or how it'll be packaged. Let's just be honest here about what you are all actually buying and/or willing to spend money on.


----------



## Smartcom5 (Nov 13, 2018)

bug said:


> Well, I don't see it as deceptive, because 95W is all a board manufacturer has to support.
> But when you start using words like "without question" to make your point, you're kind of preventing us further discussing this. Have a nice day.


Well, I thought so …
If you put that CPU on a board whose VRM phases are designed to support a CPU with a power draw of _up to_ said 95W, the 9900K is an overpriced piece of hot garbage.
The pseudo-argument that K-CPUs of this kind wouldn't be used on such boards is irrelevant, as this CPU is _explicitly marketed_ with that 95W figure on purpose, to trick people into believing it actually draws only up to said 95W (which isn't the case at all) – _not at stock_, nor on _any default_ BIOS/UEFI settings boards ship with, et cetera.

Though if you have any greater trouble figuring out whether a marketing campaign for a device advertised with only 95W of power draw – which it significantly overdraws 90% of the time it's active – is deceitful or not, I don't know what to tell you. You don't seem to get the point at all, _either_ on purpose (which pretty much seems to be the case here, given your kind of arguing) _or_ due to a fundamental lack of moral understanding and ethical perception (which actually seems to be the required condition to defend such practices in the first place).

You too may have a nice day!


Smartcom


----------



## Arjai (Nov 13, 2018)

Intel-er's will be Intel-er's.

RyZen has been amazing since day one. Unlike Intel, it HAS been getting better (kind of like how the Vega cards have incrementally moved up the ranks). AMD, if anyone remembers, did kick Intel's butt once. It has been a while, but disbelieving it could happen again? Simply childish. 

I don't know the chain of command in Intel but, AMD? Lisa. Can anyone provide proof, that she is a liar?

I am going to buy stock in AMD, 2 days from now. See you when I am rich-er!


----------



## looncraz (Nov 13, 2018)

CheapMeat said:


> Most of you guys are writing as if you're buying EPYC.  I bet 99% of you are NOT.  The speculation here is all what has been presented for EPYC, especially the way the package is. There's no confirmation on what Ryzen will be exactly and how it'll be packaged. Lets just be honest here about what you are all actually buying and/or willing to spend money on.



I doubt many of us will buy EPYC 2, but that's why we're talking about the core itself - it's independent of what SKU we're talking about.

Ryzen 3000 could come with nothing smaller than a six-core CPU (I hope AMD does this - Intel could barely adapt last time, when quad core became the new low-end mainstream CPU).


AMD gets 800~900 usable chiplets per 7nm wafer. Assuming a $12k wafer cost (which is close) and a relatively high defect rate of 0.3/cm², that's just ~$14 per chiplet. Two of those and a $10 IO die and you have a bill of materials quite similar to Ryzen's at its original launch... except now AMD has 16 cores on the mainstream desktop and can happily ask $600 for it. And I'd pay it.
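That arithmetic can be reproduced with the standard dies-per-wafer approximation and a simple Poisson yield model; the wafer price, die size, and defect density are the thread's assumptions, not published AMD figures:

```python
import math

# Back-of-the-envelope chiplet cost. Wafer price ($12k), die size (73 mm^2),
# and defect density (0.3/cm^2) are the thread's assumptions.
def dies_per_wafer(wafer_mm=300.0, die_area_mm2=73.0, edge_loss_mm=5.0):
    d = wafer_mm - 2 * edge_loss_mm
    # Classic approximation: gross area dies minus the partials on the rim.
    return int(math.pi * (d / 2) ** 2 / die_area_mm2
               - math.pi * d / math.sqrt(2 * die_area_mm2))

def poisson_yield(d0_per_cm2, die_area_mm2=73.0):
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-d0_per_cm2 * die_area_mm2 / 100.0)

gross = dies_per_wafer()                # candidate dies per 300 mm wafer
good = int(gross * poisson_yield(0.3))  # perfect dies at D0 = 0.3/cm^2
print(f"{gross} gross, ~{good} perfect, ~${12_000 / good:.0f} per perfect die")
```

Under these assumptions the cost lands nearer $18 per *perfect* die; since partially defective dies can be salvaged as cut-down SKUs, the effective cost per *usable* chiplet drops toward the ~$14 quoted above.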


----------



## HTC (Nov 13, 2018)

looncraz said:


> I doubt many of us will buy EPYC 2, but that's why we're talking about the core itself - it's independent of what SKU we're talking about.
> 
> Ryzen 3000 could come with nothing smaller than a six core CPU (I hope AMD does this - Intel could barely adapt last time when quad core became the new low end mainstream CPU).
> 
> ...



You're not taking into account the fact that the 7 nm process isn't mature yet, dude. Also, isn't the wafer size supposed to be 300 mm?

According to this (screenshot from the die-per-wafer calculator; image not preserved):
I chose a *defect density of 0.4 because it's a new process*, and with these parameters AMD would get 612 good dies. Note the die dimensions are rounded down (from 73 mm²) to make the calculations a tad easier.

Here's the link for the Die Per Wafer Calculator: https://caly-technologies.com/die-yield-calculator/


----------



## R0H1T (Nov 13, 2018)

We don't know the defect rate, so any speculation about it is pointless. What's more relevant is the price these chips command in the server space and high-end retail. The cost of the dies themselves is rather low, so that's not much of a problem anyway.


----------



## looncraz (Nov 13, 2018)

HTC said:


> You're not taking into account the fact 7 nm process isn't mature yet, dude. Also, isn't the wafer size supposed to be 300 mm?
> 
> According to this:
> 
> ...



Yes, 300mm wafer, 0.12mm scribe, 5mm edge loss.  The absence of a DFZ parameter (the step-back distance from a cut edge, to avoid defects in the circuitry) is unfortunate.

Also, when these calculators say "good" dies, they mean "perfect" dies.  A defective die may only have one bad core or some other minor, recoverable fault, so "max dies" is the upper bound; your numbers become 612~812 even using a larger, square die.  Shape plays a relatively minor role at this size, but it matters more as the die grows, so keep in mind that the chiplet's height is about 1.4× its width.

0.4 would be rather bad.  14nm had 0.08 at launch, I seriously doubt 7nm HPC has anything notably higher than 0.3 - and probably closer to 0.2 or even below.  I based my numbers on a range from 0.2 to 0.3 (which I probably should have stated in my comment).

6.75 × 9.45 mm = 63.79 mm² = 731~938 usable dies with a 0.4 defect density and 0.12 mm scribe lanes (horizontal and vertical).

Move to 0.3 defects/cm^2 and you have 827~938 usable dies.



R0H1T said:


> We don't know the defect rate so any speculation wrt same is pointless, what's more relevant though is the price these chips command in the server space or high end retail. The price of dies is rather low, so that's not much of a problem anyway.



Well, a $20 chiplet would dictate that Ryzen can only really afford to have one chiplet without having two designs - one with one chiplet and another with two.

$20 represents ~600 usable dies per wafer.  A silicon bill of materials in excess of $50 for Ryzen's core would be a big jump... and a big risk to bring to the mainstream market.

At $12 per chiplet - a price that will decline with time - AMD can toss two chiplets onto every Ryzen CPU, with a BoM pretty close to Ryzen's original costs.


----------



## R0H1T (Nov 13, 2018)

looncraz said:


> Well, a $20 chiplet would dictate that Ryzen can only really afford to have one chiplet without having two designs - one with one chiplet and another with two.
> 
> $20 represents ~600 usable dies per wafer.  A silicon bill of materials in excess of $50 for Ryzen's core would be a big jump... and a big risk to bring to the mainstream market.
> 
> At $12 per chiplet - a price that will decline with time - AMD can toss two chiplets onto every Ryzen CPU, with a BoM pretty close to Ryzen's original costs.


The headline number, for me, is the usable dies ~ which translates directly into lower or higher cost per die depending on the defect rate. The number of usable dies really matters because it helps AMD keep up with their scheduled timelines, launch dates, demand, and commitments, especially towards large customers. It is said (by some) that AMD sold every chip they could produce in the Intel OEM-bribing era; I can't say how true that is, but AMD absolutely needs to fulfill the demand for Rome right now and compete for the growing needs of the enterprise sector. Which is to say the cost of dies is secondary (though, again, directly related to the defect rate); meeting obligations should be the primary goal, especially towards the _*Super *_7 plus one.


----------



## HTC (Nov 13, 2018)

looncraz said:


> *0.4 would be rather bad.  14nm had 0.08 at launch*, I seriously doubt 7nm HPC has anything notably higher than 0.3 - and probably closer to 0.2 or even below.  I based my numbers on a range from 0.2 to 0.3 (which I probably should have stated in my comment).
> 
> *6.75x9.45mm = 63.79mm^2* = 731~938 usable dies with a 0.4 defect density and *0.12 scribe h+V lanes*.
> 
> Move to 0.3 defects/cm^2 and you have 827~938 usable dies.



Are you referring to the 1st chip on 14 nm or to the 1st Zen chip on 14 nm? *IIRC*, when Zen was introduced there were already several chips being manufactured @ 14 nm, meaning the process was much more mature than 7 nm is now, where Zen 2 is only the 2nd chip (the 1st being Apple's A12).

What is your source for the Zen 2 chiplet size? From what I've read, the Zen 2 chiplet measures roughly 73 mm², while yours is almost 10 mm² smaller.

For reference, I got those measurements from this post @ the Anandtech forums.

When I made the pic in my previous reply, I was under the impression the *chiplet size was 72 mm²* and that *the chiplet was a square instead of a rectangle*.

According to the die calculator page, those scribe values are invalid: either 0.1 or 0.15 but not 0.12.

Based on the current information, and with a defect density of 0.25, we get this (7.3 is also an invalid number for the width, so I improvised):


----------



## R0H1T (Nov 13, 2018)

HTC said:


> Are you referring to the 1st chip on 14 nm or to the 1st Zen chip on 14 nm? IIRC, when Zen was introduced, there were several chips being manufactured @ 14 nm, meaning the process was much more mature then 7 nm, where it is the 2nd chip (1st is Apple's A12 chip).
> 
> What is your source of Zen 2 CCX chiplet size? From what i've read, Zen 2 CCX chiplet measurement is roughly 73 mm² while yours is almost 10 mm² smaller.
> 
> ...


No Zen was the first high performance chip using GF 14nm, you could count Polaris but that's not exactly apples to apples.


----------



## HTC (Nov 13, 2018)

R0H1T said:


> No Zen was the first high performance chip using GF 14nm, you could count Polaris but that's not exactly apples to apples.



Only Polaris? I thought there were others: I'm probably misremembering.

Correct me if I'm wrong, but a process is independent of the chips made on it, no? If so, then the experience from the Polaris chips helped with Zen by maturing the process, thus lowering the defect density, no?


----------



## GoldenX (Nov 13, 2018)

Blueberries said:


> Zen 2 won't have anywhere near 29% more IPC, I'd have to be smoking some funny stuff to believe that again.


Nobody believed that Zen1 was 59% better than FX.
I'm happy with a 15% increase plus clock speed increase.


----------



## Legacy-ZA (Nov 13, 2018)

If AMD keeps their prices very competitive like they are now, I will be upgrading to Zen 2.


----------



## Valantar (Nov 13, 2018)

WikiFM said:


> So a cheap 8121U has AVX512 and also the $359 7800X 14 nm, so mainstream AVX512 is a reality that could have been widely spread by now.


The 8121U has its AVX512 units (as well as pretty much everything else) disabled.


CheapMeat said:


> Most of you guys are writing as if you're buying EPYC.  I bet 99% of you are NOT.  The speculation here is all what has been presented for EPYC, especially the way the package is. There's no confirmation on what Ryzen will be exactly and how it'll be packaged. Lets just be honest here about what you are all actually buying and/or willing to spend money on.


None of us will be buying EPYC. That's why we're taking what they've said about it and attempting to extrapolate what this means for Ryzen 3000 and TR3. Also, it's interesting to discuss when someone makes some actual innovations in this space, even if we're not in the target market.


Legacy-ZA said:


> If AMD keeps their prices very competitive like they are now, I will be upgrading to Zen 2.


I'd actually consider the same, even if I'm very happy with my 1600X.


----------



## Aquinus (Nov 13, 2018)

WikiFM said:


> So what gives better yields then? Smaller dies at 7nm or a huge one at 14nm? Yes the I/O die is done in GloFo's 14 nm.





nemesis.ie said:


> @Aquinus It was confirmed at the NH event that the I/O chip is on 14nm.
> 
> My guess is that it could be from GF which keeps GF in the game.



Both actually. Smaller 7nm dies for the CCXs will help yields on a less mature process, because smaller dies almost always translate to more usable dies. The larger I/O chip benefits from the maturity of the 14nm process, which is likely to have better yields for larger dies, and that keeps costs down. This is also the reason Intel's PCHs are on a larger process than the node the CPUs are made on. It's really all about costs and yields, because some components don't need to be on the smaller process.


Assimilator said:


> TBH I wouldn't call the 9900K "mainstream" due to its heat, price and availability. It's pretty clearly showing the limit of the Core uarch on 14nm, and I suspect that its successor will only show up once 10nm is fixed.


Sure but, it's still on a mainstream platform so I consider it mainstream even if it's the highest end of the MSDT market.


----------



## Daven (Nov 13, 2018)

bug said:


> This is really not complicated. More cores will draw more power, there's no bending the laws of physics. However, if you lower the base clock, you will draw less current (power does not scale linear with frequency), thus your heat sink will be cooler. When the heat sink starts cooler, it can accommodate higher frequencies for a while, until it heats up.
> Again, I see no trickery at work. Just a company finding a way to squeeze more cores on a production node they were planning to leave behind at least two years ago. Both Nvidia and AMD had to do something similar when TSMC failed with their 22nm node and everybody got stuck with 28nm for a couple more years than originally planned.
> 
> 
> ...


Anandtech does a great job with their Bench tool on their website. It helps with conversations like this one. Here are the Full package load power measurements for the last four Intel generations:
i7-6700K 82.55W
i7-7700K 95.14W
i7-8700K 150.91W
i9-9900K 168.48W
There is a major change between the 7th and 8th generations, yet Intel rates the first two at 91W and the latter two at 95W. You don't see a problem with this?
Source: https://www.anandtech.com/bench/CPU-2019/2194

EDIT: And if you look at all the CPUS at Full package load at that link, you will see almost all fall below or within +10% of the rated TDP across desktop, HEDT and server chips from both AMD and Intel. Only the 8700K and the 9900K are way off. This is deceptive advertising at its worst to try and look competitive and cover up being stuck on the same process node.
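Putting the quoted AnandTech measurements next to Intel's published TDP ratings makes the point plain (the 91W/95W ratings are Intel's spec-sheet values):

```python
# Percent overdraw of measured full-package load power (AnandTech figures
# quoted above) versus each part's rated TDP.
rated = {"i7-6700K": 91, "i7-7700K": 91, "i7-8700K": 95, "i9-9900K": 95}
measured = {"i7-6700K": 82.55, "i7-7700K": 95.14,
            "i7-8700K": 150.91, "i9-9900K": 168.48}

for cpu, watts in measured.items():
    over = (watts / rated[cpu] - 1) * 100
    print(f"{cpu}: {watts:6.2f} W measured, {over:+6.1f}% vs rated TDP")
```

The first two land within about ±10% of spec; the 8700K and 9900K overshoot by roughly 59% and 77% respectively.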


----------



## Valantar (Nov 13, 2018)

Mark Little said:


> Anandtech does a great job with their Bench tool on their website. It helps with conversations like this one. Here are the Full package load power measurements for the last four Intel generations:
> i7-6700K 82.55W
> i7-7700K 95.14W
> i7-8700K 150.91W
> ...


I'm not who you're talking to, but I agree with the conclusion of AT's recent look into this: they should start having two numbers, one "Base TDP" and one "all-core boost TDP". That'd clear up everything quite nicely. Base TDP would indicate minimum performance specs and power delivery requirements, and all-core boost TDP would indicate what your cooler and motherboard need to match to provide the best possible out-of-box experience.


----------



## Daven (Nov 13, 2018)

Valantar said:


> I'm not who you're talking to, but I agree with the conclusion of AT's recent look into this: they should start having two numbers, one "Base TDP" and one "all-core boost TDP". That'd clear up everything quite nicely. Base TDP would indicate minimum performance specs and power delivery requirements, and all-core boost TDP would indicate what your cooler and motherboard need to match to provide the best possible out-of-box experience.


At the top of each quoted section, the forum software lists the person you are replying to. In my case, I was replying to bug. Sorry for any confusion.


----------



## Valantar (Nov 13, 2018)

Mark Little said:


> At the top of each quoted section the thread coding lists the person you are replying too. In my case, I was replying to bug. Sorry for any confusion.


No confusion, I just chose to reply since I more or less agree with bug's stance.


----------



## londiste (Nov 13, 2018)

Regarding 9900K, GamerNexus's story is probably relevant here:
https://www.gamersnexus.net/guides/3389-intel-tdp-investigation-9900k-violating-turbo-duration-z390



Mark Little said:


> Anandtech does a great job with their Bench tool on their website.
> Source: https://www.anandtech.com/bench/CPU-2019/2194


About that doing a great job - What in the world is the full load, exactly?
Edit: after clicking through some of the linked reviews, Anandtech seems to be using POV-Ray.


----------



## bug (Nov 13, 2018)

Valantar said:


> I'm not who you're talking to, but I agree with the conclusion of AT's recent look into this: they should start having two numbers, one "Base TDP" and one "all-core boost TDP". That'd clear up everything quite nicely. Base TDP would indicate minimum performance specs and power delivery requirements, and all-core boost TDP would indicate what your cooler and motherboard need to match to provide the best possible out-of-box experience.


I wouldn't mind having two numbers on the box (though as I have written above, it will certainly confuse less informed users), but which numbers would Intel use? Because only the base TDP is mandatory, the other one is left to the motherboard vendor's will.


----------



## Valantar (Nov 13, 2018)

bug said:


> I wouldn't mind having two numbers on the box (though as I have written above, it will certainly confuse less informed users), but which numbers would Intel use? Because only the base TDP is mandatory, the other one is left to the motherboard vendor's will.


Which is exactly why you'd call one "base" (as in "base clocks": minimum in-spec performance) and the other something else. This might be a bit confusing, but no more than people buying a chip with a shitty cooler and a cheap motherboard, expecting performance to match a review, yet getting 10-20% less. Which happens quite a lot.

By making the second number official (determined by, say, the average all-core-boost power draw of the bottom 10% of chips in a specific SKU under a punishing load), Intel could make implementation uniform across motherboard vendors with a simple "TDP" BIOS option: "95W Base" restricted to stock behavior (with short-term PL2 above it), "Performance" for slightly loosened but reasonable limits, and "Unrestricted" for balls-to-the-wall. Mainly, the second number would serve as a guideline for buying a cooler and motherboard, and it could lead to motherboard makers labeling their VRM solutions with actually useful numbers instead of "X-phase". Ultimately this could reduce confusion, since it explains something complex to users instead of trying to hush it up. Intel already allows adjusting all of this in XTU (although a lot of motherboards ignore XTU power limit settings), so why not implement it across the board? Standardisation and enforcement of standards is a boon to users, not the opposite.
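A minimal sketch of how such a BIOS preset scheme might pick a power limit; the preset names and wattages here are purely illustrative, not any vendor's actual firmware values:

```python
# Hypothetical "TDP preset" table: PL1 is the sustained cap, PL2 the
# short-term boost cap, tau_s the turbo-duration budget in seconds.
PRESETS = {
    "base":         {"pl1": 95,   "pl2": 119,  "tau_s": 28},  # stock-ish
    "performance":  {"pl1": 150,  "pl2": 210,  "tau_s": 56},  # loosened
    "unrestricted": {"pl1": 4096, "pl2": 4096, "tau_s": 0},   # effectively none
}

def power_limit(preset: str, sustained: bool) -> int:
    """Watt cap to enforce: PL1 once the turbo budget is spent
    (sustained load), PL2 otherwise."""
    p = PRESETS[preset]
    return p["pl1"] if sustained else p["pl2"]
```

The point is only that one named knob could map to a vendor-uniform pair of limits, rather than every board shipping its own undisclosed interpretation.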


----------



## bug (Nov 13, 2018)

Valantar said:


> Which is exactly why you call one "base" (as in "base clocks", minimum in-spec performance) and one something else. This might be a bit confusing, but no more than people buying a chip with a shitty cooler and cheap motherboard, expecting matching performance from a review, yet getting 10-20% less. Which happens quite a lot.
> 
> By making the second number official (determined by, say, the average all-core-boost power draw of the bottom 10% of chips in a specific SKU under a punishing load) Intel could make implementation uniform across motherboard vendors, with a simple "TDP" BIOS option, ("95W Base" for  restricted to stock (with short-term PL2 above this) "Performance" for slightly loosened but reasonable limits, and "Unrestricted" for balls-to-the-wall?). Mainly, the second number would serve as a guideline for buying a cooler and motherboard, and it could lead to motherboard makers labeling their VRM solutions with actual useful numbers instead of "X-phase". Ultimately this could lead to less confusion, as it actually serves to explain something complex to users instead of just trying to hush it up. Intel already allows for adjusting all of this in XTU (although a lot of motherboards ignore XTU power limit settings) so why not implement it across the board? Standardisation and enforcement of standards is a boon to users, not the opposite.


Oh, gee, that's so simple to explain. Try saying that to the average buyer, see how it fares


----------



## looncraz (Nov 13, 2018)

HTC said:


> Are you referring to the 1st chip on 14 nm or to the 1st Zen chip on 14 nm? *IIRC*, when Zen was introduced, there were several chips being manufactured @ 14 nm, meaning the process was much more mature then 7 nm, where it is the 2nd chip (1st is Apple's A12 chip).
> 
> What is your source of Zen 2 CCX chiplet size? From what i've read, Zen 2 CCX chiplet measurement is roughly 73 mm² while yours is almost 10 mm² smaller.
> 
> ...




The range is from ~64mm² to ~72mm², which is why I also included a good range in my estimates, but I focused on the smaller size since I was using it to estimate the IO die size, to see how much room was left over after everything we know has to be there was accounted for (a healthy 120mm² of die space with an unknown purpose...).

I'll redo the measurements using a better image of Rome.

The 4094 package is 58.5 × 75.4mm.  You can fit 10.5~10.75 chiplet widths across the long side, giving a range of 7.01~7.18mm for the width, and about 5.75~6.0 chiplet heights along the short side, giving a range of 9.75~10.17mm.  The range is needed to correct for perspective (minor), pixelation (minor), and lack of detail at the edges (moderate).

...But this isn't the true die size (despite being the cut chip size) as far as the calculators are concerned....

Each die is surrounded by the cut edge, so each side potentially carries 0.05~0.15mm of extra material that the die calculator removes, since that's as good a method as any. (The fact that some of the material ends up part of the cut chip doesn't concern the calculator - it knows that edge can't be used for anything.) This is usually immaterial for these calculations, but these dies are small enough that it suddenly matters: it's 0.1~0.3mm of extra width and height that should be subtracted before entering the numbers into the calculator (or you can set the scribe size to zero, I suppose).

That gives a chiplet die size (as far as the calculator is concerned) of 6.71~7.08mm for width and 9.45~10.07 for height.  Which is 63.4 ~ 71.3mm^2, which pretty much everyone rounds to 64~72mm^2 since there's so much room for error.

At the smallest size, with a defect density of 0.3/cm^2 (more on that later), there are 772 perfect dies and 931 total candidates per wafer, with 82.9% yield.
At the largest size, with the same defect density rate, there are 669 perfect dies and 825 total candidates per wafer, with 81.1% yield.

Since there are 8 cores per chiplet, likely 16MiB of L3 taking up a good chunk of the die space, and so on, AMD will likely be able to use 95%+ of all chiplets made.  If half the cores or L3 is damaged, they can likely still salvage the die.  AMD achieved nearly perfect effective yields with 14nm right from the start because of their harvesting - I wouldn't expect them to change when moving to a much more expensive process... especially when pretty much betting the company's future on its success.

At 95% effective yield, the range is 783~884 chiplets per wafer.  A very minor adjustment to my original estimated range of 800~900.
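For what it's worth, the 82.9%/81.1% yields quoted above are reproducible with Murphy's yield model, which (an assumption on my part) the die-yield calculators discussed here appear to use:

```python
from math import exp

def murphy_yield(defect_density_per_cm2: float, die_area_mm2: float) -> float:
    """Murphy's yield model: Y = ((1 - e^(-D*A)) / (D*A))^2,
    with D in defects/cm^2 and A converted from mm^2 to cm^2."""
    da = defect_density_per_cm2 * die_area_mm2 / 100.0
    return ((1 - exp(-da)) / da) ** 2

# Reproduces the figures quoted above for a 0.3/cm^2 defect density at the
# small (63.4 mm^2) and large (71.3 mm^2) chiplet-size estimates.
print(f"{murphy_yield(0.3, 63.4):.1%}")
print(f"{murphy_yield(0.3, 71.3):.1%}")
```

That the model lands within a tenth of a percent of both quoted yields suggests the calculator is indeed Murphy-based rather than a simple Poisson model.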

--

Regarding defect density. 14nm LPE, this early in its life, had a defect density of less than 0.2/cm^2.  By the time production began, it was under 0.1/cm^2.  It was 0.08/cm^2 on Ryzen's launch and is now believed to be slightly lower.  No reason why TSMC can't manage the same with their 7nm processes.

A process will never make it to production with less than a 60% yield... unless they have some very high margin products for it...  IBM can charge so much that they can throw away 60% of a wafer.  AMD can't do that - they need 80%+ yields given the high price of 7nm.  And that includes Vega 20's yields - which is a much larger die on the same process, which gives us a hint about how low TSMC's 7nm defect rate probably is (0.2 or under would not be surprising - 0.1 would be exceptional at this point).  AMD's confidence in the process is telling.



bug said:


> Oh, gee, that's so simple to explain. Try saying that to the average buyer, see how it fares



AMD calls it cTDP.

You usually have 35W, 45W, 65W, and Unlimited options depending on the processor and motherboard.  Most boards now don't restrict the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (quite a lot for a 65W APU).  AMD specifies that boards should implement the power limiting, but few do, because AMD doesn't do a good job balancing power draw between the GPU and the CCX, causing issues in games.


----------



## efikkan (Nov 13, 2018)

looncraz said:


> Together, this CPU was being hammered during testing by two workloads that do quite well with instruction level parallelism (ILP) - the magic behind IPC with x86.
> 
> We can't read anything more from these results other than Zen 2 is ~30% faster when doing mixed integer and floating point workloads.


Doing one benchmark which is highly superscalar doesn't give us IPC, that gives us a best case performance for one type of workload.
Not to mention that the clock speeds are unknown. They have to be completely fixed to benchmark anything close to IPC.



looncraz said:


> However, that particular scenario is actually very common.  For games, specifically, we should see a large jump - mixed integer, branch, and floating point work loads with significant cross communication is exactly what the cores see in heavy gaming loads - Intel has won here because they have a unified scheduler, making it easier to get FPU results back to dependent instructions which will execute on an ALU (which might even be the same port on Intel...), it looks like AMD has aimed for superiority on this front.


Games are in fact among the least superscalar workloads, with the most branch and cache mispredictions. The reason AMD scales well in certain superscalar workloads (like Blender and certain encoding and compression tasks) is that Zen has more "brute force" through the ALUs/FPUs on its execution ports, but it falls short in gaming due to a weaker front-end/prefetcher. Intel has fewer execution ports but achieves higher efficiency through better prediction and caching.


----------



## bug (Nov 13, 2018)

looncraz said:


> AMD calls it cTDP.
> 
> You usually have 35W, 45W, 65W, and Unlimited options depending on the process and motherboard.  Most boards are now not restricting the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (a good amount for a 65W APU).  AMD has specified that the boards should implement the power limiting, but few do because AMD doesn't really do a good job balancing between the power draw of the GPU and CCX, causing issues in games.



Yes, I've mentioned that briefly before. The thing is, any motherboard that can run the chip at full TDP can be restricted to cTDP. But that doesn't work the other way around. I don't expect all (and especially the cheaper) motherboards that can run CPUs at 95W to be able to also push in excess of 150W even momentarily. Imagine the outrage that would stem from Intel printing 150W on the CPU box when users find out they can't hit that with every motherboard. The alternative would be to mandate 150W+ support on all motherboards and drive prices up for everyone.

Bottom line, I just don't see a problem here. Clearly everyone into tech can figure out how to run these CPUs. And those who can't aren't probably spending the money on these. All I see here is a (typical by now) "omg! Intel did X, they're screwing end users!!!" reaction. When CPUs get _this_ complex, specs get _this_ complex too, that's all there is to it.


----------



## londiste (Nov 13, 2018)

What makes this hard for Intel and confusing for us is AVX (well, technically AVX2 and all the 256-bit stuff). That adds about 40% power on top of what the CPU uses without it.
From what I can see from reviews, without AVX 9900K actually does consume close enough to the rated 95W.

That is also something we can look forward to analyzing in Zen2.



looncraz said:


> AMD calls it cTDP.
> 
> You usually have 35W, 45W, 65W, and Unlimited options depending on the process and motherboard.  Most boards are now not restricting the 2400G, for example, so it pretty much always runs at 3.9GHz with full graphics power available... and pulls 100W (a good amount for a 65W APU).  AMD has specified that the boards should implement the power limiting, but few do because AMD doesn't really do a good job balancing between the power draw of the GPU and CCX, causing issues in games.


You would think it would be that simple and straightforward, especially with the perfect power management (at least at the package level) on Ryzens. I have a 2400G on that same Gigabyte board most reviewers got. It used to have a cTDP option in the BIOS, but that went missing after a BIOS update. Similarly, the damn board had an MCE-like setting that indeed used to run my poor 2400G at 95-100W.


----------



## looncraz (Nov 13, 2018)

efikkan said:


> Doing one benchmark which is highly superscalar doesn't give us IPC, that gives us a best case performance for one type of workload.
> Not to mention that the clock speeds are unknown. They have to be completely fixed to benchmark anything close to IPC.



AMD gave actual IPC values (they have perfcounters, so they know exactly the IPC).  In a single workload, it's tough to get high IPC without SIMD.  Still, the type of workloads were designed to exploit integer to FPU communications and the ability to keep them fed.




efikkan said:


> Games are in fact one of the workloads that is the least superscalar and have the most branch and cache mispredictions. The reason why AMD scale well for certain superscalar workloads (like Blender and certain encoding and compression tasks) is that Zen have more "brute force" through ALUs/FPUs on execution ports, but fall short in gaming due a weaker front-end/prefetcher. Intel have few execution ports but achieve higher efficiency through better prediction and caching.



Superscalar isn't what I was talking about regarding games (my comment probably did make that a bit confusing, though) - it's the ability of the FPU to get results back to the ALUs that games need most.  AMD showed they can do that at least 29% better.  In all likelihood, it's probably exactly 33% higher and the 29% is a result of an ALU bottleneck (adds, subtractions, movs, etc. have likely not improved - but you can't really get those to be any better, anyway).

Ryzen uses the FPU as a coprocessor, so there's a good chunk of delay from when instructions are decoded and when they are executed... and dependent integer or memory operations are processed in parallel as far as possible. This isn't strictly superscalar simply because the execution steps become disjointed after classification as floating point, memory, or integer, and the front end generates multiple instruction streams from a single instruction stream.


----------



## efikkan (Nov 13, 2018)

looncraz said:


> AMD gave actual IPC values (they have perfcounters, so they know exactly the IPC).


They gave results from a cherry-picked benchmark, which does not equate to IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims; we should hold AMD to the same standard.



looncraz said:


> it's the ability of the FPU to get results back to the ALUs that games need most.  AMD showed they can do that at least 29% better.  In all likelihood, it's probably exactly 33% higher and the 29% is a result of an ALU bottleneck (adds, subtractions, movs, etc. have likely not improved - but you can't really get those to be any better, anyway).


Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (percentage of total instructions), and is mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking), most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.
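The opportunity-cost arithmetic behind that claim, in rough form (the ~250-cycle miss latency is this post's figure; the 4-wide issue width is an illustrative assumption, not a specific core's spec):

```python
# How many simple-op issue slots one DRAM-bound cache miss "costs"
# on a wide out-of-order core, to first order.
miss_latency_cycles = 250  # rough cost of a miss to DRAM at ~4 GHz (per the post)
issue_width = 4            # simple ops retirable per cycle (illustrative)

wasted_op_slots = miss_latency_cycles * issue_width
print(f"one miss ≈ {wasted_op_slots} forgone simple-op slots")
```

By this crude measure a single miss is "worth" on the order of a thousand ALU operations, which is why pointer-chasing, call-heavy game code leaves so much of the machine idle.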


----------



## Valantar (Nov 13, 2018)

bug said:


> Oh, gee, that's so simple to explain. Try saying that to the average buyer, see how it fares


"This number," *points at label showing 95W/3.6GHz* "is baseline performance, what you get with a cheap mobo and cooler. This other number," *points to where it says 180W/4.7GHz* "is what you can reach if you invest in better cooling and a more solidly built motherboard. It's around 30% faster on paper, though YMMV."

That wasn't so hard, now was it?


----------



## R0H1T (Nov 13, 2018)

efikkan said:


> *They gave results from a cherry-picked benchmark*, that does not *equivalate IPC*. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims, *we should have the same standard for AMD*.
> 
> 
> Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (percentage of total instructions), and is mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking), most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.


That's because it's nearly impossible to replicate test conditions from one test bench to another, one chip to another. AMD can't lie to their investors; their claims are true even if cherry-picked. Yes ~ let's assume best case, or best of three runs, but what you seem to be doing here is arguing that because Intel/Nvidia (AMD?) lied in the past, this must also be a lie.


----------



## looncraz (Nov 13, 2018)

efikkan said:


> They gave results from a cherry-picked benchmark, that does not equivalate IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims, we should have the same standard for AMD.



No doubt it was cherry-picked - there's a reason those tests specifically leveraged the hardware AMD mentioned as being improved.  This is the peak improvement (outside of 256-bit floating point, which should roughly double in performance) we should expect.  It's the upper bound... but it's still a valid test, because we can glean several things from it.

dkern can be extremely useful for testing the front end... but you need to have a rather large vector upon which to operate for that to happen... and we just don't know what AMD was doing with dkern since it's just a function that operates on data.  However, dkern has branches, does integer or floating point comparisons, decrements, subtraction, division, and operates on potentially large amounts of data (usually image or scientific data, being a statistical smoothing method).
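We don't know what AMD's actual dkern code does, but a statistical smoothing kernel of this kind is essentially a weighted neighborhood average. A toy 1-D sketch of the operation mix described (branches, compares, multiplies, divides in a loop):

```python
def dkern_smooth(data, kernel):
    """Toy 1-D statistical smoothing: each output point is a weighted
    average of its neighborhood. The inner loop mixes branches (boundary
    checks), multiplies, adds and a divide -- roughly the kind of work
    described above, though AMD's real benchmark is unknown."""
    half = len(kernel) // 2
    out = []
    for i in range(len(data)):
        acc, wsum = 0.0, 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(data):   # boundary check -> branchy inner loop
                acc += data[j] * w
                wsum += w
        out.append(acc / wsum)
    return out

# A spike at the center gets spread into its neighbors.
print(dkern_smooth([1.0, 2.0, 9.0, 2.0, 1.0], [1.0, 2.0, 1.0]))
```

On a large input array this is exactly the kind of tight, math-dense loop that keeps a front-end and FPU busy.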

RSA in this situation could be used to decrypt or encrypt the data being accessed or could be an entirely other program being used... or AMD ran two benchmarks and averaged the results... they were very unclear.

RSA has a few tight loops, multiplication, division, comparisons and branches within loops, and potentially significant bandwidth utilization (simply jumping to L2 counts as significant in this context).



efikkan said:


> Games are in no way bottlenecked by ALU or FPU throughput. Rendering threads in games have very little computational load (percentage of total instructions), and is mainly bottlenecked by cache misses. To even come close to saturating ALUs and FPUs, like the benchmark referred to in this thread, you need the majority of the code to be a tight loop of nearly pure math operations. On modern ~4 GHz CPUs the cost of a cache miss is ~250 cycles or more, and with all the execution ports you should be able to imagine how much potential is actually wasted by a single cache miss, or how many simple ALU/FPU operations one cache miss is "worth". Game rendering code is usually a huge amount of function calls and a small amount of calculations (relatively speaking), most of these function calls will cause at least one cache miss, which is why this type of code usually leaves the CPU stalled >95% of clock cycles. A small improvement in prediction and/or OoOE can give decent benefits without adding any additional computational resources.



They are usually bottlenecked by ALU->FPU communication or, as you say, cache miss penalties.  Ryzen's sequential cache performance is very good, but it falls flat with random accesses, so that definitely plays a significant role - and hopefully it's something AMD has resolved with Zen 2.  A 33% improvement in ALU->FPU throughput means a ~10% per-cycle improvement for many CPU-bottlenecked games.  That puts them roughly on par with Intel for those games.  Others that are thrashing the cache (which would mean a bad game engine - of which there are plenty; I'm looking at you, Hitman!) won't care at all about that improvement (or very little... or will even "dislike" it).  Here, of course, Zen 2 will need to have reduced semi-random access latencies.  It doesn't help that each core advertises access to 8MiB of L3 but only seems to search 4MiB worth of tags before jumping to the IMC.  An L4 would help here - we wouldn't be hitting memory latencies for in-page random access, at the very least, but would be much closer to Intel's ~20ns figures.


----------



## Valantar (Nov 13, 2018)

efikkan said:


> They gave results from a cherry-picked benchmark, that does not equivalate IPC. This is all marketing BS, and we should know better than that. People are usually very (and rightfully) sceptical of Intel's and Nvidia's cherry-picked performance claims, we should have the same standard for AMD.


Calling something solely mentioned in an endnote and not used as a marketing point whatsoever "marketing BS" is... a stretch. Yes, we should hold AMD to the same standard as Nvidia and Intel, but this isn't even close to the crap they've pulled earlier (or AMD, for that matter). This is "presenting" (well, not really, more like "not quite omitting") very specific numbers from a very specific benchmark in a very specific context, and not making a fuss about it. AMD hasn't at any point said "Zen2 has 29% improved IPC from Zen." If they did, that would be marketing BS. And, again:


Valantar said:


> We need to remember that IPC is workload dependent. There might be a 29% increase in IPC in certain workloads, but generally, when we talk about IPC it is _average_ IPC across a wide selection of workloads. This also applies when running test suites like SPEC or GeekBench, as they run a wide variety of tests stressing various parts of the core. What AMD has "presented" (it was in a footnote, it's not like they're using this for marketing) is from two specific workloads. This means that a) this can very likely be true, particularly if the workloads are FP-heavy, and b) this is very likely not representative of total average IPC across most end-user-relevant test suites. In other words, this can be both true (in the specific scenarios in question) and misleading (if read as "average IPC over a broad range of workloads").


Given that AMD hasn't pushed this as a selling point, they haven't done anything even remotely wrong. They can't be blamed for inept journalists and/or fanboys taking specific statements out of context.


----------



## efikkan (Nov 13, 2018)

R0H1T said:


> That's because it's nearly impossible to replicate test conditions from one test bench to another, one chip to another. AMD can't lie to their investors, their claims are true even if cherry picked. Yes ~ let's assume best case, or best of 3 (runs) but what you seem to be doing here is Intel/Nivdia (AMD?) lied in the past so this is also a lie.


You know very well what I'm talking about; all the vendors choose benchmarks which favor them at any given time, putting their product in the best possible position.



looncraz said:


> They are usually bottlenecked by ALU->FPU communication or, as you say, cache miss penalties.


ALU->FPU communication? What do you mean by that? Conversion of ints to floats?



looncraz said:


> Ryzen's sequential cache performance is very good, but it falls flat with random accesses, so that is definitely a significant role - and hopefully something AMD has resolved with Zen 2.


You do know how cache works, right? In sequential reads the cache should be "transparent".
Cache is organized in banks. Zen's 8-way 512 kB L2 is actually 8 separate 64 kB caches (Skylake has a 4-way 256 kB L2, Haswell an 8-way 256 kB one). Memory is stored in 64-byte cache lines; for sequential reads the cache lines will be evenly spread across the banks. Zen having 8×64 kB L2 banks vs. Skylake's 4×64 kB should not give Zen any disadvantage in latency or throughput.

Intel's advantage isn't a faster cache, it's a better front-end/prefetcher that detects linear accesses, which improves cache hit ratio.

What do random accesses have to do with this? Nothing can ever predict truly random accesses; they will fall through the cache and read directly from memory. The only thing that can marginally help here is the OoOE trying to dereference a pointer etc. as early as possible, but there are limits to how far ahead the prefetcher can "see", and of course branching logic and other pointers may limit the room for early execution. Once again this has to do with the efficiency of the prediction, not the latency of the cache.
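For reference, this is how addresses map into a set-associative cache. A generic sketch using the 512 KB / 8-way / 64-byte figures from above (not a claim about Zen's exact indexing function):

```python
# Generic set-associative mapping: 512 KiB cache, 8 ways, 64-byte lines.
LINE = 64
SIZE = 512 * 1024
WAYS = 8
SETS = SIZE // (LINE * WAYS)   # 1024 sets

def cache_set(addr):
    """Which set a physical address lands in (simple modulo indexing)."""
    return (addr // LINE) % SETS

# Sequential lines walk the sets in order, so streaming reads spread evenly
# across the cache; a stride of SETS * LINE keeps hammering one set.
print(cache_set(0), cache_set(64), cache_set(SETS * LINE))  # 0 1 0
```

That even spreading is why sequential access is the easy case; the hard part, as noted, is predicting what to fetch, not where it lands.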



looncraz said:


> It doesn't help that each core advertises access to 8MiB of L3 but only seems to search 4MiB worth of tags before jumping to the IMC.  An L4 would help here - we wouldn't be hitting memory latencies for in-page random access, in the very least, but would be much closer to Intel's ~20ns figures.


You mean that each group of four cores shares one L3 cache?
L3 is largely a "spillover cache": cache lines which have been recently used but kicked out of L2. Even in heavily multithreaded workloads, very little L3 is ever shared among cores. And when it is, it's mostly code, not data. When it comes to writing, the CPU engages a write-lock to discard a cache line from all caches; any latency here comes down to the entire memory structure, not the L3.
I also want to remind you that Intel switched their memory structure in Skylake-X/-SP vs. Skylake, making L2 larger and L3 smaller but exclusive, and they improved overall efficiency.
I don't see any evidence to support that Zen is disadvantaged by having 4 MB of L3 per core. If they add an L4, it will basically be a larger "spillover cache", and they would also have to be careful not to increase overall latency through the added complexity.
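A "spillover" (victim) L3 in this sense only ever receives lines evicted from L2. A toy sketch of that fill policy (purely illustrative LRU bookkeeping, not AMD's actual replacement logic):

```python
from collections import OrderedDict

class VictimL3:
    """Toy 'spillover' cache: lines enter only when evicted from L2,
    and the least-recently-inserted line falls out toward memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def on_l2_evict(self, line):
        self.lines[line] = True
        self.lines.move_to_end(line)          # most recent spill at the back
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # oldest spill leaves the cache

    def hit(self, line):
        return line in self.lines

# Three L2 evictions into a 2-line L3: the oldest spill is gone.
l3 = VictimL3(capacity=2)
for line in (0x40, 0x80, 0xC0):
    l3.on_l2_evict(line)
print(l3.hit(0x40), l3.hit(0xC0))  # False True
```

The point of the model: nothing is ever in L3 because a core asked for it directly, which is why cross-core sharing through L3 is mostly incidental.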


----------



## looncraz (Nov 14, 2018)

efikkan said:


> ALU->FPU communication? What do you mean by that? Conversion of ints to floats?



Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex, whereas Intel's floating point units sit on the same pipelines as their integer units.



efikkan said:


> In sequential reads cache should be "transparent".



Ideally, yes, you should never have a stall with streaming data... but there's a difference between operating on data right off a data bus, within the register file, hitting the ~1 ns latency of L1D, or hitting the 3-4 ns latency of the L2.  On Zen, at least, there's no real bandwidth penalty for hitting the L2.
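Those nanosecond figures are easier to reason about as cycle counts. A quick conversion at an assumed ~4 GHz clock, using the numbers in the post:

```python
# Convert cache latencies from nanoseconds to core cycles at ~4 GHz.
GHZ = 4.0  # assumed clock for the conversion

def ns_to_cycles(ns):
    return ns * GHZ

print(ns_to_cycles(1.0))   # 4.0  -- ~4 cycles for an L1D hit
print(ns_to_cycles(3.5))   # 14.0 -- ~14 cycles for an L2 hit
```

So even an L2 hit costs roughly a dozen-plus issue opportunities, which is why keeping the working set in L1D matters so much.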



efikkan said:


> Zen having 8×64kB L2 caches vs. Skylake's 4x64kB caches should not give Zen any disadvantage in latency or throughput.



That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.


efikkan said:


> Intel's advantage isn't a faster cache, it's a better front-end/prefetcher to detect linear accesses which improves cache hit ratio.



Most of Intel's front end isn't necessarily better than Zen's (just different).  Remember that the 6900K has a 20MiB L3.  Intel's main advantage is a tightly coupled, low-latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens on any access beyond 8MiB...

Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory.  Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.




efikkan said:


> You mean that each four cores shares one L3 cache?
> L3 cache is largely a "spillover cache", cache lines which have been recently used but kicked out of L2. Even in heavy multithreaded workloads, very little L3 is ever shared among cores.



Data sharing is insanely common in multi-threaded programs.  If it weren't, we wouldn't have to worry so much about lock contention.  I have a good ~20 years of MT programming experience (from way back in the BeOS 4.5 era) with numerous programming languages, operating systems, and devices.  I directly tested Zen's inter-core and inter-CCX communications and discovered that there was a fast-path (low latency, low bandwidth) communication route for small data packets, before AMD detailed the command fabric.  I discovered it by accident because I couldn't explain how I was getting data from core 0 to core 7 (on different CCXes) with only something like a 20ns penalty versus going from core 0 to core 1 (same CCX, neighboring cores)... but I digress...



efikkan said:


> I also want to remind you that Intel switched their memory structure in Skylake-X/-SP vs. Skylake, making L2 larger and L3 smaller, but making L3 exclusive, and they improved overall efficiency.



Yes, they copied Zen (kind of a joke...).  AMD uses a 'mostly' exclusive design - though we don't fully know, AFAIK, what they mean by that.




efikkan said:


> I don't see any evidence to support that Zen is disadvantaged from having 4MB L3 per core. If they add an L4, it will basically be a larger "spillover cache", and they would also have to be careful not to increase overall latency by the added complexity.



It's not the 4MiB per core - it's what happens in many scenarios when a core tries to access beyond that basic block...


----------



## londiste (Nov 14, 2018)

looncraz said:


> Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory.  Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.


I would argue Ryzen gets a jump in game performance due to increased inter-die communication speed, not the lower memory latency.
On Intel CPUs there are games where faster memory helps, but this is far from common, and in many cases memory speed makes a negligible difference. The same does not apply to Ryzen, which gets a jump from faster memory across the board.


----------



## HTC (Nov 14, 2018)

londiste said:


> I would argue Ryzen gets a jump in the game performance due to *increased inter-die communication speed, not the lower memory latency.*
> On Intel CPUs there are games where faster memory benefits but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens, these will get a jump from faster memory across the board.



More likely, a combination of both.


----------



## R0H1T (Nov 14, 2018)

efikkan said:


> *You know very well what I'm talking about; all the vendors choose benchmarks which favors them at any time, which puts their product in the best possible position.*
> 
> 
> ALU->FPU communication? What do you mean by that? Conversion of ints to floats?
> ...


And that's fine, so long as the benchmarks aren't outright fudged or, in some cases, the competition's system deliberately gimped. Best case implies just that & should always be taken with a grain of salt.

What you're forgetting though is that without these "best case" benches there would be little to no IPC gain, for Intel, in the last few years. Take the FP numbers out, heavily influenced by AVX2 or AVX512, & you have virtually 0 IPC gains for close to 4 years, if not more. That's because* x86 has pretty much reached the end of the line* so far as IPC gains are concerned. The biggest performance gains this decade have come from tweaking cache hierarchy, DDR4, AVX & clock speeds. *That's not the case for ARM* but it's also not a part of this debate.


----------



## londiste (Nov 14, 2018)

ARM is not magic. The gains ARM has made have long been present in x86 and other architectures (VFP, NEON, SVE, 64-bit, multiple cores, out-of-order, prediction). In addition, due to being small(er), simple(r) and cheap(er), ARM has had a process node advantage for a couple of generations now. As ARM improves, it will start running into the same problems as other architectures.


----------



## WikiFM (Nov 14, 2018)

Mark Little said:


> Anandtech does a great job with their Bench tool on their website. It helps with conversations like this one. Here are the Full package load power measurements for the last four Intel generations:
> i7-6700K 82.55W
> i7-7700K 95.14W
> i7-8700K 150.91W
> ...



In that same chart the 8086K, which is a tiny bit faster than the 8700K, consumes 100 W - why is that?



Valantar said:


> The 8121U has its AVX512 units (as well as pretty much everything else) disabled.
> 
> None of us will be buying EPYC. That's why we're taking what they've said about it and attempting to extrapolate what this means for Ryzen 3000 and TR3. Also, it's interesting to discuss when someone makes some actual innovations in this space, even if we're not in the target market.
> 
> I'd actually consider the same, even if I'm very happy with my 1600X.



The 8121U is not disabled: https://ark.intel.com/products/136863/Intel-Core-i3-8121U-Processor-4M-Cache-up-to-3_20-GHz. And even if that were the case, it still has them.


----------



## R0H1T (Nov 14, 2018)

londiste said:


> *ARM is not magic*. The gains ARM has done have already been there for a long time in x86 and other architectures (VFP, NEON, SVE, 64-bit, multiple cores, out-of-order, prediction). In addition to that due to it being small(er), simple(r) and cheap(er) ARM has had a process node advantage for a couple generations now. As ARM improves, it will start running into similar problems as other architectures.


Alright, & where did I say that? The reason ARM is doing better than x86 atm is that x86 has been tweaked & improved over 4 decades; the gains decade on decade have been massive. ARM has only been realizing these (massive) gains since ~2000, so they may still have a decade or half to hit the wall which x86 seems to be running into right now. They could, of course, hit the physics wall first.

Why do I see this dismissive tone when talking about ARM here, do you buy phones with Intel Inside? Why do you think that is, do you treat (one of) AMD vs Nvidia the same way?


----------



## londiste (Nov 14, 2018)

R0H1T said:


> Why do I see this dismissive tone when talking about ARM here, do you buy phones with Intel Inside?


I apologize; this wasn't so much about your post as about the apparent general impression that ARM is something completely different and has a long way to go. It doesn't.

The other part is why are you dismissive about x86? There have been some tests on single core performance here and there from Sandy Bridge forward (7 years). For example:
https://m.sweclockers.com/test/23426-amd-ryzen-7-1800x-och-7-1700x/29
23% in Cinebench is not too bad. And Cinebench does only 128-bit AVX (which is from Sandy Bridge/Bulldozer era).

Intel has been stuck on Skylake and derivatives for 3 years, but I would not necessarily put this down to an inability to improve IPC. Core count has been the clear focus for the last few years.


----------



## Valantar (Nov 14, 2018)

WikiFM said:


> 8121U is not disabled: https://ark.intel.com/products/136863/Intel-Core-i3-8121U-Processor-4M-Cache-up-to-3_20-GHz. And even if it was the case it has them.


I stand corrected. I guess _something _on that chip had to avoid getting cut, lord knows everything else is.


----------



## R0H1T (Nov 14, 2018)

londiste said:


> I apologize, this wasn't so much about your post but the apparent general impression that ARM is something completely different and has long way to go. It doesn't.
> 
> *The other part is why are you dismissive about x86*? There have been some tests on single core performance here and there from Sandy Bridge forward (7 years). For example:
> https://m.sweclockers.com/test/23426-amd-ryzen-7-1800x-och-7-1700x/29
> ...


So far as raw performance is concerned, no, x86 is still the team (Intel & AMD) to beat. However, as I've noted, IPC gains outside of AVX-assisted (FP) workloads have been hard to come by. Can you deny that?

I've also noted that the biggest changes have been in cache, memory, clock speeds & arguably HT, or SMT for AMD. Admittedly ARM also benefits from that, but again the point is ARM is coming from a much smaller base (number) & so their gains are incredible. The biggest servers and supercomputers will still be vastly x86-based, but as you've said, that's down to more cores. IMO (chip) interconnect technologies like UPI or IF are the next & perhaps the last hurdle before x86 reaches its peak. I don't see the same kind of progress in the next decade as we've seen since 2010, unless there's some major breakthrough. The future is dedicated (hardware) accelerators; that is where I see the computing realm headed. The core wars have just begun, but even there physics will catch up pretty soon.


----------



## bucketface (Nov 14, 2018)

This sounds very impressive - 29% IPC for integer workloads... but that is one specific workload type, not a general-use scenario with a 29% improvement, so don't get too hyped. And for those trying to call this out: it's pretty honest information, but only if your workload is integer-heavy. Overall, hopefully they can get a 10%+ improvement in IPC, with clocks going up as well.


----------



## londiste (Nov 14, 2018)

R0H1T said:


> I've also noted that the biggest changes have been in cache, memory, clock speeds & arguably HT or SMT for AMD.


Caches are an integral part of any contemporary CPU.
Memory speeds have increased, yes. Well, latency not so much, but bandwidth for sure. Memory improvement will continue: DDR5 is on its way, and for some implementations GDDR/HBM, with their respective up- and downsides. How much this affects results depends on the benchmark; Cinebench, for one, scales very little with faster memory.
The Cinebench test I linked is at the same clock speeds and single core/thread, so no HT/SMT in play.



R0H1T said:


> So far as raw performance is concerned, no x86 is still the team (Intel & AMD) to beat. Howsoever, as I've noted, IPC gains outside of AVX assisted (FP) workload have been hard to come by. Can you deny that?


I do not know that I would want to separate FP from general CPU performance. FP has been a part of x86 for a long time - in coprocessors since the beginning, and integrated from the Pentium onwards. Improving parts of the instruction set is part of CPU evolution.



R0H1T said:


> Admittedly ARM also benefits from that, but again the point is ARM are coming from a much smaller base (number) & so their gains are incredible.


Comparisons with ARM are difficult. In addition to integrating performance-improving aspects (many of them tried-and-true) ARM has been moving to scale its architecture(s) higher and higher, largely funded and driven by smartphones. Higher-performing ARM CPUs are not small and they do consume considerable amounts of power.

Actually, since you mentioned AVX and other instruction-level improvements being a bit suspect when comparing IPC: ARM has gained a lot of its IPC, and almost all of its FP performance, from just that. Since ARM's focus is different, they are also going for more cores rather than more per-core performance, especially in the last few years. They also benefit from not having to stay compatible with decades of legacy software.

The rest of your post I wholeheartedly agree with.


----------



## looncraz (Nov 14, 2018)

londiste said:


> I would argue Ryzen gets a jump in the game performance due to increased inter-die communication speed, not the lower memory latency.
> On Intel CPUs there are games where faster memory benefits but this is far from common and in many cases memory speed makes a negligible difference. The same does not apply to Ryzens, these will get a jump from faster memory across the board.



Ryzen basically double-dips when you have higher memory clocks.  Intel gains, Zen gains double.


----------



## bug (Nov 14, 2018)

If it seems too good to be true, it usually is.
AMD has in the meantime clarified that the 29% figure was based on one particular benchmark. So we're basically back to square one.


----------



## looncraz (Nov 14, 2018)

bucketface said:


> This sounds very impressive, 29% ipc for integer workloads... but that is one specific workload type, this is not a general use scenario with 29% improvement so dont get too hype and also for those trying to call this out, well it's pretty honest in it's information, but only if your workload is integer heavy. Overall hopefully they can get a 10% + improvement on ipc and clocks go up as well.




It's a mixed floating point and integer workload with what should be a pretty good number of hits in the L2.  It tells us what the core can do on its own, in a roughly ideal situation, to extract IPC.  Papermaster showed exactly why... there's no missing explanation for the specific benchmark result.

From what we know, the breakdown in performance improvements for this workload probably looks something like this:

Fetch: 0-5% (from L2/L3/IMC)
Dispatch: 30-35% (next-instruction counter, larger uop cache, wider dispatch width)
ALU: 5-15% (the instructions in play are all too simple to see much improvement, so this would be the predictor improvement as it relates to these simple tests)
FPU: 15-33% (non-AVX workload; the advantage comes from the load bandwidth doubling)
Retire: 70-80% (from the doubling of retirement bandwidth - 128-bit to 256-bit - not 100% because of naturally imperfect scaling)

These values would average together to become the IPC increase for this particular workload.  These should be the ranges to expect for any program going through the CPU... with some major caveats - such as the fetch and ALU performance not being well represented in this workload - and the dispatch and retire ruling the day.
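Taking the midpoint of each range above and averaging them with equal weights (a naive assumption - no weighting is given) lands almost exactly on the headline number:

```python
# Equal-weight average of the midpoints of the component ranges above.
# The equal weighting is an assumption; the real contribution mix is unknown.
ranges = {
    "fetch":    (0, 5),
    "dispatch": (30, 35),
    "alu":      (5, 15),
    "fpu":      (15, 33),
    "retire":   (70, 80),
}
midpoints = {name: (low + high) / 2 for name, (low, high) in ranges.items()}
average = sum(midpoints.values()) / len(midpoints)
print(round(average, 1))  # 28.8 -- strikingly close to the 29% headline
```

Coincidence or not, it shows how a few large component gains can average out to the quoted figure.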

___________________

Also, x86 has plenty of room for improvement.  We just have to start walking away from relative energy efficiency.

If we had a process that allowed us to execute and fetch memory with almost no power usage, we would easily double IPC.  Everything in a modern CPU is a compromise for power efficiency... including how aggressively you do predictive computation.

Heck, if we created a semi-dedicated pipeline for predictions and left another dedicated path for in-order execution (leaving instruction bubbles and all, but with power gating), we would see branch misprediction penalties drop close to zero, as we could execute both possibilities for a branch outcome and then just move over each stage's results once a branch prediction is shown true - removing the instruction bubble with a single cycle of latency and resulting in nearly perfect prediction performance.  This is insane in a world where power consumption is important... you would be executing (partly or in full) nearly every instruction in a program - even for branches not taken... we're talking about potentially more than doubling how much is executed every clock cycle.  Still, this would be something like a 50% IPC increase.
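The "execute both possibilities" idea is usually called eager (dual-path) execution. A toy sketch of the trade-off: the misprediction bubble disappears, but every instruction on the not-taken path still costs energy:

```python
# Toy 'eager execution': run both sides of a branch, commit only one result.
# Counting the calls shows the cost -- dynamic work roughly doubles.
work = 0

def eager_branch(cond, then_fn, else_fn):
    taken = then_fn()        # both outcomes execute speculatively...
    not_taken = else_fn()    # ...doubling the dynamic instruction count
    return taken if cond else not_taken   # only one result commits

def heavy_then():
    global work
    work += 1
    return "then"

def heavy_else():
    global work
    work += 1
    return "else"

print(eager_branch(True, heavy_then, heavy_else), work)  # then 2
```

Two units of work for one committed result: exactly the power-for-IPC exchange described above.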


----------



## Valantar (Nov 14, 2018)

bug said:


> If it's too good to be true, it usually is.
> AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.


Clarified? They have never claimed otherwise.


----------



## bug (Nov 14, 2018)

Valantar said:


> Clarified? They have never claimed otherwise.


Maybe "put an end to speculations" would have been clearer?


----------



## GlacierNine (Nov 14, 2018)

bug said:


> If it's too good to be true, it usually is.
> AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.


I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is". 

Congratulations on your newfound grasp of this tautology.


----------



## bug (Nov 14, 2018)

GlacierNine said:


> I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".
> 
> Congratulations on your newfound grasp of this tautology.


Yeah, well, posting in a hurry between two compiles will do that 

Edit: fixed


----------



## looncraz (Nov 14, 2018)

bug said:


> Maybe "put an end to speculations" would have been clearer?



They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.


----------



## bug (Nov 14, 2018)

looncraz said:


> They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.
> 
> It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.


At least it puts an upper bound on expectations, so I'm good.


----------



## looncraz (Nov 14, 2018)

bug said:


> At least it puts an upper bound on expectations, so I'm good.



That it does - it looks like around 30% is going to be the max for any real-world task.  It's kind of a way to temper the whole "we doubled the FPU" thing, I think.

Much better for headlines to read "29% improvement" rather than "100% improvement"... especially when some programs will see a 5% benefit and others 20%, with a few as high as 30%.

Sadly, it's hard to predict Cinebench scores from this, since Cinebench relies much more heavily on branch prediction and prefetch. We can guess it will be at least 10% but no more than 30% - and very, very unlikely to hit 30%.  That's where we were before, except with a lower upper bound.  I still think it will be about 15% on average.


----------



## efikkan (Nov 14, 2018)

looncraz said:


> Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...
> 
> It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.


You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
Let's say you write A + B + C + D in your code,
this will be executed as ((A + B) + C) + D,
and while each addition should only take a few cycles, the CPU would have to wait up to 18 cycles before the next operation can even start. While the exact timing and bandwidth vary slightly between CPU architectures, this principle should largely be the same for any pipelined architecture.
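A sketch of the dependent-chain arithmetic under two different assumptions about when a result becomes reusable (cycle counts are illustrative, not from any datasheet):

```python
# Critical path of A + B + C + D: three dependent additions.
WAIT_FULL_PIPE = 6   # assumed cycles if each add must drain the whole pipeline
FWD_LATENCY = 1      # assumed cycles if a bypass network forwards results

deps = 3
print(deps * WAIT_FULL_PIPE)  # 18 -- the worst case described above
print(deps * FWD_LATENCY)     # 3  -- with result forwarding between ops
```

Which of the two numbers applies depends on whether the core can forward results between execution units, not on how many ALUs it has.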



looncraz said:


> That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.


Judging by that image, there is no issue with L3 cache at all.



looncraz said:


> Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...


Sure, the memory latency can be much worse due to Zen's core structure. But you were the one arguing that Zen needs a better cache. I'm the one pointing out that Zen has a decent cache on paper, but Intel is much better at utilizing their cache through a better front-end.



looncraz said:


> Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory.  Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.


This has to do with AMD's Infinity Fabric being tied to the memory speed.
Intel sees no significant difference with speeds above 2666 MHz; even memory with slightly better timings has virtually no impact.



looncraz said:


> Data sharing is insanely common in multi-threaded programs.


Data sharing between L3 caches, which is what we were talking about, is very rare. The lifecycle of any piece of data in cache is measured in microseconds or less. Cache is just a streaming buffer for RAM; it's not as if the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.


----------



## looncraz (Nov 14, 2018)

efikkan said:


> You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?



It actually doesn't, though that used to be the case. Today, cores fetch and decode chunks of instructions and then determine dependencies. We tag instructions as dependent upon others and try to reroute non-dependent instructions around them before then processing the dependent instruction.

For most instructions, the core knows (within reason) how long it should take to execute and get the result back, so we don't wait for the result - we schedule the instruction so that it is already ready to be fed into a pipeline when the result is available... we want the decoded instruction tagged and sitting in the scheduler - and we want the instruction whose result we need to carry a matching tag.

An add instruction takes a single cycle, but it takes time to get the result.  Intel uses a bus they call, unimaginatively, but accurately, the "result bus."  This bus is fed by the store pipelines and each execution pipeline.  The load and execution pipelines can read results directly off this bus if the timing works out correctly.

So, A + B + C + D would execute as mov A, result; add B, result; add C, result; add D, result;.

One trick here, which is by no means obvious or necessarily done, would be to keep the instructions in two places.  You keep a copy in the scheduler, tagged, as you send the dependent instruction down the pipelines one after the other, so the next instruction can get its result from the previous instruction from the result bus.

The way Intel describes what they do (despite admitting that the execution pipelines can read from the result bus) is to send the result back to the scheduler, where the dependent instruction is waiting for the data. I genuinely suspect they do both (otherwise there'd be no need for an execution pipeline to read from the result bus)... it just depends on how dependably the execution will occur within a given time frame.



efikkan said:


> Judging by that image, there is no issue with L3 cache at all.



That image is using a 256-byte stride, which hides the full effect of the random-access issue until the working set exceeds roughly the cache size, since Ryzen can predict that access pattern well enough.






You can see the (unsurprising) excellent sequential performance (which is only 2.8 ns on my 2700X)... and then the abysmal random-access performance.

Intel's in-page random access performance is several times better. This is the cache performance issue that is hurting Ryzen, and it relates to how often the L3 prefetch ends up hitting the IMC instead of being able to stay within the L3. This happens increasingly often once a single core uses more than 4 MiB of data. By 6 MiB you have a ~50% miss rate that results in hitting memory latency.

My 2700X results are better - because my IMC latency is only 61.9ns with 3600MT/s memory.

If Zen 2 can bring that down by another 20ns while increasing how much L3 each core can access, it's going to be a big deal.  My 2700X only has 9.5ns latency to the L3 - if I had 40ns latency to main memory and 16MiB of cache to access, in-page random access should fall to the 20~30ns region (depending on page size).



efikkan said:


> Intel is much better at utilizing their cache through a better front-end.



Zen's front end is extremely good.  As is Intel's.

Zen's has higher throughput potential (8 uops vs 4 uops), but Intel has fusion - so that 4 uops is sometimes 7 uops... and Zen's 8 uops is sometimes 4 uops...
Intel's branch predictor seems to be better, but that's about it.

The first Zen bottleneck (if you want to call it that) is when Intel can dispatch 7 uops and Zen can only dispatch 6.  Intel isn't always able to dispatch 7 uops, but Zen can never exceed 6.  That's a potential 16.7% advantage to Intel.

The next Intel advantage is their unified scheduler, which allows accessing results without going back to the scheduler. Zen, AFAICT, needs to send results back to the forwarding muxes or the PRF. This is only a couple of cycles, and AMD makes up for it by having four full-featured ALU pipelines and six independent schedulers. Being only 14 deep keeps things simple to manage, but it may mean results need to be fetched from the L1D (3-cycle penalty) more often.



efikkan said:


> Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.



If by "between L3 caches" you mean between each CCX or die - yes, that's true. Everything is always referenced as a memory address, and the LLC acts as the insulator to main memory. However, cross-communication does occur for certain volatile memory. This seems to happen via the command bus, but it also happens via the data fabric. This is probably magic that happens through the IMC without going to system memory, which would explain the latency results with core-to-core communication (simple test: fixed affinity, each core accessing the same memory addresses, just reading and updating a simple struct - a timestamp, the accessing core, and a mutex... each core that gains the mutex records the time difference since the last access and which core made that access, updates the struct, and moves on). That test showed that handling the mutex could, at peak, take only 10 ns between the CCXes (this could even be a timing-mechanism inaccuracy, since this is all post-hoc analysis), but it usually took far more... strong clusters at 20 ns and 40~50 ns, and a good half at 100 ns or more (which means going out to main memory).

Multi-threaded apps share data across cores, it's as simple as that, and mutexes and volatile memory are both things the CPU can figure out with ease, so optimizing for those, at the very least, has been done.



efikkan said:


> Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.



Yes, it will be extremely interesting to see how the newly unified IMC that's spread far across the IO die will work to solve some of these issues.


----------



## Vario (Nov 15, 2018)

> '_The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications,_' AMD's clarification continues. '_We will provide additional details on "Zen 2" IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch._'



https://bit-tech.net/news/tech/cpus/amd-downplays-29-percent-zen-2-ipc-boost-reports/1/


----------



## $ReaPeR$ (Nov 18, 2018)

Well, based on their recent track record it's quite plausible that with Zen 2 they will reach and maybe even overtake Intel (by a small margin) on single-core IPC. Their only real disadvantage is clock speed, and that can be fixed relatively easily with a smaller node.


----------



## Adam Krazispeed (Nov 21, 2018)

btarunr said:


> AMD's "59% higher" claims for Zen1 over Excavator invited the same ridicule.
> 
> Lisa Su is very careful about the guidance she puts out.





Actually... it was an IPC uplift of 52% from "Excavator".

If this is another 29% more IPC over Zen 1, then Intel is done...


----------

