
AMD Dragged to Court over Core Count on "Bulldozer"

cdawall

where the hell are my stars
Joined
Jul 23, 2006
Messages
27,680 (4.14/day)
Location
Houston
System Name All the cores
Processor 2990WX
Motherboard Asrock X399M
Cooling CPU-XSPC RayStorm Neo, 2x240mm+360mm, D5PWM+140mL, GPU-2x360mm, 2xbyski, D4+D5+100mL
Memory 4x16GB G.Skill 3600
Video Card(s) (2) EVGA SC BLACK 1080Ti's
Storage 2x Samsung SM951 512GB, Samsung PM961 512GB
Display(s) Dell UP2414Q 3840X2160@60hz
Case Caselabs Mercury S5+pedestal
Audio Device(s) Fischer HA-02->Fischer FA-002W High edition/FA-003/Jubilate/FA-011 depending on my mood
Power Supply Seasonic Prime 1200w
Mouse Thermaltake Theron, Steam controller
Keyboard Keychron K8
Software W10P
Considering IBM produces chips that are nearly identical to Bulldozer with four integer clusters and they don't call that a quad-core, I'd say AMD is definitively wrong.

Not to nitpick, but isn't this the exact opposite of what you said earlier? I thought AMD was the only CPU to ever attempt this...

They are wide cores. This lawsuit will likely force AMD to call them cores too.

Doubtful. AMD can create words to describe things just as well as the next guy. If AMD can't call what they consider a module a module, I guess Intel will have to ditch HyperThreading in favor of SMT. That is literally what you are saying needs to happen.

It would still be a 4-threaded core. A lot of enterprise RISC processors already handle 8 threads per core (many FPUs and ALUs in each), so that isn't exactly new.

Difference is those only have ONE integer and ONE FPU, not TWO and ONE.

Go ahead and run your benchmarks then. I'm waiting. Here's the post, by the way. Spoiler: it will never reach the 95%+ scaling that an actual dual core would.

I was very specific about the workloads that would show near 100% scaling, and I would wager you cannot prove me wrong. But after reading your argument, you found one useless benchmark (not a real-world scenario) that only uses the FPU for calculations and claimed I am incorrect. As has been said a multitude of times, the FPU isn't used for the majority of calculations. The real issue with AMD isn't the configuration of the modules; it is the shit design of the internal cores themselves. The module works excellently, and if the cores were stronger the very idea of this lawsuit wouldn't even exist. That, my friend, is actually the basic design of Zen, mind you.
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.48/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Not to nitpick, but isn't this the exact opposite of what you said earlier? I thought AMD was the only CPU to ever attempt this...
Didn't know about POWER7/8 previously.

Doubtful. AMD can create words to describe things just as well as the next guy. If AMD can't call what they consider a module a module, I guess Intel will have to ditch HyperThreading in favor of SMT. That is literally what you are saying needs to happen.
You can't call donkeys elephants and sell them as elephants without getting sued. AMD did as much, and got sued.

Difference is those only have ONE integer and ONE FPU, not TWO and ONE.
Sure doesn't look like it in the diagram. It actually looks like there are two (one is just math, the other is math + load/store) and each one is two-wide. The only difference is that IBM didn't draw a box around it and say "herp, derp, dis is a 'core'"

I was very specific about the workloads that would show near 100% scaling, and I would wager you cannot prove me wrong.
I don't have an FX-8350 to test. I've written a lot of programs that get near 100% scaling. Random Password Generator would actually be a pretty good test for this.

10 million attempts
uncheck special characters
check require special characters (creates an unsolvable situation)
minimum characters: 32 (Edit: added this one because it can massively impact time if it randomly does a lot of short ones)

All cores enabled: 5.9142104 s
Even-numbered cores disabled in Task Manager (it still spawns 8 threads): 13.2191610 s

123% faster

I'm gonna add a thread limiter to make it easier to test...
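
Stripped to its essence, a test like that boils down to something like this rough C++ sketch (illustrative only, not the actual .NET program; the workload, file name, and iteration counts are made up): an integer-only loop split across N worker threads, timed as a whole.

Code:
// scale.cpp (hypothetical) - time an integer-only workload across N threads
// Build: g++ -O2 -std=c++11 -pthread scale.cpp -o scale
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// 64-bit LCG busy loop; deliberately touches no floating point.
static uint64_t churn(uint64_t iters) {
    uint64_t x = 88172645463325252ULL;
    for (uint64_t i = 0; i < iters; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

int main(int argc, char** argv) {
    const int threads = (argc > 1) ? std::atoi(argv[1]) : 8;  // the "thread limiter"
    const uint64_t total = 400000000ULL;                      // total work, split evenly
    std::vector<uint64_t> out(threads);
    std::vector<std::thread> pool;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < threads; ++i)
        pool.emplace_back([&, i] { out[i] = churn(total / threads); });
    for (auto& t : pool) t.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("%d thread(s): %.7f s\n", threads, dt.count());
    return 0;
}

Run it with 1, 2, 4, and 8 as the argument and compare the times; on a "real" dual core the 2-thread run should land near half the 1-thread time.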
 

64K

Joined
Mar 13, 2014
Messages
6,739 (1.73/day)
Processor i7 7700k
Motherboard MSI Z270 SLI Plus
Cooling CM Hyper 212 EVO
Memory 2 x 8 GB Corsair Vengeance
Video Card(s) Temporary MSI RTX 4070 Super
Storage Samsung 850 EVO 250 GB and WD Black 4TB
Display(s) Temporary Viewsonic 4K 60 Hz
Case Corsair Obsidian 750D Airflow Edition
Audio Device(s) Onboard
Power Supply EVGA SuperNova 850 W Gold
Mouse Logitech G502
Keyboard Logitech G105
Software Windows 10
You can't call donkeys elephants and sell them as elephants without getting sued. AMD did as much, and got sued.

That's true and this madness needs to end right now. I'm fed up with Door To Door Donkey Salesmen trying to swindle me out of my hard earned money.

When this trial ends I wager that there will be a legal definition of a core if nothing else. It will be interesting to watch AMD backpedal at that time.
 

cdawall

where the hell are my stars
Didn't know about POWER7/8 previously.

Obviously...


You can't call donkeys elephants and sell them as elephants without getting sued. AMD did as much, and got sued.

Thing is, they didn't call a donkey an elephant; they stepped away from what you believe is the status quo and produced something that was scalable in a way no manufacturer had done before.

Sure doesn't look like it in the diagram. It actually looks like there are two (one is just math, the other is math + load/store) and each one is two-wide. The only difference is that IBM didn't draw a box around it and say "herp, derp, dis is a 'core'"

You do know why IBM doesn't have to draw boxes around things and explain what cores are, correct? The thing about enterprise-level equipment is that function matters, not the nonsense this lawsuit is about. I'll say it again: this is literally an argument over a definition that doesn't exist.

I don't have an FX-8350 to test. I've written a lot of programs that get near 100% scaling. Random Password Generator would actually be a pretty good test for this.

I will give it a shot. I am slightly curious how much of a difference task schedulers make in the situation as well.

That's true and this madness needs to end right now. I'm fed up with Door To Door Donkey Salesmen trying to swindle me out of my hard earned money.

When this trial ends I wager that there will be a legal definition of a core if nothing else. It will be interesting to watch AMD backpedal at that time.

Thing is, all AMD has to do is stand strong instead of backpedaling. If they strong-arm the lawsuit they will win; if they backpedal it will be assumed to be due to known guilt.
 

FordGT90Concept

"I go fast!1!11!1!"
You do know why IBM doesn't have to draw boxes around things and explain what cores are, correct? The thing about enterprise-level equipment is that function matters, not the nonsense this lawsuit is about. I'll say it again: this is literally an argument over a definition that doesn't exist.
No, it's because IBM knew they couldn't get away with selling the chip as a 16 or 32 "core" processor when it clearly only has 8 cores. You know, like the FX-8350 clearly only has 4 cores.

Thing is, all AMD has to do is stand strong instead of backpedaling. If they strong-arm the lawsuit they will win; if they backpedal it will be assumed to be due to known guilt.
You don't think Seagate tried to do the same when sued over HDD capacity? There is no path for AMD to win here.

This is debugging data (threads: seconds)...I'll upload the updated program shortly...
1: 24.6987115
2: 13.1477996
3: 9.1914374
4: 7.4688438
5: 6.8086950
6: 6.2363480
7: 5.8927118
8: 5.7746498
...the application is working correctly. Big jumps between 1-4 where there's an actual core to do the work. Small jumps between 5-8 where HTT is kicking in. Beyond that, performance is expected to fall because the threads are fighting each other for time.

...once W1z lets me edit it that is...


1.1.4, 6700K, final...
8: 5.6961283
7: 5.7390397
6: 6.2014922
5: 6.7108575
4: 7.1342991
3: 8.8729954
2: 12.6389990
1: 24.1833987
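
Reading those numbers as rough scaling factors (just arithmetic on the times above): 8 threads vs. 1 is 24.1833987 / 5.6961283 ≈ 4.25x on the 4-core/8-thread 6700K, slightly better than ideal 4-core scaling once HTT is counted, while 4 threads vs. 1 is 24.1833987 / 7.1342991 ≈ 3.39x, roughly 85% of ideal.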
 

cdawall

where the hell are my stars
How do you edit the minimum number of characters?
 

FordGT90Concept

"I go fast!1!11!1!"

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,162 (2.82/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
He also said blocking was possible. Cores never block other cores ergo not a dual core.
DMA is blocking and memory writes are usually write-back to cache, so does a shared L2 negate the possibility of being a core?
Except that those "cores" don't understand x86 instructions. They understand opcodes given to them by the instruction decoder and fetcher. On the other hand, a real core (even the POWER7 and POWER8 behemoths) has the hardware to interpret instruction to a result without leaving the core. So either AMD's definition is wrong or Intel, IBM, ARM Holdings, and Sun are wrong. Considering IBM produces chips that are nearly identical to Bulldozer with four integer clusters and they don't call that a quad-core, I'd say AMD is definitively wrong.
POWER7 is only a behemoth in the sense that it has a strangely large number of discrete FPUs but, the smallest constant unit is the fixed point unit or combo of ALUs and AGUs. IBM produces CPUs that actually have a pretty large amount of floating point hardware given the fact that it's a general purpose CPU.
All modern operating systems call FX-8350 a quad-core with 8 logical processors, not just Windows. When *nix has to work on POWER7 and Bulldozer, are they really going to use AMD's marketing terms to describe what is actually there? I'd hope not.
You say that like it's because of the definition of a core and not for the sake of how processes are scheduled in the kernel.
Asynchronous multithreading is always capable of loading systems to 100% so long as it can spawn enough threads and those threads are sufficiently heavy. Overhead is only encountered at the start in the main thread and at the end of the worker thread (well under 1% of compute time).
That depends on how the application is architected. Most applications don't have 100% independent threads, and even if they do, they usually require being joined by a control thread that completes the calculation or whatever is going on. That one thread is going to wait for all the other dispatched ones to complete. Purely functional workloads benefit the most from multiple cores because they have properties that allow for good memory locality (data will primarily reside in cache). I've been writing multithreaded applications for several years now and I can tell you that in most cases these workloads aren't purely async. More often than not, there are contested resources that limit overall throughput. Applications that can be made purely functional are prime examples of things that should be run on the GPU because of the lack of data dependencies on calculated values. A toy example of the contested-resource case is sketched below.
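
To put a toy example behind that (a C++ sketch for illustration, not taken from any real application): eight otherwise independent workers that all bump one mutex-protected counter, so the extra threads mostly queue on the lock instead of scaling.

Code:
// contested.cpp (hypothetical) - a shared, locked counter as the contested resource
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    const int n = 8;
    std::mutex m;
    uint64_t shared_total = 0;                        // the contested resource
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&] {
            for (int k = 0; k < 1000000; ++k) {
                std::lock_guard<std::mutex> lock(m);  // serializes all 8 threads
                shared_total += k;
            }
        });
    for (auto& t : workers) t.join();                 // the control thread waits on everyone
    std::printf("%llu\n", (unsigned long long)shared_total);
    return 0;
}

Adding threads here barely helps no matter how many cores (or "cores") the CPU has; the lock in the middle and the join at the end are the limits, which is the point about workloads that aren't purely async.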

As a developer, if I have a thread that is not limited in most situations and will almost always give me a speedup of over 50% versus another thread, I consider it a core. It's tangible bandwidth that can be had, and to me that's all that matters. HTT only helps in select cases; more often than not I can't get a speedup beyond one or two threads over 4 on a quad-core Intel setup using hyperthreading, where I can with Bulldozer integer cores.

I'll agree with you that the integer core isn't what we've traditionally recognized as a core, but it has far too many dedicated resources to call it SMT. So while it might not be a traditional core, it's a lot more like a traditional core than like SMT.
 

FordGT90Concept

"I go fast!1!11!1!"
DMA is blocking and memory writes are usually write-back to cache, so does a shared L2 negate the possibility of being a core?
No because that can happen with any pool of memory with multiple threads accessing it.

POWER7 is only a behemoth in the sense that it has a strangely large number of discrete FPUs but, the smallest constant unit is the fixed point unit or combo of ALUs and AGUs. IBM produces CPUs that actually have a pretty large amount of floating point hardware given the fact that it's a general purpose CPU.
It has a lot of hardware on both accounts. Unlike UltraSPARC T1, it was designed to do well at everything...so long as it could be broken into a lot of threads.

You say that like it's because of the definition of a core and not for the sake of how processes are scheduled in the kernel.
They go hand-in-hand. Because a "module" really represents a "core", operating systems need to issue heavy threads to each core before scheduling a second heavy thread to the same core. Windows XP (or was it 7?) got a patch to fix that on Bulldozer because the order of the cores reported to the operating system differed from HTT's. The OS needs to treat the two technologies similarly to maximize performance.
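
A rough Windows-only sketch of the same idea from user land (assuming a hypothetical 4-module/8-thread FX chip where logical CPUs pair up as 0/1, 2/3, 4/5, 6/7): restrict the process to one logical CPU per pair, roughly what unticking every other CPU in Task Manager does. The real fix belongs in the kernel scheduler, not in applications.

Code:
// affinity.cpp (hypothetical) - one logical CPU per module/HTT pair
#include <windows.h>
#include <cstdio>

int main() {
    DWORD_PTR mask = 0;
    for (int cpu = 0; cpu < 8; cpu += 2)          // logical CPUs 0, 2, 4, 6
        mask |= (DWORD_PTR)1 << cpu;
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("affinity mask set to 0x%llx\n", (unsigned long long)mask);
    // ...spawn the heavy worker threads here...
    return 0;
}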

Most applications don't have 100% independent threads, and even if they do, they usually require being joined by a control thread that completes the calculation or whatever is going on.
Virtually every application I multithread does so asynchronously. The only interrupt is updating on progress (worker thread invokes main thread with data which the main thread grabs and carries out). That is probably why there is up to a 20% hit. I suppose I could reduce the number of notifications but...meh. 10 million 32 character passwords generated in <6 seconds is good enough. :roll:
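
One cheap way to cut that kind of progress-update cost, sketched in C++ rather than the actual .NET/WPF code (names and numbers made up): workers add to a shared atomic counter in batches and the main/UI thread polls it on a timer, instead of being invoked once per item.

Code:
// progress.cpp (hypothetical) - batched progress counter polled by the main thread
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const uint64_t total = 10000000ULL;           // 10 million "passwords"
    const int n = 8;
    std::atomic<uint64_t> done(0);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&] {
            uint64_t local = 0;
            for (uint64_t k = 0; k < total / n; ++k) {
                ++local;                          // the real per-item work would go here
                if (local % 65536 == 0)
                    done.fetch_add(65536, std::memory_order_relaxed);
            }
            done.fetch_add(local % 65536, std::memory_order_relaxed);
        });

    // "UI" side: poll a few times per second instead of once per password.
    while (done.load(std::memory_order_relaxed) < total) {
        std::printf("\r%llu generated", (unsigned long long)done.load(std::memory_order_relaxed));
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }
    for (auto& t : workers) t.join();
    std::printf("\rdone: %llu\n", (unsigned long long)done.load(std::memory_order_relaxed));
    return 0;
}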

SMT is a concept, not a hardware design. SMT is what Bulldozer does (two threads in one core).
 

Aquinus

Resident Wat-man
No because that can happen with any pool of memory with multiple threads accessing it.
That doesn't mean that latency is going to be consistent between cores without shared cache. Common L2 makes it advantageous to call two integer cores as a pair of logical cores because they share a local cache. Context switching between those two cores will result in better cache hit rates because data is likely to already reside in L2 if it was used on the other integer core. That improves performance because accessing memory is always slower than hitting cache. It improves latency because you're preserving memory locality, not because you don't understand that an integer core is that thing that does most of the heavy lifting in a general purpose CPU.
It has a lot of hardware on both accounts. Unlike UltraSPARC T1, it was designed to do well at everything...so long as it could be broken into a lot of threads.
Except it doesn't do everything well. It has a huge emphasis on floating-point performance, where POWER7 might be able to keep up with Haswell, but when it comes to integer performance it gets smacked down, just like AMD does with floating-point performance. Intel is successful because it beefs the hell out of its cores. It doesn't mean that what AMD provided is not a core, it just means that its ability to compute per clock cycle is less than Intel's due to the differences inside the cores themselves, not because AMD hasn't produced something that can operate independently. If I really need to dig it out, I can pull up the table of dispatch rates per clock cycle for the most common x86 instructions on several x86-based CPUs. Haswell straight-up dominates everything because it can do the most per clock cycle, which makes sense because the FPU is gigantic and Intel has just been adding ALUs and AGUs.
They go hand-in-hand. Because a "module" really represents a "core", operating systems need to issue heavy threads to each core before scheduling a second heavy thread to the same core. Windows XP (or was it 7?) got a patch to fix that on Bulldozer because the order of the cores reported to the operating system differed from HTT's. The OS needs to treat the two technologies similarly to maximize performance.
As I stated earlier, it's due to memory locality. The Core 2 Quad, being two C2D dies on one chip, could have had improved performance by using this tactic as well because context switching on cores with a shared cache is faster than where there isn't. This isn't because they're not real cores, it's for scheduling purposes but, I'm sure you work with kernels and think about process scheduling all the time and would know this so, I'm just preaching to the choir.
Virtually every application I multithread does so asynchronously. The only interrupt is updating on progress (worker thread invokes main thread with data). That is probably why there is up to a 20% hit. I suppose I could reduce the number of notifications but...meh. 10 million 32 character passwords generated in <6 seconds is good enough. :roll:
Sounds like a real world situation to me and by no means a theoretical one. :slap:
SMT is a concept, not a hardware design. SMT is what Bulldozer does (two threads in one core).
SMT is most definitely a hardware design and to say otherwise is insane. Are you telling me that Intel didn't make any changes to their CPUs to support hyper-threading? That is a boatload of garbage. Bulldozer is two cores with shared hardware, most definitely not SMT. SMT is making your core wider to allow for instruction level parallelism which SMT can take advantage of during operations that don't utilize the entire core or eat up the entire part of a single stage of the pipeline. Bulldozer has dedicated hardware and registers. SMT implementations most definitely don't have a dedicated set of registers, ALUs, and AGUs. They utilize the extra hardware already in the core to squeeze in more throughput which is why hyperthreading gets you anywhere between 0 and 40% performance of a full core.
 

FordGT90Concept

"I go fast!1!11!1!"
That doesn't mean that latency is going to be consistent between cores without shared cache. Common L2 makes it advantageous to call two integer cores as a pair of logical cores because they share a local cache. Context switching between those two cores will result in better cache hit rates because data is likely to already reside in L2 if it was used on the other integer core. That improves performance because accessing memory is always slower than hitting cache. It improves latency because you're preserving memory locality, not because you don't understand that an integer core is that thing that does most of the heavy lifting in a general purpose CPU.
All caches exist to improve performance. The more caches, with progressively longer response times, the better overall performance will be. L2 can be made part of a core's design but it doesn't have to be. A core generally only needs L1 caches. An example of cores that have dedicated L1 and L2 is much of the Core i# family (Sandy Bridge, for instance).

Again, the distinguishing feature of a core is that nothing is shared. L2 can't be considered part of a Core 2 Duo core because it is shared with the neighboring core.

Except it doesn't do everything well. It has a huge emphasis on floating-point performance, where POWER7 might be able to keep up with Haswell, but when it comes to integer performance it gets smacked down, just like AMD does with floating-point performance. Intel is successful because it beefs the hell out of its cores. It doesn't mean that what AMD provided is not a core, it just means that its ability to compute per clock cycle is less than Intel's due to the differences inside the cores themselves, not because AMD hasn't produced something that can operate independently. If I really need to dig it out, I can pull up the table of dispatch rates per clock cycle for the most common x86 instructions on several x86-based CPUs. Haswell straight-up dominates everything because it can do the most per clock cycle, which makes sense because the FPU is gigantic and Intel has just been adding ALUs and AGUs.
Which AMD did too, but stupidly required a separate thread to access them.

Sounds like a real world situation to me and by no means a theoretical one. :slap:
Well, you made me test it (everything else the same):
8: 5.6961283 sec -> 5.2848560 sec
1: 24.1833987 -> 24.1773771 sec

It stands to reason that 7 threads wouldn't see that difference, because only where the main thread lies does it compete with a worker thread. 8 is still faster than 7, so...it's just a boost from cutting out the UI updates.


SMT is most definitely a hardware design and to say otherwise is insane. Are you telling me that Intel didn't make any changes to their CPUs to support hyper-threading? That is a boatload of garbage. Bulldozer is two cores with shared hardware, most definitely not SMT. SMT is making your core wider to allow for instruction level parallelism which SMT can take advantage of during operations that don't utilize the entire core or eat up the entire part of a single stage of the pipeline. Bulldozer has dedicated hardware and registers. SMT implementations most definitely don't have a dedicated set of registers, ALUs, and AGUs. They utilize the extra hardware already in the core to squeeze in more throughput which is why hyperthreading gets you anywhere between 0 and 40% performance.
SMT isn't defined by one implementation. It does describe HTT and Bulldozer well. Bulldozer takes away from single-threaded performance to boost multi-threaded performance where Intel does the opposite. At the end of the day, they are different means to the same end (more throughput without adding additional cores).
 

cdawall

where the hell are my stars
SMT is a concept, not a hardware design. SMT is what Bulldozer does (two threads in one core).

But it has two integer clusters that not only behave like, but look like, cores that merely lack an FPU, which isn't used for 90% of instructions?

And again it can process two threads per core or 4 per module.
 

Aquinus

Resident Wat-man
All caches exist to improve performance. The more caches, with progressively longer response times, the better overall performance will be. L2 can be made part of a core's design but it doesn't have to be. A core generally only needs L1 caches. An example of cores that have dedicated L1 and L2 is much of the Core i# family (Sandy Bridge, for instance).
That's not the point. There are benefits to scheduling processes on cores with a shared cache. It doesn't really matter if you consider it to be part of the core or not. Where it is and how it operates is all that matters, and what matters is that calling two cores logical pairs has the benefit of using local cache, improving hit rates, which improves overall performance. Your pretty picture doesn't really add anything to the discussion; it just shows that you know how to use Google.
Which AMD did too, but stupidly required a separate thread to access them.
What are you talking about? AMD did the exact opposite by sharing an FPU and doubling the number of dedicated integer cores. IBM put an emphasis of doing pseudo-GPGPU-like floating point parallelism on the CPU where AMD put an emphasis on independent integer operation. You're comparing these two like they're the same but they're almost as different as a CPU versus a GPU.
Well, you made me test it (everything else the same):
8: 5.6961283 sec -> 5.2848560 sec
1: 24.1833987 -> 24.1773771 sec
Feels pretty theoretical to me. I'm sure that's serving some purpose in the real world that's earning someone money.
SMT isn't defined by one implementation. It does describe HTT and Bulldozer well. Bulldozer takes away from single-threaded performance to boost multi-threaded performance where Intel does the opposite. At the end of the day, they are different means to the same end (more throughput without adding additional cores).
SMT is defined by the implementation just as a discrete computational core is. This isn't software we're talking about. The bold part is exactly what happened, but that doesn't mean they're not cores.
 

FordGT90Concept

"I go fast!1!11!1!"
But it has two integer clusters that not only behave like, but look like, cores that merely lack an FPU, which isn't used for 90% of instructions?
They do not behave like nor look like cores, and the FPU is a major exclusion.

And again it can process two threads per core or 4 per module.
It cannot.

That's not the point. There are benefits to scheduling processes on cores with a shared cache. It doesn't really matter if you consider it to be part of the core or not. Where it is and how it operates is all that matters, and what matters is that calling two cores logical pairs has the benefit of using local cache, improving hit rates, which improves overall performance. Your pretty picture doesn't really add anything to the discussion; it just shows that you know how to use Google.
You're talking code that would have to be written for specific processors. That's not something that generally happens in the x86 world. I doubt even Intel's compiler (which is generally considered the best) exploits the shared L2 of Core 2 Duo in the way you are claiming.

What are you talking about? AMD did the exact opposite by sharing an FPU and doubling the number of dedicated integer cores. IBM put an emphasis of doing pseudo-GPGPU-like floating point parallelism on the CPU where AMD put an emphasis on independent integer operation. You're comparing these two like they're the same but they're almost as different as a CPU versus a GPU.
POWER7 pretty clearly has at least two integer clusters. The only difference between Bulldozer and POWER7 is that POWER7 has a "Unified Issue Queue" where Bulldozer had three separate schedulers (two integer and one floating). That said, each unit could have its own scheduler (not finding anything that details the inner workings of the units).
There are a total of 12 execution units within each core: two fixed-point units, two load-store units, four double-precision floating-point unit (FPU) pipelines, one vector, one branch execution unit (BRU), one condition register logic unit (CRU), and one decimal floating-point unit pipeline. The two load-store pipes can also execute simple fixed-point operations. The four FPU pipelines can each execute double-precision multiply-add operations, accounting for 8 flops/cycle per core.
http://www.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=kalla_power7.pdf
It has quite the mix of hardware accelerating pretty much every conceivable task.

SMT is defined by the implementation just as a discrete computational core is. This isn't software we're talking about. The bold part is exactly what happened, but that doesn't mean they're not cores.
I've yet to see any evidence that proves the module isn't a core and much to the contrary.
 
Joined
Feb 8, 2012
Messages
3,014 (0.65/day)
Location
Zagreb, Croatia
System Name Windows 10 64-bit Core i7 6700
Processor Intel Core i7 6700
Motherboard Asus Z170M-PLUS
Cooling Corsair AIO
Memory 2 x 8 GB Kingston DDR4 2666
Video Card(s) Gigabyte NVIDIA GeForce GTX 1060 6GB
Storage Western Digital Caviar Blue 1 TB, Seagate Baracuda 1 TB
Display(s) Dell P2414H
Case Corsair Carbide Air 540
Audio Device(s) Realtek HD Audio
Power Supply Corsair TX v2 650W
Mouse Steelseries Sensei
Keyboard CM Storm Quickfire Pro, Cherry MX Reds
Software MS Windows 10 Pro 64-bit
I've yet to see any evidence that proves the module isn't a core and much to the contrary.
While you are at it, you should also seek evidence of whether a big rock and a small rock are both rocks, when clearly you can fit several small rocks inside a big rock. (Almost went with the car analogy :laugh: because rocks have no inner workings, but hey, they are silicon and also monolithic, albeit not by design.)
Nobody expects a small car to do the same as a big car, but when it comes to cores people are suddenly acting like they are dealing with SI units. People valuing CPUs by core count might as well pay for CPUs by the kilogram ... btw, I'm gladly trading one kilogram of Celerons for the same mass of i7s.
Maybe a good automotive analogy would be an 8-cylinder engine using one spark plug for each pair of cylinders, fired twice as often :laugh:

Anyway, operating systems always deal with pairs of logical processors to accommodate all possible (existing and not-yet-existing) physical organizations of execution units in a modern superscalar CPU, where both thread data dependency and pure thread parallelism are exploited for optimal gains in all scenarios. This setup is too generic and way too flexible to use the OS's notion of a logical processor as an argument in this case. AMD's half-a-module core is a core, albeit less potent and less scalable; it's not hyper-threading - it's more scalable than that. Only in terms of scaling could you argue an AMD core is less of a core than the norm (hence my market-share tangent - Intel is the norm). So the underdog in the duopoly didn't put enough asterisks and fine print on the marketing material = slap on the wrist (symbolic restitution and an obligatory asterisk with fine print for the future*)

* may scale differently with different types of workload
 

Aquinus

Resident Wat-man
You're talking code that would have to be written for specific processors. That's not something that generally happens in the x86 world. I doubt even Intel's compiler (which is generally considered the best) exploits the shared L2 of Core 2 Duo in the way you are claiming.
You have absolutely no idea what you're talking about, Ford. The application doesn't need to know anything about cache because it's used automatically. When a memory access occurs, cache is usually hit first because latency to check it is relatively fast. A thread moving from one core to another core on the same L2 is likely to have better hit rates at lower latencies because it's using write-back data from when it was executing on the other core. There is no code that has to be written to do this, it just happens because when the memory address is looked up, is in a cached range, and exists, it will use it. I find it laughable you think the compiler is responsible for this. It's not like software is recompiled to handle different cache configurations.
They do not behave like nor look like cores, and the FPU is a major exclusion.
Actually for general purpose computation, it's not a major execution unit because the core can run without it. Just because you think it's necessary doesn't mean everyone agrees with you. The FPU also has never been treated as a core, always as an addition to it and additions can be removed.
It cannot.
I'm pretty sure he meant a thread per core and two threads per module, and yes, it can. Just because the speed-up isn't perfect doesn't mean that it isn't there, but the speed-up is a hell of a lot better than with just about every SMT implementation.
POWER7 pretty clearly has at least two integer clusters.
Two fixed point units and two load store units is another way of saying two ALUs and two AGUs.
http://www.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=kalla_power7.pdf
It has quite the mix of hardware accelerating pretty much every conceivable task.
Interesting; other than being able to partition cores into virtual CPUs for the purpose of dispatching, along with an actual SMT implementation, it sounds exactly like x86. You do realize this is exactly how just about every implementation of a superscalar architecture begins, but I'm sure you'll Google that in no time.
I've yet to see any evidence that proves the module isn't a core and much to the contrary.
That's because you're (a) drawing hard lines on something that's a bit arm-wavy and a bit vague, and (b) using Google to help you make that case. That doesn't mean that you understand what you're reading even if you think you do. Just because you read it on the internet doesn't instantly make you an expert on the subject; it means you know how to use Google.
 

cdawall

where the hell are my stars
I'm pretty sure he meant a thread per core and two threads per module, and yes, it can. Just because the speed-up isn't perfect doesn't mean that it isn't there, but the speed-up is a hell of a lot better than with just about every SMT implementation.

Nope, 2 and 4 is what the scheduler can handle, with many, many more in the queue.
 

FordGT90Concept

"I go fast!1!11!1!"
You have absolutely no idea what you're talking about, Ford. The application doesn't need to know anything about cache because it's used automatically. When a memory access occurs, cache is usually hit first because latency to check it is relatively fast. A thread moving from one core to another core on the same L2 is likely to have better hit rates at lower latencies because it's using write-back data from when it was executing on the other core. There is no code that has to be written to do this, it just happens because when the memory address is looked up, is in a cached range, and exists, it will use it. I find it laughable you think the compiler is responsible for this. It's not like software is recompiled to handle different cache configurations.
Oh, you're talking about context switching. Most of the data is going to be in L3, which virtually all desktop processors have now. This is why the Core i# series has a small L1, small L2, and big L3. Only the L3 is shared across all of the cores. That said, some architectures let cores make requests of other cores' L2 for this very purpose.

Interesting, other than being able to partition cores into virtual CPUs for the purpose of dispatching along with an actual SMT implementation...
Except that IBM consistently calls the whole monolithic block a "core" accepting 8 threads. You know, like sane people do. :laugh:

That's because you're (a) drawing hard lines on something that's a bit arm-wavy and a bit vague, and (b) using Google to help you make that case.
a) Only AMD is "arm-wavy and a bit vague" because they see profit in lying to the public.
b) I haven't used Google in a long time.

Nope, 2 and 4 is what the scheduler can handle, with many, many more in the queue.
I don't know whether that claim is true or not, but I do know that if there are more than two threads in the core (as in "module"), those threads will have to be in a wait state. Bulldozer can't execute more than two at a time.
 
Joined
Feb 8, 2012
Messages
3,014 (0.65/day)
Location
Zagreb, Croatia
System Name Windows 10 64-bit Core i7 6700
Processor Intel Core i7 6700
Motherboard Asus Z170M-PLUS
Cooling Corsair AIO
Memory 2 x 8 GB Kingston DDR4 2666
Video Card(s) Gigabyte NVIDIA GeForce GTX 1060 6GB
Storage Western Digital Caviar Blue 1 TB, Seagate Baracuda 1 TB
Display(s) Dell P2414H
Case Corsair Carbide Air 540
Audio Device(s) Realtek HD Audio
Power Supply Corsair TX v2 650W
Mouse Steelseries Sensei
Keyboard CM Storm Quickfire Pro, Cherry MX Reds
Software MS Windows 10 Pro 64-bit
I don't know whether that claim is true or not, but I do know that if there are more than two threads in the core (as in "module"), those threads will have to be in a wait state. Bulldozer can't execute more than two at a time.
Confusion comes from the fact that it can dispatch 16 instructions per clock, which means nothing core-count-wise for a superscalar processor ... other than the whole superscalar aspect additionally muddying the definition of a core ... add to that list also using the word thread for both hardware threads and software threads
 

Aquinus

Resident Wat-man
Oh, you're talking about context switching. Most of the data is going to be in L3, which virtually all desktop processors have now. This is why the Core i# series has a small L1, small L2, and big L3. Only the L3 is shared across all of the cores. That said, some architectures let cores make requests of other cores' L2 for this very purpose.
Or maybe a smaller L2 is faster, takes up less room, and has better latency characteristics than a large one. When L2 is large, you want hit rates to be high because going down to L3 or memory is going to be extra costly given the added latency of accessing a larger SRAM array. Switching contexts to a core with a common cache improves performance more than you would think, because the further away you get from it, the more time it's going to take to get data in that context. It's the same reason why you have the kernel aware of "cores" and "processors": generally speaking, switching between processors is more costly than switching between cores within a processor, which is more costly than switching between logical cores. It's just exploiting how a kernel scheduler works.
Except that IBM consistently calls the whole monolithic block a "core" accepting 8 threads. You know, like sane people do. :laugh:
That's because any less integer hardware and it couldn't do much of anything at all by itself. :laugh:
a) Only AMD is "arm-wavy and a bit vague" because they see profit in lying to the public.
Slimming out their cores to get more of them isn't misleading the public. The public in general simply doesn't understand what more cores means and it doesn't always mean better performance. That's not AMD's fault. Maybe Intel should be sued for Netburst being shit despite having high clocks. "But it runs at 3.6Ghz!" Give me a freaking break and tell people to stop being so damn lazy and learn about what they're using.
b) I haven't used Google in a long time.
I'm sure you have all of those images stored on your hard drive, ready to go at a moment's notice. :roll:
I don't know whether that claim is true or not, but I do know that if there are more than two threads in the core (as in "module"), those threads will have to be in a wait state. Bulldozer can't execute more than two at a time.
Being a superscalar CPU, it can execute several instructions at once, but it depends on which instructions they are and how they're ordered, and that's per integer core. They have their own dedicated hardware that can do multiple instructions at once depending on which part of the pipeline is going to be utilized. Two different mechanisms to handle superscalar instruction-level parallelism, to me, says core. Each integer core having its own L1-d cache also seems to indicate a core to me, since a core cares about its own data and not the other cores' on the calculation at hand.
 

FordGT90Concept

"I go fast!1!11!1!"
Still waiting on FX-8### data. Chart really doesn't prove anything without it.
 

cdawall

where the hell are my stars
Still waiting on FX-8### data. Chart really doesn't prove anything without it.

I just set my FX-9370 back up at home, haven't had time to test it yet.
 
Joined
Feb 8, 2012
Messages
3,014 (0.65/day)
Location
Zagreb, Croatia
System Name Windows 10 64-bit Core i7 6700
Processor Intel Core i7 6700
Motherboard Asus Z170M-PLUS
Cooling Corsair AIO
Memory 2 x 8 GB Kingston DDR4 2666
Video Card(s) Gigabyte NVIDIA GeForce GTX 1060 6GB
Storage Western Digital Caviar Blue 1 TB, Seagate Baracuda 1 TB
Display(s) Dell P2414H
Case Corsair Carbide Air 540
Audio Device(s) Realtek HD Audio
Power Supply Corsair TX v2 650W
Mouse Steelseries Sensei
Keyboard CM Storm Quickfire Pro, Cherry MX Reds
Software MS Windows 10 Pro 64-bit
I'm also interested in benchmark numbers and scaling @cdawall, although I'm not sure about the effect of @FordGT90Concept's application being .NET based. Instructions are in Common Intermediate Language, executed by a stack-based "virtual machine" process running on a register-based CPU. We must assume the .NET runtime is well optimized for the Bulldozer arch (or maybe someone knows :laugh:).

Sadly, there are not many compiler flags usable for Bulldozer on Windows, even when building directly to machine code with the MS compiler. All we have are generic optimizations, the more generic /favor:AMD64, and the even more generic /arch:[IA32|SSE|SSE2|AVX|AVX2]

Linux folks like Bulldozer a bit more because they have GCC with the "magical" -march=bdver1 compiler option, and AMD's own continuation of the Open64 compiler ... also, all libraries are easily rebuilt in the appropriate "flavor"
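
For anyone wanting to try it, something along these lines would do (a toy integer loop with a hypothetical file name; /favor:AMD64, /arch and -march=bdver1 are just the flags mentioned above):

Code:
// toy.cpp (hypothetical) - trivial integer-only workload to compare builds
// MSVC: cl /O2 /favor:AMD64 /arch:AVX toy.cpp
// GCC:  g++ -O2 -march=bdver1 toy.cpp -o toy
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t acc = 1;
    for (uint64_t i = 1; i < 100000000ULL; ++i)
        acc = acc * 6364136223846793005ULL + i;   // integer math only
    std::printf("%llu\n", (unsigned long long)acc);
    return 0;
}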
 

FordGT90Concept

"I go fast!1!11!1!"
This uses WPF, so the only way it would work on Linux is emulated.
 

cdawall

where the hell are my stars
I used it on my 5960x and couldn't get consistent results...
 