I'm leveraging my experience as a software dev, and as someone with a degree in Comp Sci, on this matter because it's highly applicable and the record needs to be set straight before you start putting weird ideas in people's heads. I'll explain why.
This is easier said than done. You can't simply make all workloads parallel because most code written is serial: there are dependencies between instructions, so the work would need to share state and memory. The issue here is that the overhead of making it parallel can cost more performance than it gains if enough coordination has to occur, because most applications build up data incrementally rather than building it in parallel; each instruction depends on the changes from the last (see serializability). It is the developer's job to write code in a way that can leverage hardware when it's available, and to know when to apply it, because not everything can be done on multiple cores by virtue of the task being done.

The state problem is huge, and locking will destroy multi-threaded performance. As a result, many successful multi-threaded systems have parts of the application decoupled, with queues placed in between tasks that must occur serially (see the sketch after this paragraph). This all sounds fine and dandy, but now you're just passing state through what is essentially a series of queues, and any work has to wait if any part of the pipeline slows down, so you're still limited by the least parallel part of your application. So while you inherently get parallelism by doing this, you add the side effect of increasing latency and not knowing the state of something at any level other than the one it's operating at. In the case of games, which need to react quickly, that added latency means either a lower frame rate or input lag.
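To make that concrete, here's a minimal sketch of the queues-between-stages pattern in Java (the stage logic and class name are made up for illustration), using a bounded BlockingQueue between a producer stage and a consumer stage:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue decouples the two stages; if stage 2 is slow,
        // stage 1 blocks on put() -- the pipeline runs at the speed of
        // its slowest (least parallel) stage.
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    int value = i * i;   // stand-in for "building up data" serially
                    queue.put(value);    // blocks when the queue is full
                }
                queue.put(-1);           // poison pill: signals end of stream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    int value = queue.take();   // blocks when the queue is empty
                    if (value == -1) break;     // end of stream
                    System.out.println("processed " + value);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```

The bounded queue is the whole point: when the consumer stage falls behind, put() blocks and the producer stalls, so throughput is capped by the slowest stage, and every item now pays the extra latency of sitting in a queue.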
So while your comment makes sense when looking at it from a high level, it makes absolutely no sense at a low level, because that isn't how computers work under the hood. You can't simply break apart an application and make it multi-threaded; it simply doesn't work that way.
I would argue that developers need better tools for easily implementing games that can utilize multiple cores, but languages (like Java, Clojure, C#, etc.) already have great constructs for multi-threaded programming; see the sketch below. The question isn't whether you can do it, the question is what the best way to do it is, and no one really knows.
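For example, here's a minimal sketch of one of those constructs, Java's ExecutorService (the work split is invented for illustration). Note it only looks this easy because the chunks have no dependencies on each other:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConstructsSketch {
    static long sumRange(long from, long to) {
        long sum = 0;
        for (long i = from; i < to; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> results = new ArrayList<>();

        // Split the range into four independent chunks -- this only works
        // because no chunk depends on another chunk's result.
        long chunk = 25_000_000L;
        for (int i = 0; i < 4; i++) {
            final long from = i * chunk;
            Callable<Long> task = () -> sumRange(from, from + chunk);
            results.add(pool.submit(task));
        }

        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();   // blocks until that chunk is done
        }
        System.out.println("total = " + total);
        pool.shutdown();
    }
}
```

The moment one chunk needs another chunk's result, you're back to the coordination, flags, and queues described above.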
I tell people this all the time: "If it were that easy, it would have been done already!"
All 8 of the SATA ports in my tower are used. I could go with one M.2 and be done with it, but it's really not designed for mass storage. SATA will always have a place in my book until something better rolls around that doesn't require the device to be attached to the motherboard. (Imagine mounting a spinning drive to a motherboard; that makes for some pretty funny images.)
Side note: I had more issues with IDE cables than SATA, so I'm not complaining.
Having done some programming, I agree entirely. For multithreaded workloads with dependencies you have to have a parent thread start up and dispatch child processes or threads to perform the work while the parent thread synchronizes the results, and that cannot be done in hardware, as the hardware has no idea of the actual work being done; it only sees a string of binary. So say we have a parent thread and at least one child thread, and let's break it down into human language:
The parent thread launches on core 0 and copies in a large dataset for comparison. It then launches a child thread that starts searching at the top of the dataset while the parent starts at the bottom, and the parent has to keep checking a flag set by the child thread to see if it found a match, which reduces its performance. Try to find customer 123456789 in a dataset that is alphanumeric and where customer numbers are randomized: either you sort, you search the whole table, or you check using a smarter algorithm. Depending on the size of the dataset and the processor speed, it might be as fast to sort on a single core as it is to search and compare on multiple cores.
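A minimal sketch of that exact scenario in Java (the dataset and customer numbers are invented for illustration), using an AtomicBoolean as the shared "found it" flag:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class TwoEndedSearch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical dataset: randomized customer numbers (unsorted).
        int[] customers = ThreadLocalRandom.current()
                .ints(10_000_000, 0, Integer.MAX_VALUE)
                .toArray();
        int target = customers[123_456];   // pick a value we know exists

        AtomicBoolean found = new AtomicBoolean(false);
        AtomicInteger foundIndex = new AtomicInteger(-1);

        // Child thread scans from the top of the array...
        Thread child = new Thread(() -> {
            for (int i = 0; i < customers.length && !found.get(); i++) {
                if (customers[i] == target) {
                    found.set(true);        // flag checked by both threads
                    foundIndex.set(i);
                }
            }
        });
        child.start();

        // ...while the parent scans from the bottom. Every pass through
        // the loop it also checks the shared flag, which costs it some
        // throughput compared to a lone single-threaded scan.
        for (int i = customers.length - 1; i >= 0 && !found.get(); i--) {
            if (customers[i] == target) {
                found.set(true);
                foundIndex.set(i);
            }
        }

        child.join();   // parent synchronizes the result
        System.out.println("found at index " + foundIndex.get());
    }
}
```

The `!found.get()` in both loop conditions is exactly the overhead described above: each thread pays a per-iteration flag check as the price of being able to stop early once the other thread finds the match.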