
Intel Leads a New Era of Computing Experiences

Joined
Nov 4, 2005
Messages
12,013 (1.72/day)
System Name Compy 386
Processor 7800X3D
Motherboard Asus
Cooling Air for now.....
Memory 64 GB DDR5 6400 MHz
Video Card(s) 7900XTX 310 Merc
Storage Samsung 990 2TB, 2 SP 2TB SSDs, 24TB Enterprise drives
Display(s) 55" Samsung 4K HDR
Audio Device(s) ATI HDMI
Mouse Logitech MX518
Keyboard Razer
Software A lot.
Benchmark Scores Its fast. Enough.
I'm leveraging my experience as a software dev, and as a person with a degree in Comp Sci, because it's highly applicable and the record needs to be set straight before weird ideas get put in people's heads. I'll explain why.

This is easier said than done. You can't simply make all workloads parallel: most code written is serial, there are dependencies between instructions, and as a result the work has to share state and memory. The issue is that the overhead of making it parallel can cost more performance than it gains if enough coordination has to occur, because most applications build up data step by step rather than in parallel; each instruction depends on the changes from the last (see serializability). It is the developer's job to write code in a way that can leverage the hardware when it's available, and to know when to apply it, because not everything can be done on multiple cores by virtue of the task being done.

The state problem is huge, and locking will destroy multi-threaded performance. As a result, many successful multi-threaded systems decouple parts of the application with queues placed between tasks that must occur serially. This all sounds fine and dandy, but now you're just passing state through a series of queues, and work will have to wait if any part of the pipeline slows down (so you're still limited by the least parallel part of your application). So while you inherently get parallelism by doing this, you add the side effect of increasing latency and of not knowing the state of something at any level other than the one it's operating at. In the case of games, which need to react quickly, that added latency means either lower frame rate or input lag.
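To make the queue-decoupling idea concrete, here's a minimal sketch (Python, purely illustrative, not from any real engine): two serial stages connected by a queue. Each stage is simple on its own, but every hop adds latency and the slowest stage caps the whole pipeline.

```python
import queue
import threading

results = []

def stage(inbox, outbox, work):
    """Run one serial stage: pull items, transform, push downstream."""
    while True:
        item = inbox.get()
        if item is None:              # sentinel: drain and shut down
            if outbox is not None:
                outbox.put(None)
            break
        result = work(item)
        if outbox is not None:
            outbox.put(result)        # hand off to the next stage
        else:
            results.append(result)    # final stage collects output

q1, q2 = queue.Queue(), queue.Queue()

t1 = threading.Thread(target=stage, args=(q1, q2, lambda x: x * 2))
t2 = threading.Thread(target=stage, args=(q2, None, lambda x: x + 1))
t1.start(); t2.start()

for i in range(5):
    q1.put(i)
q1.put(None)                          # tell the pipeline to stop

t1.join(); t2.join()
print(results)                        # [1, 3, 5, 7, 9]
```

The stages run concurrently, but each item still has to traverse every queue in order, which is exactly the latency trade-off described above.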

So while your comment makes sense at a high level, it makes absolutely no sense at a low level, because that isn't how computers work under the hood: you can't simply break an application apart and make it multi-threaded. It simply doesn't work that way.

I would argue that developers need better tools for easily implementing games that can utilize multiple cores, but languages (like Java, Clojure, C#, etc.) already have great constructs for multi-threaded programming. The question isn't whether you can do it; the question is what the best way to do it is, and no one really knows.
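As a rough illustration of the kind of construct meant here (the same idea as Java's ExecutorService or C#'s Task library, sketched in Python; the worker function is made up):

```python
from concurrent.futures import ThreadPoolExecutor

# A thread pool hides the thread management; you just submit work.
# The hard part, as discussed above, isn't the mechanism — it's
# deciding what is actually safe to run in parallel.

def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(8)))   # order is preserved

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

This works precisely because each call to `square` is independent; the moment the tasks share mutable state, you're back to the coordination problems above.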

I tell people this all the time: "If it were that easy, it would have been done already!"


All 8 of the SATA ports in my tower are used. I could go with one M.2 and be done with it, but it's really not designed for mass storage. SATA will always have a place in my book until something better rolls around that doesn't require the device to be attached to the motherboard. (Imagine mounting a spinning drive to a motherboard; that makes for some pretty funny images. :p)

Side note: I had more issues with IDE cables than SATA, so I'm not complaining.


Having done some programming, I agree entirely. For multithreaded workloads with dependencies, you have to have a parent thread start and dispatch child processes or threads to perform the work while the parent thread synchronizes the results, and that cannot be done in hardware, as the hardware has no idea of the actual work being done; it only sees a string of binary. So say we have a parent thread and at least one child thread, and break it down into human language.


The parent thread launches on core 0 and copies in a large dataset for comparison. It then launches a child thread that starts searching at the top of the dataset while the parent starts at the bottom, and the parent has to keep checking a flag set by the child thread to see if it found a match, which reduces its performance. Try to find customer 123456789 in a dataset that is alphanumeric and where customer numbers are randomized: either you sort, you search the whole table, or you use a smarter algorithm. Depending on the size of the dataset and processor speed, it might be as fast to sort on a single core as it is to search and compare on multiple cores.
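A rough sketch of that search in Python (illustrative only; the dataset and names are made up, and here both threads poll the shared flag rather than just the parent):

```python
import threading

found = threading.Event()   # the "match found" flag both threads check
hit = []                    # index of the match, if any

def search(data, target, indices):
    for i in indices:
        if found.is_set():          # someone else found it: stop early
            return
        if data[i] == target:
            hit.append(i)
            found.set()
            return

data = [f"cust{n}" for n in (42, 7, 19, 123456789, 3, 88)]
target = "cust123456789"

# Child scans from the top of the dataset...
child = threading.Thread(target=search,
                         args=(data, target, range(len(data))))
child.start()
# ...while the parent scans from the bottom.
search(data, target, range(len(data) - 1, -1, -1))
child.join()

print(hit[0] if hit else None)
```

The flag check on every iteration is exactly the coordination overhead described: each thread spends cycles looking over its shoulder instead of searching.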
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.80/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
I see what you're trying to say, but sorting is a bad example because it's working with immutable data and some algorithms can be implemented to be multi-threaded (and algorithms like merge sort are a great example of the divide and conquer strategy.) So consider this example.

You have 4 variables, A, B, C, and D.
A is assigned an initial value of 4.
B is assigned an initial value of 7.
C is assigned an initial value of A plus B.
D is assigned an initial value of C plus B.
C is reassigned a value of current C plus D.
D is reassigned the value of C.

If you try to make this application multi-threaded, then to successfully get the intended result, all of the used variables have to be locked to prevent another thread from accessing them. But let's consider that there is absolutely no thread safety and we say "thread 1 does even instructions, thread 2 does odd."

What if this is a single-core machine running multiple threads? If thread 2 runs first, it will seg fault or a null exception will get thrown, because D isn't set yet when its instruction 3 runs.

So thread 1 has some instructions that look like this (assuming variables were declared in a serial fashion before threads are spun up.)
  1. B is assigned an initial value of 7.
  2. D is assigned an initial value of C plus B.
  3. D is reassigned the value of C.

So thread 2 has something that looks like this:
  1. A is assigned an initial value of 4.
  2. C is assigned an initial value of A plus B.
  3. C is reassigned a value of current C plus D.

Tell me, what happens first: instruction 1 on thread 1, or instruction 1 on thread 2? Welcome to the world of race conditions, where you require locking to be able to deterministically know that stuff happens in order. So let's say it doesn't matter: instructions one and two are independent, using different spots in memory. Instruction 2 on either thread, though, expects that C is already set. So now you're hitting a case where thread 1 and thread 2 need to be aware of what the other is doing. That's overhead and wasted time, because a single thread could have been done already. Then consider instruction 3 on either thread: you're already in a non-deterministic state because of the race condition on the C variable, so now D will be unreliable.

This only begins to touch on the complications of SMP and of implementing software to effectively utilize it. Add locking and it will run in lock step just fine, but you're probably taking twice as long because of the coordination.
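To make the locking point concrete, here's a sketch of the A/B/C/D example in Python (illustrative only). The lock prevents torn updates, but it alone doesn't fix the ordering problem: thread 2 reads the C and D that thread 1 writes, so getting the intended answer means serializing the threads anyway, which is exactly the point.

```python
import threading

state = {"A": 4, "B": 7, "C": 0, "D": 0}
lock = threading.Lock()

def thread1():
    with lock:
        state["C"] = state["A"] + state["B"]   # C = 4 + 7 = 11
        state["D"] = state["C"] + state["B"]   # D = 11 + 7 = 18

def thread2():
    with lock:
        state["C"] = state["C"] + state["D"]   # C = 11 + 18 = 29
        state["D"] = state["C"]                # D = 29

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)

# If t2 grabbed the lock first, C and D would come out wrong — the lock
# guarantees atomicity, not order. Joining t1 before starting t2 forces
# the order, and in doing so runs the whole thing serially.
t1.start(); t1.join()
t2.start(); t2.join()

print(state["C"], state["D"])   # 29 29
```

So you pay for the lock and you still end up with a serial execution; the coordination bought correctness, not speed.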

Applicable experience: I've written a priority-based, multi-threaded system integration service at work that can make hundreds of API calls per second based on database state changes, compared to a similar PHP tool we have that does a mere ~450 API calls in 5 minutes. Trust me, I know what more cores can do, but I also know what they can't do as a result.
 
Joined
Mar 23, 2012
Messages
777 (0.17/day)
Location
Norway
System Name Games/internet/usage
Processor i7 5820K 4.2 GHz
Motherboard ASUS X99-A2
Cooling custom water loop for cpu and gpu
Memory 16GiB Crucial Ballistix Sport 2666 MHz
Video Card(s) Radeon Rx 6800 XT
Storage Samsung XP941 500 GB + 1 TB SSD
Display(s) Dell 3008WFP
Case Caselabs Magnum M8
Audio Device(s) Schiit Modi 2 Uber -> Matrix m-stage -> HD650
Power Supply beQuiet dark power pro 1200W
Mouse Logitech MX518
Keyboard Corsair K95 RGB
Software Win 10 Pro

If you want it even more difficult, add time constraints to that problem, like a real-time system, and languages like Ada and Occam start to sound fun.

But there is a reason consumer CPUs are still stuck on 4 cores while server CPUs now have 18 cores and 36 threads: the normal stuff people do on their computers is not optimized for many cores, either because the problems are serial or because the consumer software is considered good enough that it doesn't warrant the extra cost of having engineers drastically change it to perform better. Just look at the LAME MP3 encoder.


That is what a NAS is for; I have one SSD and no ODD in my tower.
 