# New multithreaded CPU benchmark: "Eight queens puzzle"



## BAGZZlash (Jun 10, 2016)

Today I stumbled upon the "eight queens puzzle": the task of placing n chess queens on an n×n chessboard so that no two queens threaten each other. Computationally, the challenge is to find the total number of solutions for boards of different size n. The larger the board, the more solutions there are and the longer the computation takes, of course. For n = 18, the number of solutions is 666,090,624. Solutions are known for n up to 26; for n >= 27, the number of solutions is still unknown.

Back in 2011, takaken wrote a neat algorithm to compute the number of solutions, implemented as multithreaded C code. I figured I'd turn this into a multithreaded CPU benchmark. The code computes the number of solutions for n = 18: for n = 17, computation would be too fast on modern CPUs, and for n = 19, users would have to wait too long for the result.
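For the curious, the core counting idea is plain depth-first backtracking, one queen per row. Here's a naive sketch of that idea (takaken's actual Q.c is far more optimized than this; the function names below are mine, not his):

```c
#include <stdint.h>
#include <stdlib.h>

/* Naive N-queens counter: place one queen per row, backtrack on conflict.
 * col[r] holds the column of the queen in row r. */
static int safe(const int *col, int row, int c)
{
    for (int r = 0; r < row; r++)
        if (col[r] == c || abs(col[r] - c) == row - r)
            return 0;               /* same column or same diagonal */
    return 1;
}

static uint64_t place(int *col, int n, int row)
{
    if (row == n)
        return 1;                   /* all n queens placed: one solution */
    uint64_t count = 0;
    for (int c = 0; c < n; c++)
        if (safe(col, row, c)) {
            col[row] = c;
            count += place(col, n, row + 1);
        }
    return count;
}

uint64_t queens(int n)
{
    int col[32] = {0};              /* plenty for the board sizes discussed here */
    return place(col, n, 0);
}
```

For n = 8 this yields the classic 92 solutions; the counts explode from there, which is why n = 18 makes a decent benchmark workload.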

Please *download the attached file* to obtain the binary package, including the source code.

I compiled the code for Windows using GCC: `gcc -fopenmp Q.c -o Q.exe`.

Usage is simple: type Q [_number of threads to use_]. For example, run Q 8 for eight threads. If no number, an invalid number, or a number greater than the hardware maximum is provided, the program uses the maximum possible number of threads.
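That fallback logic only takes a few lines. A minimal sketch (the function name `clamp_threads` is mine, not from Q.c; the real program would pass `omp_get_max_threads()` as `hw_max` and feed the result to `omp_set_num_threads()`):

```c
#include <stdlib.h>

/* Turn the optional command-line argument into a usable thread count.
 * atoi() returns 0 for missing or non-numeric input, so anything invalid
 * or out of range falls back to the hardware maximum.
 * Sketch only; Q.c's actual parsing may differ. */
int clamp_threads(const char *arg, int hw_max)
{
    int n = arg ? atoi(arg) : 0;
    if (n < 1 || n > hw_max)
        n = hw_max;
    return n;
}
```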

Have fun! 



Spoiler: Here are some results on an Intel Core i7-4790 @ 3.6 GHz:





```
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 21.01

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 15.55

Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 10.38

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs):  9.53

Using 5 thread(s).
Elapsed time (hh:mm:ss:cs):  9.23

Using 6 thread(s).
Elapsed time (hh:mm:ss:cs):  8.59

Using 7 thread(s).
Elapsed time (hh:mm:ss:cs):  8.27

Using 8 thread(s).
Elapsed time (hh:mm:ss:cs):  7.69
```


----------



## Disparia (Jun 10, 2016)

i5-6600 stock with turbo on (between 3.6Ghz and 3.9Ghz)






Adding: FX-8120 Stock





Adding: A8-5500 Stock


----------



## Solaris17 (Jun 10, 2016)

This is cool.


----------



## uuuaaaaaa (Jun 10, 2016)

AMD PII x6 1100T @ 3.8GHz /NB @ 2800MHz/ 4X4Gb 6 - 7 - 6 - 16 - 1T @1333MHz

Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 19.31

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 14.08

Using 3 thread(s).
Elapsed time (hh:mm:ss:cs):  9.00

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs):  8.54

Using 5 thread(s).
Elapsed time (hh:mm:ss:cs):  8.14

Using 6 thread(s).
Elapsed time (hh:mm:ss:cs):  6.83


----------



## Vayra86 (Jun 10, 2016)

This is fucking beautiful in all its simplicity.


----------



## FordGT90Concept (Jun 10, 2016)

If it is not using a random seed, it's inefficient at finding a unique solution.

Core i7-6700K @ 4.0 GHz, 10 runs:


> Using 8 thread(s).
> Elapsed time (hh:mm:ss:cs):  *7.07*
> Using 8 thread(s).
> Elapsed time (hh:mm:ss:cs):  *7.55*
> ...




I wrote two similar programs (one stresses CPU, the other stresses RAM). Depending on the seed, they can find a solution in a fraction of a second or take several minutes.

Chess.exe = Knight's Tour (CPU intensive, very fast)
NumberMazeWpf.exe = number maze (RAM intensive, extremely variable)

Be very, very, very careful with NumberMaze. If the grid is too big and sufficiently connected, it can eat all your memory... and your virtual memory too.


----------



## the54thvoid (Jun 10, 2016)

i7 3930k @ 4.2 GHz

12 threads - 3.96
6 threads - 5.68

8 threads - 5.68
4 threads - 6.91

Somebody run a Skylake at 4.2 GHz so I can see how much I need to upgrade


----------



## FordGT90Concept (Jun 10, 2016)

Your 4 thread is faster than my 8 thread...


----------



## Liquid Cool (Jun 10, 2016)

Jizzler...

I ran my AMD A10-5750M to compare to your A8-5550M... Wanna trade? Kidding, of course... but I actually liked my laptop better with the original A8-5550M. I recently "upgraded" to the A10, and my laptop seems a little slower...





Best,

Liquid Cool


----------



## BAGZZlash (Jun 10, 2016)

FordGT90Concept said:


> If it is not using a random seed, it's inefficient at finding a unique solution.



It doesn't use anything random at all. Look at the source code; it uses deterministic backtracking.

Contributin':



Spoiler: Intel Core 2 Quad Q6600 @ Stock (2.4 GHz) - Wow, this is slow





```
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 32.35

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 27.29

Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 15.09

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 14.20
```






Spoiler: Intel Core 2 Duo E6750 @ Stock (2.66 GHz)





```
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 29.44

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 21.69
```


----------



## Dippyskoodlez (Jun 10, 2016)




----------



## PHaS3 (Jun 10, 2016)

3570K @ 4.5


----------



## FordGT90Concept (Jun 10, 2016)

BAGZZlash said:


> It doesn't use anything random at all. Look at the source code, it uses deterministic backtracking.


I edited since I said that.  I'm half tempted to code a Knight's Corner-like solution that involves a random seed. I suspect it could find a solution much, much faster, but performance would be at the mercy of the random seed. My methodology would simulate the board (like Knight's Corner and Number Maze): throw up one queen at a random position, update the board with viable positions for the next, and so on until it solves or gets stuck. That doesn't work well as a benchmark, though.


----------



## Drone (Jun 10, 2016)

Asus X555LD ultrabook


----------



## BAGZZlash (Jun 10, 2016)

FordGT90Concept said:


> I'm half tempted to code Knight's Corner-like solution that involves a random seed.


That would be great. The currently known solution count for n = 26 was computed back in 2009. Time to crack n = 27! 



FordGT90Concept said:


> That doesn't work well as a benchmark though.


Doesn't matter much. Do you have this nice piece of hardware?


----------



## FordGT90Concept (Jun 10, 2016)

BAGZZlash said:


> That would be great. The currently known solution count for n = 26 was computed back in 2009. Time to crack n = 27!


That's the problem with the random approach.  It finds a solution rather than all solutions.


----------



## truth teller (Jun 10, 2016)

pentium 4 670




wut?


----------



## uuuaaaaaa (Jun 10, 2016)

truth teller said:


> pentium 4 670
> 
> 
> 
> ...



I have to try my 478 3.4EE Gallatin on this! hahaha


----------



## Frick (Jun 10, 2016)

truth teller said:


> pentium 4 670
> 
> 
> 
> ...



Yeah the results I see are all over the place, it's hard to tell anything.


----------



## biffzinker (Jun 10, 2016)

Core i7-4790K @ 4.8 GHz/Uncore @ 4.5 GHz


----------



## natr0n (Jun 10, 2016)




----------



## uuuaaaaaa (Jun 10, 2016)

natr0n said:


>



That vcore, is it safe? My 1100T did 4.0 GHz at 1.42 V on my CHIII Formula...


----------



## natr0n (Jun 10, 2016)

uuuaaaaaa said:


> That vcore, is it safe? My 1100T did 4.0 GHz at 1.42 V on my CHIII Formula...



It was really 1.5125 V in the BIOS. I just loaded my last saved OC profile and it was a bit high while testing. I have a shitty board. It's safe for little tests though; I have heatsinks on the VRMs, fans, and such.


----------



## ERazer (Jun 10, 2016)

i need to try this when i get home, sub


----------



## broken pixel (Jun 10, 2016)

Program does not work for me. It says it's testing 12 threads, then the window closes after a few seconds. ;*(


----------



## natr0n (Jun 10, 2016)

broken pixel said:


> Program does not work for me. It says it's testing 12 threads, then the window closes after a few seconds. ;*(




open command prompt as admin then drag and drop q.exe then hit enter


----------



## broken pixel (Jun 10, 2016)

I think I tried that; I will try again when I get back to my office.


----------



## biffzinker (Jun 10, 2016)

When you unzip the file, hold down the shift key while right clicking on the folder, and choose the "Open Command Window Here."


----------



## silentbogo (Jun 10, 2016)

Forgot to make a screenshot, but the results are a bit off: 
Xeon X5650 running @ 3.3 GHz
~24 sec for 1 thread
~6.5 sec for 12 threads
and everything in between for 2, 4, 6, and 8 threads

The CPU is barely breaking a sweat - 9-12% load regardless of the number of threads - and it doesn't even go to turbo (the workload isn't intense enough).
I haven't looked at the code yet, but I suspect something is holding back the multi-threaded part. I messed with OpenMP and OpenMPI (some simple image-processing stuff) a few years ago and never saw such abnormal scaling...


----------



## jboydgolfer (Jun 10, 2016)

The only way I can get it to work is to hold shift, right-click, open a command window, then drag and drop, and it only runs a single pass @ 8 threads.
I'm sure I could modify the command line at the top of the console, but I have no time right now.
Intel Xeon E3 1231v3 @ 3.4-3.8 GHz
Picked up the Xeon, which was supposed to come via Newegg/FedEx but wasn't delivered, so I got it locally. Loving the threads. Sadly, I'll need to give it back to its owner though.


----------



## Dippyskoodlez (Jun 10, 2016)

silentbogo said:


> Forgot to make a screenshot, but the results are a bit off:
> Xeon X5650 running @3.3GHz
> ~24 sec for 1 thread
> ~6.5 sec for 12 threads
> ...



Yeah, for something that resolves this fast, I would expect the 12 threads to hit micromanagement issues before being able to squeak out any times below 3 s. Continuous runs seem to have a cache reliance as well.


----------



## FordGT90Concept (Jun 10, 2016)

Dippyskoodlez said:


> Yeah, for something that resolves this fast, I would expect the 12 threads to hit micromanagement issues before being able to squeak out any times below 3 s. Continuous runs seem to have a cache reliance as well.


I've run it several times and it peaked at 61% CPU. It climbs up, hits a peak, then climbs down.

For comparison, I ran Chess repeatedly and it took about 20 tries to get a run that lasted longer than four seconds (so Task Manager could register it); the peak was 87%. Chess is async multithreaded, so as long as the stack of work is sufficiently large, it will load the CPU to 100%.

Q often doesn't even exceed 50% over 7 seconds.

Comparing the two programs in Task Manager, I think it is possible that a 12-core machine, finishing in ~3 seconds, may never be able to reach 100% load.

The first thing I'd try is making the board larger. 12 threads should take at least 5-10 seconds to finish.


----------



## Ferrum Master (Jun 10, 2016)

Mine are 

Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 13.69
Using 4 thread(s).
Elapsed time (hh:mm:ss:cs):  6.10
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs):  4.94
Using 12 thread(s).
Elapsed time (hh:mm:ss:cs):  3.48


----------



## Arctucas (Jun 11, 2016)

6700K @ 4.7

8 threads = 5.04


----------



## BAGZZlash (Jun 11, 2016)

Okay, here are a few things.

1.) I put the results we have so far into a table and sorted it by the one-thread computing times.

2.) I made a few changes to the program:
2a) Not entering a number of threads (or entering an invalid one) now makes the program iterate through all available thread settings. That is, if you have, say, four cores and just run the program, it will compute the results with four threads, then three, then two, then one.
2b) For those of you with rather fast CPUs, I added a command line option "large". This switches n from 18 to 19. Lots more computation to do; it will take a minute even on the fastest CPUs.
2c) In either case, the program now waits for the user to press the enter key after it's done. This prevents the window from closing.

For the larger chessboard, I figured the parallel threads may have very different computation times. With the "large" chessboard, I see a clear pattern on my quad-core: first, the CPU load hits 100%. After a while, it declines to 75%, showing that one of the four threads is done and the other three are still working. After a few seconds, the load drops to 50%, then 25%, then the program is done. Can you confirm this behavior?


----------



## Ferrum Master (Jun 11, 2016)

It works funny with many cores indeed.

Using 12 thread(s).
Elapsed time (hh:mm:ss:cs):  3.40
Using 11 thread(s).
Elapsed time (hh:mm:ss:cs):  3.35
Using 10 thread(s).
Elapsed time (hh:mm:ss:cs):  3.52
Using 9 thread(s).
Elapsed time (hh:mm:ss:cs):  3.54
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs):  4.95
Using 7 thread(s).
Elapsed time (hh:mm:ss:cs):  4.91
Using 6 thread(s).
Elapsed time (hh:mm:ss:cs):  4.86
Using 5 thread(s).
Elapsed time (hh:mm:ss:cs):  5.72
Using 4 thread(s).
Elapsed time (hh:mm:ss:cs):  6.01
Using 3 thread(s).
Elapsed time (hh:mm:ss:cs):  6.39
Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 10.00
Using 1 thread(s).

And with large

Using 12 thread(s).
Elapsed time (hh:mm:ss:cs): 22.61
Using 11 thread(s).
Elapsed time (hh:mm:ss:cs): 23.65
Using 10 thread(s).
Elapsed time (hh:mm:ss:cs): 23.36
Using 9 thread(s).
Elapsed time (hh:mm:ss:cs): 22.07
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 32.80
Using 7 thread(s).
Elapsed time (hh:mm:ss:cs): 32.79
Using 6 thread(s).
Elapsed time (hh:mm:ss:cs): 31.13
Using 5 thread(s).
Elapsed time (hh:mm:ss:cs): 40.04
Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 43.58
Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 52.90
Using 2 thread(s).
Elapsed time (hh:mm:ss:cs):  1:16.15
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs):  1:39.01


----------



## n0tiert (Jun 11, 2016)

C:\Users\n0tiert\Downloads\Q>q 8
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs):  8.33

FX-8150@4GHZ

If I run "q.exe N" it only runs once; is that correct?


----------



## FordGT90Concept (Jun 11, 2016)

BAGZZlash said:


> For the larger chessboard, I figured the parallel threads may have very different computation times. With the "large" chessboard, I see a clear pattern on my quad-core: first, the CPU load hits 100%. After a while, it declines to 75%, showing that one of the four threads is done and the other three are still working. After a few seconds, the load drops to 50%, then 25%, then the program is done. Can you confirm this behavior?


It sat at 100% for 2-3 seconds, then plummeted to 25% within a second or two and stayed there for a while. It presumably then went down to 12.5% and finished. It does not fully utilize the CPU for long.




In the picture, where you see it shoot up from <50% back up to 100%, that's WCG taking back the idle clocks so disregard that...

I believe the steps are:
8 cores -> 6 cores -> 5 cores -> 4 cores -> 3 cores -> 2 cores/BOINC 8 cores
100% -> 75% -> 62.5% -> 50% -> 37.5% -> 25% (doesn't reach it before BOINC takes over)

Over half of the time the program runs, it's using 25% or less of the resources available to it.

Might I suggest using a Queue on the main thread and each core pulling off a job from it?  That's how I usually do it.


Also, why run on less than maximum cores by default?


----------



## biffzinker (Jun 11, 2016)

The screenshot of Task Manager was with the command line option "large." It's a capture of the 8 thread then 7 thread.


----------



## Aquinus (Jun 11, 2016)

@FordGT90Concept and @biffzinker, depending on how the OP designed the benchmark, you may not see 100% usage even though up to 8 threads are being used.

As Ford said: 


FordGT90Concept said:


> Might I suggest using a Queue on the main thread and each core pulling off a job from it? That's how I usually do it.


This is useful if and only if the algorithm is brute-forcing the solutions. Using a queue to divvy out tasks is the most basic way to accelerate purely parallel workloads (such as dispatching at the job level), but it's not the most efficient way to solve the problem. A divide-and-conquer algorithm will eventually exhibit the behavior you two are describing: as the task is broken apart, it can utilize more cores, and a lot of applications benefit from this up to a point, but it requires threads to join up with each other when their "slice" of the calculation is complete. The computer can't put those threads to work elsewhere while a significant amount of serial work remains.

Depending on how @BAGZZlash implemented it, a little bit of heuristics, queueing, or a deeper level of concurrency beyond the brute-force approach might improve performance significantly. I did notice that the OP used OpenMP, which means the application is probably written in C or C++, where a big hurdle is actually creating the multi-threaded part. I would argue that a more dynamic language with richer data structures would probably help, at the expense of some computational power.

I did take a peek at the link the OP provided, and the algorithm is most definitely a brute-force method with minimal heuristics. I'm tempted to create my own version, though it probably won't be in C/C++ but rather a language like Clojure.

Would anyone be interested? If so, it might be an incentive for me to do it.


----------



## Steevo (Jun 11, 2016)

My 1100T

6 Threads 6:38
1 Thread 18:11


----------



## cdawall (Jun 11, 2016)




----------



## biffzinker (Jun 11, 2016)

Aquinus said:


> @FordGT90Concept and @biffzinker, depending on how the OP designed the benchmark, you may not see 100% usage even though up to 8 threads are being used.


Wasn't a biggie to me, thought I would try to contribute in another way if it helped.


----------



## Enterprise24 (Jun 11, 2016)

i5-6500 @ 5Ghz


----------



## truth teller (Jun 11, 2016)

BAGZZlash said:


> I put the results we have so far into a table and sorted it by the one-thread computing times.


This P4 on top of anything, lol. That's a first.




and here is the usage for



```
q 2 large
```


----------



## Enterprise24 (Jun 11, 2016)

still pass...


----------



## FordGT90Concept (Jun 11, 2016)

Aquinus said:


> I did notice that the OP used OpenMP which means the application is probably written in C or C++ which means that a big hurdle is actually creating the multi-threaded part. I would argue a more dynamic language with richer data structures would probably help at the expense of some computational power.


It is C and the source is in the ZIP ("Q.c"). I don't know enough C to sort through it and see whether queuing is possible.


----------



## JrockTech (Jun 11, 2016)

FX 8320 @ 4.95Ghz


----------



## BiggieShady (Jun 11, 2016)

FordGT90Concept said:


> It is C and the source is in the ZIP ("Q.c"). I don't know enough C to sort through it and see whether queuing is possible.


Not really possible, because OpenMP handles thread scheduling by itself: you just put the _#pragma omp parallel for_ construct before your for loop and OpenMP distributes the iterations to different threads.
What you can do is choose from four scheduler modes: static, dynamic, guided, or runtime. The last two are special cases of dynamic.
Basically, static has the least locking; it does a simple round-robin and expects the iteration count of the for loop never to change, so the chunks can be calculated at compile time.
Dynamic calculates all chunks at runtime and requires more locking.
Here, the iteration count of the parallelized for loop is 18. Static scheduling could be used, but each iteration is heavy and long-running, so the granularity is too coarse to harvest more efficiency by changing the thread scheduling. This is why the scaling is off; true scaling would only be seen on 18+ core Xeons.
Additionally, this code cannot be parallelized at a finer granularity, because only the calculations for each first-queen position (and the subsequent brute-force search down the hierarchy) are independent of each other.
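The coarse granularity is easy to see in a sketch: the whole parallel surface is one loop over the first queen's position, so there are only n uneven chunks of work to hand out. A minimal, self-contained illustration (my own simplified bitmask solver, not takaken's Q.c) using dynamic scheduling:

```c
#include <stdint.h>

/* Count solutions below a fixed first queen by bitmask backtracking.
 * cols/ld/rd are the columns and diagonals already under attack. */
static uint64_t solve(int n, uint32_t cols, uint32_t ld, uint32_t rd)
{
    if (cols == (1u << n) - 1)
        return 1;                       /* board filled: one solution */
    uint64_t cnt = 0;
    uint32_t avail = ~(cols | ld | rd) & ((1u << n) - 1);
    while (avail) {
        uint32_t bit = avail & (uint32_t)-(int32_t)avail;  /* lowest free square */
        avail -= bit;
        cnt += solve(n, cols | bit, (ld | bit) << 1, (rd | bit) >> 1);
    }
    return cnt;
}

/* Only n parallel iterations, each wildly different in cost: exactly the
 * coarse granularity described above. schedule(dynamic, 1) at least lets
 * an idle thread grab the next unstarted column instead of being bound
 * to a fixed round-robin share. */
uint64_t count_all(int n)
{
    uint64_t total = 0;
    #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (int col = 0; col < n; col++) {
        uint32_t bit = 1u << col;
        total += solve(n, bit, bit << 1, bit >> 1);
    }
    return total;
}
```

Compiled without `-fopenmp` the pragma is simply ignored and the loop runs serially, which makes the scheduling clause a safe, isolated thing to experiment with.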


----------



## FordGT90Concept (Jun 11, 2016)

That explains why it falls to 2-3 threads (n = 18, or 19 in the case of large) on my system. It knocks out the first 8 (100% CPU), then the second 8 (100%, falling off), leaving the remaining 2-3 (25-37.5%, falling off). Without major reworking of the algorithm, it does not make a good benchmark because of that bias.


----------



## Aquinus (Jun 11, 2016)

BiggieShady said:


> Not really possible, because OpenMP handles thread scheduling by itself: you just put the _#pragma omp parallel for_ construct before your for loop and OpenMP distributes the iterations to different threads.
> What you can do is choose from four scheduler modes: static, dynamic, guided, or runtime. The last two are special cases of dynamic.
> Basically, static has the least locking; it does a simple round-robin and expects the iteration count of the for loop never to change, so the chunks can be calculated at compile time.
> Dynamic calculates all chunks at runtime and requires more locking.
> ...


Thanks for the explanation. My C-foo is pretty weak since I haven't done any C since my undergrad, but that's what it was starting to look like to me as well. There are some things in the code that bother me, such as the "goto" statements. It feels like it was written by a dev who writes drivers or low-level OS code. It also appears that practically no heuristics are used, which drives me nuts.

With that said, I've started writing another version that is written in Clojure that focuses on using set logic and sets of positions to find solutions instead. We'll see how it goes.


FordGT90Concept said:


> That explains why it falls to 2-3 threads (n = 18, or 19 in the case of large) on my system. It knocks out the first 8 (100% CPU), then the second 8 (100%, falling off), leaving the remaining 2-3 (25-37.5%, falling off). Without major reworking of the algorithm, it does not make a good benchmark because of that bias.


This happens because some jobs finish faster than others, so one thread might finish before another. OpenMP is a very old approach to multi-threading and doesn't work well when loop iterations take inconsistent amounts of time to execute. It's a bad way to build an application, in my opinion; the dev needs more flexibility to express the problem at hand, not work around the shortcomings of the language. If I finish my version, I'll post it. I'm pretty determined to finish it right now, but my determination might waver as the day goes on. I'm having fun thinking about it, that much is certain.


----------



## FordGT90Concept (Jun 11, 2016)

I think I would do either n^2 tasks or n^2 * the-next-move tasks. That would spread the load out enough to prevent major bias.

The C code is clearly designed for performance, which is why it is written like a driver.


I'm really not in the mood for coding right now so go for it @Aquinus! 

Improving the multithreading of takaken's 2011 code would be the fastest route performance-wise, though.


----------



## BiggieShady (Jun 11, 2016)

Aquinus said:


> There are some things in the code that bother me, such as the "goto" statements.


Yeah, it gets weird when one implements recursive algorithms without actually using recursion... Clojure, as a functional programming language that emphasizes recursive algorithms, could be a better fit for this specific problem, but I don't know about performance, since it compiles to bytecode for the JVM. Nevertheless, I'm curious, so I say go for it.


----------



## stealth83 (Jun 11, 2016)




----------



## newtekie1 (Jun 11, 2016)

i7-4790K@4.6GHz





FX-8350@4.6GHz


----------



## JrockTech (Jun 11, 2016)

newtekie1 said:


> i7-4790K@4.6GHz
> 
> 
> FX-8350@4.6GHz



The 8350's HT link is supposed to sit around 2600, unless power-saving features are dropping the value.


----------



## Aquinus (Jun 11, 2016)

BiggieShady said:


> Yeah, it gets weird when one implements recursive algorithms without actually using recursion... Clojure, as a functional programming language that emphasizes recursive algorithms, could be a better fit for this specific problem, but I don't know about performance, since it compiles to bytecode for the JVM. Nevertheless, I'm curious, so I say go for it.


loop-recur does offer a non-stack-consuming, TCO-like way of solving the same problems, but that's not the main benefit. Immutable data means shared data under the hood, which means less duplication and less system memory consumed. For a problem like this, keeping the memory footprint down is important, because larger boards require more memory to manage each "possibility". I'm also using heuristics to find only the positions it can move forward from, so I'm not purely brute-forcing it either. I think a good language for expressing a problem is sometimes worth the loss in performance. C is important if you need response times under a millisecond, but I would argue that a good expressive language can do the same thing better, faster, and with less code.

Now that I've said that, I'm going to keep working on it, because I feel like I've set the bar kind of high for myself. 

Edit: Functional languages are nice because I can write my code like a math problem. I think OO and traditional imperative code don't express these kinds of problems well.


----------



## TheHunter (Jun 11, 2016)

Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 13.62

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs):  9.94

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs):  6.01

Using 8 thread(s).
Elapsed time (hh:mm:ss:cs):  5.14

@



The somewhat older Fritz chess benchmark is also interesting, but it gets the CPU a bit hotter... something like 3DMark Vantage physics hot.
https://www.chess.com/download/view/fritz-12-benchmark


----------



## JrRacinFan (Jun 11, 2016)




----------



## IRQ Conflict (Jun 12, 2016)

No old school results yet? Q9550 @3.5Ghz


Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 22.05

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 16.10

Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 10.28

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs):  9.71


----------



## FordGT90Concept (Jun 12, 2016)

Aquinus said:


> Edit: Functional languages are nice because I can write my code like a math problem. I think that OO and traditional imperative code doesn't express these kinds of problems well.


Both of my programs define Cell and Grid (aka board) objects. All of the logic is in the cell. Because of that, the recursion code is literally only a few lines. The problem is I can't do the goto...label statements takaken's 2011 code does (or at least I think I can't), so it becomes a memory monster.


----------



## Aquinus (Jun 12, 2016)

FordGT90Concept said:


> Both of my programs define Cell and Grid (aka board) objects. All of the logic is in the cell. Because of that, the recursion code is literally only a few lines. The problem is I can't do the goto...label statements takaken's 2011 code does (or at least I think I can't), so it becomes a memory monster.


That's exactly why OO languages are bad at expressing this kind of problem, IMHO. The reality is that objects are a bad way to conceptualize a problem, and they force us to make some bad decisions about how we write our code. I have an initial version done, but it needs some optimization. It's nowhere near as fast, but it keeps track of all of the unique solutions and implements several stages of queues that the workers pull from (always taking from the non-empty queue closest to completion). I also have several ideas to improve performance (outside of multi-threading; I've already done enough of that to eat up my 3820's compute capability).

Just a quick overview of how I attempted to tackle the problem:

Each task cascades from one queue to the next, carrying the spots still available for a given combination of queens. The available spots are calculated using set logic: they form a set, and I have a function that calculates the set of invalid moves for any given position. The difference between the available positions and the new invalid positions tells us whether we can make another move (if we're not at our target and there are no moves left, the job is done), and if we've hit our target, the finished data goes onto a "valid" queue, where another thread converts the list of queens into a set and adds those positions to a final set. Right now the invalid-move function is called every time, and it takes about 5-6 ms to run on my machine. Since none of these results ever change, I'm considering doing the calculations ahead of time, so the hot path becomes a mere hash-map lookup, which should be significantly faster.

Doing it this way is definitely heavier-weight and definitely isn't as fast. For what it's worth, an 8x8 board doesn't consume more than 1.2 GB (so far) using my method, and increasing the board size should mostly require more compute, not much more memory in comparison.

Either way, I have to finish it up and I have some optimization to do before it's ready for public testing.

This is what I have so far though: https://github.com/jrdoane/queens/blob/master/src/queens/core.clj

Edit: I just noticed that I can further reduce how many items "flow" up the series of queues by checking whether I've encountered a given set of queens before. The overhead of keeping track of that might be worth the speed-up, since it could reduce the number of computations upstream by a significant amount, but I'm not certain yet.

Edit 2: Using a pre-calculated two-level nested map improves the invalid-spot calculation from ~5-6 ms to ~0.05-0.125 ms. That change is a no-brainer, because the storage and up-front work required are minimal for such a huge gain.
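For comparison, the same precompute-once idea in C would be a table of attack bitmasks, one per square; looking up the mask replaces recomputing rows, columns, and diagonals on every move. This is only an illustration of the trick (Aquinus's actual code is Clojure and uses sets, not bitmasks; `build_attack_masks` is my name):

```c
#include <stdint.h>
#include <stdlib.h>

/* Precompute, for every square of an n*n board (n*n <= 64 so one 64-bit
 * mask covers the board), the set of squares a queen there attacks,
 * including its own square. The caller frees the returned array. */
uint64_t *build_attack_masks(int n)
{
    uint64_t *mask = calloc((size_t)(n * n), sizeof *mask);
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++) {
            uint64_t m = 0;
            for (int rr = 0; rr < n; rr++)
                for (int cc = 0; cc < n; cc++)
                    if (rr == r || cc == c ||
                        rr - cc == r - c || rr + cc == r + c)
                        m |= 1ULL << (rr * n + cc);     /* attacked square */
            mask[r * n + c] = m;
        }
    return mask;
}
```

During the search, "invalid positions for this queen" is then a single array lookup and an OR, which is the same constant-time trade described above.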


----------



## RealNeil (Jun 12, 2016)

i7-4770K box.


----------



## Ahhzz (Jun 12, 2016)

pretty neat. /tag


----------



## Enterprise24 (Jun 12, 2016)

TheHunter said:


> Using 1 thread(s).
> Elapsed time (hh:mm:ss:cs): 13.62
> 
> Using 2 thread(s).
> ...



Fritz Chess is not such a reliable benchmark if you use a chess engine in the real world. All chess engines scale very well with multiple cores (for example, the strongest engine in the world, Stockfish 7 x64 BMI2, can use 128 cores), but they can't utilize Hyper-Threading properly.
HT improves kilonodes per second by 30-40%, but the strength of the engine, measured in Elo rating, suffers (it searches wider but less deep).
That is why every chess engine manual says you should turn off HT.

I love overclocking and also love chess.


----------



## broken pixel (Jun 12, 2016)

https://www.chess.com/forum/view/general/best-cpu-for-chess-engine-game-analysis 


----------

