# Multi Core PI @ LINPACK



## ovidiutabla (Feb 9, 2013)

I developed a multithreaded CPU benchmark that calculates decimals of PI using the Bailey–Borwein–Plouffe formula. The benchmark uses a multithreaded algorithm written in C++ and provides excellent parallelism. Multi Core PI is written in Visual C++ using MFC and the Win32 API.

*How it works*

A slider lets you set the number of decimals of PI, from 10.000 to 100.000. The default is 80.000. Just hit the Run benchmark button to start benching your CPU.

*Submit to HWBOT*

First, press the Take Screenshot button. A screenshot and an XML datafile will be created. Attention! CPU-Z must be running!
Second, follow the link provided in the dialog and submit your datafile to HWBOT.

*Supported operating systems*

Microsoft Windows XP / Server 2003
Microsoft Windows Vista / 7
Microsoft Windows 8 / Server 2012

*Download link*

http://www.pcgamingxtreme.ro/


----------



## HammerON (Feb 10, 2013)

My results:






100% 12 thread utilization


----------



## Aquinus (Feb 10, 2013)

HammerON said:


> 100% 12 thread utilization



I've explained this multiple times already and people seem too ignorant to listen, and you're the last person I should need to explain this to.

Disable hyper-threading and run it again, please.


----------



## HammerON (Feb 10, 2013)

Wow - that was amazing 
My time was increased by almost 100%....  What else would I expect when disabling HT???


----------



## Aquinus (Feb 10, 2013)

HammerON said:


> http://img.techpowerup.org/130210/Capture113.jpg
> 
> Wow - that was amazing
> My time was increased by almost 100%....  What else would I expect when disabling HT???



Well that confuses me even more. I disable HT on mine and my score goes from 18.5 to 19. :|

HT should never result in 100% improvement. There aren't the resources available to let it scale like that. That should be more like a 15-30% drop in performance on average.

Edit: I lied, that was Multi Core PRIME, not MC PI. They look exactly the same sans the formula, so I didn't notice it off the bat. My skepticism from PRIME worked its way over here. Either way, I disabled HT and now it runs slower by about 60%. That's a bit more normal. I'm less skeptical about this benchmark and more about the prime one (unless you're storing the output in a float or a double and not a fixed point number, in which case the computer is chugging for nothing), since floating point numbers are not exact and the precision of further decimals decreases as you go more decimals in.
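To put a number on the floating-point concern: a 64-bit double carries only about 15 to 17 significant decimal digits, so it could never hold tens of thousands of decimals of PI by itself. A minimal Python sketch follows; the reference string and the helper `common_prefix_len` are made up for this illustration and say nothing about how the benchmark stores its result.

```python
import math
import sys
from decimal import Decimal

# First 30 decimals of pi, typed in as a reference string (assumption: used
# only for comparison here, not taken from the benchmark).
PI_REFERENCE = "3.141592653589793238462643383279"

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Decimal(float) expands the exact binary value stored in the double.
stored = str(Decimal(math.pi))
agree = common_prefix_len(stored, PI_REFERENCE)

print("digits a double guarantees:", sys.float_info.dig)
print("stored value of math.pi   :", stored)
print("characters matching pi    :", agree)
```

The stored value drifts away from PI after the first 15 or so decimals, which is why a long PI run has to keep its result in integer or arbitrary-precision form.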

4c w/ HT:




4c w/o HT:




Once again, is the output being verified? Can you do multiple runs per benchmark to make sure that every run's results are consistent? And once again, I would like output so I can verify the benchmark's results and put my skepticism at ease. As it stands, something is happening on my rig and I don't know what it is or if it is right.
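The kind of check being asked for can be sketched in a few lines of Python: run the same workload several times, refuse to report a score if the outputs differ, and show the spread of timings. `workload` is a hypothetical stand-in, not the benchmark's actual kernel.

```python
import statistics
import time

def workload() -> int:
    """Hypothetical stand-in for one benchmark pass; returns a checkable result."""
    return sum(i * i for i in range(200_000))

def run_repeated(fn, runs: int = 3):
    """Run fn several times; return (result, timings) or raise on disagreement."""
    results, timings = [], []
    for _ in range(runs):
        start = time.perf_counter()
        results.append(fn())
        timings.append(time.perf_counter() - start)
    if any(r != results[0] for r in results):
        raise RuntimeError("runs disagree: output is not reproducible")
    return results[0], timings

result, timings = run_repeated(workload, runs=3)
print("result:", result)
print("best: %.4f s, median: %.4f s" % (min(timings), statistics.median(timings)))
```

Publishing the result next to the time would let anyone compare it against a known-good value.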


----------



## uuuaaaaaa (Feb 10, 2013)

My Phenom II x6 is slow


----------



## AphexDreamer (Feb 10, 2013)

uuuaaaaaa said:


> My Phenom II x6 is slow



Faster than my FX6100 apparently...


----------



## Aquinus (Feb 10, 2013)

AphexDreamer said:


> Faster than my FX6100 apparently...
> http://img.techpowerup.org/130209/MultiCorePIScreenShot.jpeg



It's because the FPU is getting used for this benchmark. Keep in mind that each module only has one FPU, so without FMA3 optimizations you're only going to see 3 cores' worth of performance out of it. However, if this used fixed point instead of floating point, it could use the integer cores, which are faster in general and perform significantly better on AMD's newer processors. Fixed point also offers a higher level of precision; floating point is inaccurate because of how it converts decimals to and from base 2.
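The base-2 conversion issue is easy to demonstrate. The Python sketch below compares a plain float sum with a toy fixed-point representation (integers scaled by a power of ten); `SCALE_DIGITS` and `to_fixed` are names invented for the example, and the parser only handles non-negative inputs.

```python
# Binary floating point: 0.1 and 0.2 have no exact base-2 representation.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)   # False on IEEE 754 doubles

# Fixed point: store values as integers scaled by 10**SCALE_DIGITS.
SCALE_DIGITS = 4          # assumption: four decimal places, chosen for the demo
SCALE = 10 ** SCALE_DIGITS

def to_fixed(text: str) -> int:
    """Parse a non-negative decimal string such as '0.1' into a scaled integer."""
    whole, _, frac = text.partition(".")
    frac = (frac + "0" * SCALE_DIGITS)[:SCALE_DIGITS]
    return int(whole) * SCALE + int(frac)

fixed_sum = to_fixed("0.1") + to_fixed("0.2")
print(fixed_sum == to_fixed("0.3"))  # True: pure integer addition
```

Fixed point is exact only for values that fit the chosen scale, and whether the integer units would actually run this benchmark faster is a separate question the sketch does not answer.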


----------



## AphexDreamer (Feb 10, 2013)

Aquinus said:


> It's because the FPU is getting used for this benchmark. Keep in mind that each module only has one FPU, so without FMA3 optimizations you're only going to see 3 cores' worth of performance out of it. However, if this used fixed point instead of floating point, it could use the integer cores, which are faster in general and perform significantly better on AMD's newer processors. Fixed point also offers a higher level of precision; floating point is inaccurate because of how it converts decimals to and from base 2.



Which is why I had asked him if he would/could make a more FX-optimized benchmark, but he said it is FX optimized as it was coded with an FX processor. http://www.techpowerup.com/forums/showpost.php?p=2842045&postcount=68


----------



## uuuaaaaaa (Feb 10, 2013)

AphexDreamer said:


> Faster than my FX6100 apparently...
> http://img.techpowerup.org/130209/MultiCorePIScreenShot.jpeg



I wasn't expecting this. You also have a much higher clock.


----------



## Bo$$ (Feb 10, 2013)

Maybe looks a little low here


----------



## Melvis (Feb 10, 2013)




----------



## CrackerJack (Feb 10, 2013)




----------



## Arctucas (Feb 10, 2013)




----------



## LAN_deRf_HA (Feb 10, 2013)




----------



## lemonadesoda (Feb 10, 2013)

Great x86 kernel 5.x compatible!

I think a REALLY USEFUL statistic would be the time / cores / GHz so that we can see the "efficiency" of the FP core!
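One way to read that suggestion is to scale the run time by core count and clock speed, so that results from different CPUs land on a comparable "work per core-cycle" footing. A minimal Python sketch, assuming that interpretation; the function name `core_ghz_seconds` is invented here.

```python
def core_ghz_seconds(time_s: float, cores: int, ghz: float) -> float:
    """Run time scaled by core count and clock; lower means more work per core-cycle."""
    return time_s * cores * ghz

# Figures quoted later in the thread for an i5 3330: 54 s at 80k decimals, 4 cores, 3 GHz.
score = core_ghz_seconds(54.0, cores=4, ghz=3.0)
print(score)  # 648.0
```

The number ignores turbo, HT and memory speed, so it is only a rough comparison aid.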


----------



## ovidiutabla (Feb 11, 2013)




----------



## ovidiutabla (Feb 14, 2013)

I have implemented encryption for the XML datafile! Now cheaters can't cheat anymore.

Current version is 2.101

*Download link:*

http://www.pcgamingxtreme.ro/


----------



## Deleted member 74752 (Feb 14, 2013)




----------



## Deleted member 74752 (Feb 14, 2013)




----------



## cadaveca (Feb 14, 2013)

Aquinus said:


> It's because the FPU is getting used for this benchmark. Keep in mind that each module only has one FPU, so without FMA3 optimizations you're only going to see 3 cores' worth of performance out of it. However, if this used fixed point instead of floating point, it could use the integer cores, which are faster in general and perform significantly better on AMD's newer processors. Fixed point also offers a higher level of precision; floating point is inaccurate because of how it converts decimals to and from base 2.



PD emulates x87 entirely, hence the slowdown, IMHO. FPU doesn't matter when you aren't capable of running the instruction in the first place.


----------



## ovidiutabla (Feb 15, 2013)

*I removed the slider. *

Default setting for benchmark is 80.000 decimals. The target is to submit to HWBOT and we have to make sure that all users are benching at the same settings [80k decimals]

*Download Link:*

www.pcgamingxtreme.ro


----------



## Aquinus (Feb 15, 2013)

cadaveca said:


> PD emulates x87 entirely, hence the slowdown, IMHO.


Pardon me, I know what x87 is but I don't know what you mean when you say "PD", could you clarify?


cadaveca said:


> FPU doesn't matter when you aren't capable of running the instruction in the first place.


I agree but do we know that the benchmark isn't executing x87 instructions in the first place?

Also floating point emulation is worse than just using floating point numbers to begin with. You really need the exact value if you want your result of pi to be at all accurate. As that decimal place goes further out you're going to start losing precision.


----------



## ovidiutabla (Feb 17, 2013)

Aquinus said:


> Pardon me, I know what x87 is but I don't know what you mean when you say "PD", could you clarify?
> 
> I agree but do we know that the benchmark isn't executing x87 instructions in the first place?
> 
> Also floating point emulation is worse than just using floating point numbers to begin with. You really need the exact value if you want your result of pi to be at all accurate. As that decimal place goes further out you're going to start losing precision.



The application is compiled using *Streaming SIMD Extensions 2 (/arch:SSE2)* setting in order to replace FPU instructions with SSE code.


----------



## Aquinus (Feb 17, 2013)

t.phase said:


> The application is compiled using *Streaming SIMD Extensions 2 (/arch:SSE2)* setting in order to replace FPU instructions with SSE code.



SSE still utilizes the FPU, but that answers part of my question. I'm still curious what Cadaveca meant by "PD" though.


----------



## ovidiutabla (Mar 14, 2013)

New download link:

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## mlee49 (Mar 14, 2013)

Happy Pi day friends!!!


----------



## ovidiutabla (Apr 23, 2013)

New look:







Download link:

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## spectatorx (Apr 23, 2013)

New download link with id=666 xD Evil benchmark! 

Going to check it out.

OK, program tested. All 4 cores were at 100% usage during the test, so it is a really good job. And I hope you will bring the slider back. Here is my score:






As for the incorrect CPU speed detection, I would suggest running a small load while detecting the speed so that the readings come out right.


----------



## de.das.dude (Apr 23, 2013)

why no decimal sliderrr?????


----------



## ovidiutabla (Apr 24, 2013)

*Ultimate Multi Core PI*

The graphical user interface has been rewritten in C# and WPF 4.0 [the application needs .NET Framework 4.0].

The core of the benchmark [calculation algorithm and parallelization] was moved to a DLL written in Visual C++ so that benchmark results are not altered at all.

*New look:*






*Download link:*

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## ovidiutabla (Apr 24, 2013)

de.das.dude said:


> why no decimal sliderrr?????



Because we want to be sure that all users are benching with the same settings... Any requests besides the decimal slider?


----------



## Aquinus (Apr 24, 2013)

t.phase said:


> Because we want to be sure that all users are benching with the same settings... Any requests besides the decimal slider?



Having just a handful of presets might be preferable if people want some other options.


----------



## Melvis (Apr 25, 2013)

The latest version crashes when I try to run it!


----------



## ovidiutabla (Apr 25, 2013)

I uploaded a newer version of the application, that will display the error message. Please try now.

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## ovidiutabla (Apr 25, 2013)

*New look:*






*Download link:*

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## ovidiutabla (Apr 30, 2013)

de.das.dude said:


> why no decimal sliderrr?????





Aquinus said:


> Having just a handful of presets might be preferable if people want some other options.



*HERE YOU GO SIRE*

Custom Benchmark! 

You have a slider now, so you can choose the number of decimals for PI. Up to 360.000! For HWBOT submission, you must run the benchmark with the default setting of 80k decimals.






*Download link*

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## Arctucas (Apr 30, 2013)

I gave the new version a try at 360,000, but it never completes.


----------



## ovidiutabla (May 2, 2013)

Arctucas said:


> I gave the new version a try at 360,000, but it never completes.



It's going to take a very, very long time to complete with 360.000 decimals. It will complete eventually, just leave the benchmark running. It's exponential complexity. For 10k decimals it takes 0 sec 800 ms, for 20k decimals 2 sec 900 ms... and for 80k decimals 54 sec. CPU: i5 3330 @ 3 GHz, 4 cores.
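The three timings quoted above are enough for a rough growth estimate. The Python sketch below assumes the run time follows a power law, time ~ n^p, fits p from the smallest and largest sample, and extrapolates to 360k decimals. The power-law model is an assumption for illustration, not a statement about the actual algorithm.

```python
import math

# Timings quoted above for an i5 3330 (decimals, seconds).
samples = [(10_000, 0.8), (20_000, 2.9), (80_000, 54.0)]

def growth_exponent(a, b) -> float:
    """Exponent p in time ~ n**p implied by two (n, seconds) samples."""
    (n1, t1), (n2, t2) = a, b
    return math.log(t2 / t1) / math.log(n2 / n1)

p = growth_exponent(samples[0], samples[-1])

def extrapolate(n: int) -> float:
    """Estimated seconds for n decimals, anchored on the largest sample."""
    n_ref, t_ref = samples[-1]
    return t_ref * (n / n_ref) ** p

print("exponent: %.2f" % p)
print("360k estimate: %.0f s" % extrapolate(360_000))
```

With these samples the exponent comes out close to 2, which points to roughly quadratic rather than exponential growth and puts a 360k run on this CPU at something on the order of 20 minutes.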


----------



## KapiteinKoek007 (May 2, 2013)

Here's my score. Although I must say, with my 2600K at the default 3.5 GHz and 8 threads it's about 7 sec faster than with HT off at 4.6 GHz... what's that about?


----------



## Random Murderer (May 2, 2013)

KapiteinKoek007 said:


> Here's my score. Although I must say, with my 2600K at the default 3.5 GHz and 8 threads it's about 7 sec faster than with HT off at 4.6 GHz... what's that about?
> 
> http://i42.tinypic.com/2vwtzme.jpg



It's multi-threaded. 8 threads at 3.5 GHz > 4 threads at 4.6 GHz in multithreaded apps (that can utilize 8 threads) because you can get much more throughput.
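A back-of-the-envelope model makes the comparison concrete. The Python sketch below treats throughput as cores times clock, scaled by an assumed HT gain; the two gain values are taken from numbers mentioned earlier in the thread (the usual 15-30% estimate and the roughly 60% measured with this benchmark), and the model itself is a deliberate oversimplification.

```python
def relative_throughput(cores: int, ghz: float, smt_gain: float = 0.0) -> float:
    """Crude model: cores x clock, scaled up by the assumed gain from SMT/HT."""
    return cores * ghz * (1.0 + smt_gain)

ht_off = relative_throughput(cores=4, ghz=4.6)

# Two assumed HT gains: 30% (upper end of the usual estimate quoted earlier)
# and 60% (roughly what was measured with this benchmark earlier in the thread).
for gain in (0.30, 0.60):
    ht_on = relative_throughput(cores=4, ghz=3.5, smt_gain=gain)
    print("gain %.0f%%: HT on %.1f vs HT off %.1f" % (gain * 100, ht_on, ht_off))
```

Under these assumptions the stock-clock HT run only comes out ahead if HT is worth well over 30% in this workload, which is in line with the large HT gains reported here.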


----------



## newtekie1 (May 2, 2013)

I clicked on the link to download it and it just says "You've been permanently banned from the forums"...

Guess I don't get to try it.

NVM:  Used a proxy to get it...which kind of shows the pointlessness of using IP bans.


----------



## KapiteinKoek007 (May 2, 2013)

With HT on at 4.6 GHz. But why should we turn off HT? More and more games and software are being designed with multithreaded optimization in mind.


----------



## newtekie1 (May 2, 2013)

Here is mine.  The correct frequency is 3.6GHz though:






*Possible bug:* I use multiple monitors. When I clicked Submit to HWBOT to get the screen shot, it took a screenshot of my primary monitor, which didn't have the Multi Core Pi window on it.

Also, why is it so slow?  I'd expect a multi-threaded Pi program to calculate Pi a heck of a lot faster than this.  Or did you intentionally make it slow because it is meant to be a benchmark?


----------



## KapiteinKoek007 (May 2, 2013)

newtekie1 said:


> I clicked on the link to download it and it just says "You've been permanently banned from the forums"...
> 
> Guess I don't get to try it.
> 
> NVM:  Used a proxy to get it...which kind of shows the pointlessness of using IP bans.



Maybe you were banned for another reason, or someone used your IP as a mask to do some crazy stuff...


----------



## newtekie1 (May 2, 2013)

KapiteinKoek007 said:


> Maybe you were banned for another reason, or someone used your IP as a mask to do some crazy stuff...



Never been on those forums before in my life.  Ah the life of having a dynamic IP...


----------



## KapiteinKoek007 (May 2, 2013)

newtekie1 said:


> Never been on those forums before in my life.  Ah the life of having a dynamic IP...



Hmm, that's strange then. Try to contact the website's admin or mods to rectify the problem.


----------



## Aquinus (May 2, 2013)

Works well except the CPU Frequency is wrong. I set power options to performance to force the CPU clock to 4.5Ghz, but it doesn't make a difference.


----------



## Random Murderer (May 2, 2013)

Aquinus said:


> Works well except the CPU Frequency is wrong. I set power options to performance to force the CPU clock to 4.5Ghz, but it doesn't make a difference.
> 
> http://www.techpowerup.com/forums/attachment.php?attachmentid=50987&stc=1&d=1367517500



It probably doesn't read straps, so it's seeing 100x36 rather than 125x36.
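The arithmetic behind that guess, as a tiny Python sketch. The helper name `core_clock_mhz` is made up; the strap and multiplier values are the ones quoted above.

```python
def core_clock_mhz(bclk_mhz: float, multiplier: int) -> float:
    """Core clock = base clock (strap-adjusted) x CPU multiplier."""
    return bclk_mhz * multiplier

assumed = core_clock_mhz(100.0, 36)  # what a tool sees if it assumes the 100 MHz strap
actual = core_clock_mhz(125.0, 36)   # with the 125 MHz strap applied

print(assumed, actual)  # 3600.0 4500.0
```

The second figure matches the 4.5 GHz the CPU was actually set to.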


----------



## Aquinus (May 2, 2013)

Random Murderer said:


> It probably doesn't read straps, so it's seeing 100x36 rather than 125x36.



Very true. Still a bug nonetheless.


----------



## Arctucas (May 2, 2013)

t.phase said:


> It's going to take a very, very long time to complete with 360.000 decimals. It will complete eventually, just leave the benchmark running. It's exponential complexity. For 10k decimals it takes 0 sec 800 ms, for 20k decimals 2 sec 900 ms... and for 80k decimals 54 sec. CPU: i5 3330 @ 3 GHz, 4 cores.



I got 28 seconds, 30 ms for 80K on the previous version.

What would you estimate my time should be for 360K on the new version?

I let it run for approximately 10 minutes with no result.


EDIT:

I feel rather sheepish, I should let it run a few more seconds rather than being impatient:


----------



## newtekie1 (May 2, 2013)

Arctucas said:


> I got 28 seconds, 30 ms for 80K on the previous version.
> 
> What would you estimate my time should be for 360K on the new version?
> 
> I let it run for approximately 10 minutes with no result.



Just by my quick math based on the times I'm getting as I increase, you're looking at over an hour to complete 360,000 decimal places.


----------



## Mydog (May 2, 2013)

Had to try this one 

ok for a 24/7 summer OC


----------



## ovidiutabla (May 3, 2013)

newtekie1 said:


> Just by my quick math based on the times I'm getting as I increase, you're looking at over an hour to complete 360,000 decimal places.



Something like that. Just leave the benchmark running...


----------



## Arctucas (May 3, 2013)

So, is something wrong with result above?


----------



## Mydog (May 3, 2013)

Tested 360.000 decimals with HT


----------



## newtekie1 (May 3, 2013)

Arctucas said:


> So, is something wrong with result above?



Apparently not, because it took my x6 about 20 minutes to finish.

I guess it doesn't scale exactly exponentially like I thought.


----------



## unclewebb (May 4, 2013)

Thanks for the multi-threaded benchmark.


----------



## Aquinus (May 4, 2013)

I did a couple tests with my 3820 and threw the results into an OpenOffice spreadsheet to make some graphs out of it. Enjoy if anyone cares. 

It almost looks to me as if it completes in O(n log n) time as far as how many decimals per second get calculated on average for any given decimal length but the increasing number of elements is creating a linear increase in times, so it almost feels like something O(n + n log n) or O((n + n) log n) time if I were to take a guess. I'm not really up for getting more data and doing the math to confirm my hunch. That's also for just my 3820 with 4c/8t, I'm sure it scales differently on different hardware.


----------



## ovidiutabla (May 4, 2013)

Very nice work sire.

Metro UI style:






Download link:

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## Aquinus (May 4, 2013)

I feel that I should also note that crunching will get my CPU up to 72-74*C, but even for 360k decimals my CPU barely broke 62*C fully loaded with this. Just an observation, because crunching for the same amount of time makes that much more heat despite both applications loading the CPU to 100%.


----------



## newtekie1 (May 4, 2013)

Crunching likely uses more areas of the CPU, different instruction sets, better use of the cache, etc. because crunching is designed to be as efficient as possible.  This benchmark, by contrast, seems to be purposely inefficient, making the calculation take a lot longer than it should in order to get results that are more suited to a benchmark (several seconds instead of several ms).

Also, for the LOLs:


----------



## ovidiutabla (May 6, 2013)

UI Update [logo with alpha channel]


----------



## ovidiutabla (May 10, 2013)

newtekie1 said:


> Crunching likely uses more areas of the CPU, different instruction sets, better use of the cache, etc. because crunching is designed to be as efficient as possible.  This benchmark, by contrast, seems to be purposely inefficient, making the calculation take a lot longer than it should in order to get results that are more suited to a benchmark (several seconds instead of several ms).
> 
> Also, for the LOLs:
> 
> http://www.techpowerup.com/forums/attachment.php?attachmentid=51011&stc=1&d=1367677366



The benchmark is using a very complex formula to calculate decimals of PI.

*Bailey–Borwein–Plouffe formula*



> The Bailey–Borwein–Plouffe formula (BBP formula) provides a spigot algorithm for the computation of the nth binary digit of pi (symbol: π) using base 16 math.
> 
> The formula can directly calculate the value of any given digit of π without the need to calculate the preceding digits.
> 
> The BBP is a summation-style formula that was discovered in 1995 by Simon Plouffe and was named after the authors of the paper in which the formula was published, David H. Bailey, Peter Borwein, and Simon Plouffe. Before that paper, it had been published by Plouffe on his own site.[1]



*The formula is:*
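For reference, the BBP series in its usual published form (written out here in LaTeX):

```latex
\pi = \sum_{k=0}^{\infty} \frac{1}{16^{k}}
      \left( \frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6} \right)
```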






The algorithm is very complex and slow, but I chose it because it's best suited for parallelization.

The whole idea was to develop a perfect multithreaded benchmark that can make use of all the cores available, not to implement the fastest algorithm to calculate PI.
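Why BBP splits so cleanly across cores can be shown with a short Python sketch: every hexadecimal digit position is computed independently of the others, so positions can simply be handed out to a worker pool. This is the textbook double-precision digit-extraction scheme, not the benchmark's code; the function names are invented, and how Multi Core PI turns its work into decimal digits is not covered here.

```python
from concurrent.futures import ThreadPoolExecutor

def _series(j: int, n: int) -> float:
    """Fractional part of the sum over k of 16**(n-k) / (8k + j)."""
    s = 0.0
    for k in range(n + 1):                       # terms with non-negative exponent
        d = 8 * k + j
        s = (s + pow(16, n - k, d) / d) % 1.0    # modular power keeps numbers small
    k = n + 1
    while True:                                  # rapidly shrinking tail
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0

def pi_hex_digit(n: int) -> str:
    """Hex digit of pi at position n after the point (n = 0 is the first)."""
    x = (4 * _series(1, n) - 2 * _series(4, n) - _series(5, n) - _series(6, n)) % 1.0
    return "0123456789ABCDEF"[int(x * 16)]

def pi_hex_digits(count: int, workers: int = 4) -> str:
    """Compute `count` hex digits, one independent task per position."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return "".join(pool.map(pi_hex_digit, range(count)))

print(pi_hex_digits(8))  # 243F6A88
```

In CPython the thread pool only illustrates the task split, since the interpreter lock keeps pure-Python threads from running in parallel; a process pool or a native implementation would be needed to load all cores. Each position also costs more the further out it is, which is worth keeping in mind when dividing work evenly.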



> *The BBP formula for π*
> 
> The original BBP π summation formula was found in 1995 by Plouffe using PSLQ. It is also representable using the P function above:
> 
> ...





> *y-cruncher* is the first efficient and publicly available Pi-calculator that can sustain *a near 100%* cpu load on multi-core computers.
> 
> There are other multi-threaded Pi-programs that can achieve high cpu usage, *but few of them can sustain it through an entire Pi computation*.
> 
> Below is a typical CPU utilization graph of y-cruncher when computing 1 billion digits of Pi across 8 cores.





> As of 2010, *I am not aware of any Pi-program that achieves perfect parallelism* for small computations and is at least half the speed of y-cruncher.



In 2013, meet Multi Core PI, sire. Perfect parallelism for any number of decimals.



> (It's easy to get perfect parallelism if you artificially make the task really slow.)



I did NOT artificially make the task really slow; in fact, I didn't do anything that slows down the algorithm.

Sure, the Multi Core PI algorithm was not optimized for speed, but it provides *perfect parallelism*, and that was the whole idea:


----------



## newtekie1 (May 10, 2013)

Thanks for the explanation. 

I wasn't knocking you, you achieved exactly what you set out to do and it makes a great benchmark.


----------



## ovidiutabla (May 16, 2013)

*Multi Core LINPACK Ultimate*

Meet Multi Core LINPACK Ultimate! 

A multithreaded CPU benchmark that performs numerical linear algebra. It makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.

The benchmark is written in C# / WPF [The User Interface] and C++ [The Core Algorithm] and provides excellent parallelism.






*How it works*

The default setting for the benchmark is a matrix size of 4000. Just hit the <Run benchmark> button to start benching your CPU.
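For a feel of what a LINPACK-style run does at that size, here is a small Python sketch: a dense solve by Gaussian elimination with partial pivoting, plus the conventional LINPACK operation count of 2/3·n³ + 2·n². It is plain Python for readability and does not use BLAS, so it says nothing about how the benchmark's own kernel is written.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (pure Python)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def linpack_flops(n: int) -> float:
    """Conventional LINPACK operation count for an n x n solve."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2

x = solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(x)  # close to [0.8, 1.4]
print("%.1f GFLOP per run at matrix size 4000" % (linpack_flops(4000) / 1e9))
```

Dividing that operation count by the measured time gives the usual GFLOPS figure.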

*Submit to HWBOT*

First, press the <Submit to HWBOT> button. A screenshot of the entire screen and an encrypted XML datafile will be created. Attention! CPU-Z must be running!
Second, follow the link provided in the dialog and submit your datafile to HWBOT.

*HWBOT*

http://hwbot.org/benchmark/multi_core_linpack_ultimate/

*Supported operating systems*

Microsoft Windows XP / Server 2003
Microsoft Windows Vista / 7
Microsoft Windows 8 / Server 2012

*Website*

http://www.pcgamingxtreme.ro/multi-core-linpack-ultimate/

*Download Link*

http://www.pcgamingxtreme.ro/forum/download/file.php?id=690


----------



## ovidiutabla (May 16, 2013)

*UI Update*






*Download link*

http://www.pcgamingxtreme.ro/forum/download/file.php?id=690


----------



## Arctucas (May 16, 2013)




----------



## Feänor (May 16, 2013)

Poor little g540...


----------



## TRWOV (May 19, 2013)




----------



## cheesy999 (May 19, 2013)

Feanor said:


> Poor little g540...



I suspect this benchmark might like Intel processors a little bit more than AMD. Unless I'm reading it wrong.


----------



## agent00skid (May 19, 2013)

Seems to run fine on my AMD processor.


----------



## TRWOV (May 19, 2013)

cheesy999 said:


> I suspect this benchmark might like Intel processors a little bit more than AMD. Unless I'm reading it wrong.
> 
> http://img.techpowerup.org/130519/benchmark.png



It does give inconsistent results, I'll give you that. agent00skid's A6-3500 APU gets better times than your unlocked X4, and it's a triple-core. I thought it might be related to instruction sets, but the Phenom II and Llano support the same instructions.

Maybe memory bandwidth plays a role too?


edit: Maybe your X4 is throttling? Watch the CPU-Z readout while the benchmark is running.

BTW OP, can we have a logo? Seeing the dull standard EXE icon on the desktop isn't cool.


----------



## agent00skid (May 19, 2013)

My N830 at 1,5 GHz in my laptop took twice as long. So on my end, it seems to scale appropriately.


----------



## cheesy999 (May 20, 2013)

TRWOV said:


> Maybe memory bandwidth plays a role too?
> 
> BTW OP, can we have a logo? Seeing the dull standard EXE icon on the desktop isn't cool.



I'm on single channel, we should explore this.


----------



## ovidiutabla (Jun 6, 2013)

Unified benchmark. Multi Core PI @ LINPACK.

Native UI [no more .Net / WPF]. Only Visual C++ / MFC / Win32API.

*Final Release*







*Download link*

http://www.pcgamingxtreme.ro/forum/download/file.php?id=666


----------



## Arctucas (Jun 6, 2013)




----------



## Punisher! (Jun 7, 2013)

t.phase said:


> Unified benchmark. Multi Core PI @ LINPACK.
> 
> Native UI [no more .Net / WPF]. Only Visual C++ / MFC / Win32API.
> 
> ...



It doesn't work at all here. It always tells me that I have to do the right benchmark to send it to HWBot but I *AM DOING THE RIGHT DAMN ONE*. No screenshot, no datafile... really confused (result @output is ok)!


----------



## Aquinus (Jun 7, 2013)

I like the newer look. It feels more professional to me and less like Metro. 

I don't know how I feel about the calendar there though. Other than showing you the current date, I don't know how it's relevant to benchmarking, considering you can't do anything with it either.

I would say ditch the side bar with the calendar.

Other than being more aesthetically pleasing (at least for me,) I don't feel that the usability or the way the information is being presented makes the application any easier or harder to use so from a UI standpoint I'm indifferent.


----------



## Punisher! (Jun 7, 2013)

NP! I think no one was using it to send scores because of that bug?

Tonight I will test it .


----------



## ovidiutabla (Oct 23, 2013)

Software update:











New download link:

http://www.pcgamingxtreme.ro/forum/download/file.php?id=701


----------



## CrackerJack (Dec 6, 2013)

Typo: Ellapsed = Elapsed
And is that milliseconds?


----------



## Arctucas (Dec 6, 2013)

ovidiutabla said:


> What do you think about the new UI:





Which version is that?


----------



## Arctucas (Dec 8, 2013)

Thanks.


----------



## ovidiutabla (Mar 4, 2014)

*Multi Threaded PI @ LINPACK v6.0*











*New Download Link*

PI:
https://www.dropbox.com/s/zqgja04a9i4ihyh/Multi Threaded PI Ultimate.zip

LINPACK:
https://www.dropbox.com/s/8v4cm0jf13j5wln/Multi Threaded LINPACK Ultimate.zip

*Video Presentation*


----------



## klva80 (Mar 7, 2014)

Ohh, the links are down


----------



## klva80 (Mar 7, 2014)

o its already fixed :_)


----------

