To demonstrate the performance difference, a Quad Core system was tested against a Core 2 Extreme using CineBench. CineBench renders a cinematic scene and measures the time it takes to complete. Since this application is optimized to take advantage of multiple CPUs, it clearly demonstrates the performance lead: the quad core system completed the whole benchmark in a bit more than half the time of the dual core system. It has to be mentioned, though, that single-threaded applications will not benefit from four cores; they might actually run slower because of the reduced operating frequency. Interestingly, the Quad Core CPU system started rendering a little after the Dual Core (the delay was about 1 second) - this could be due to memory throughput limitations or the fact that it takes Windows longer to start four threads.
This is Intel's roadmap for the near future. As you can see, after 65nm the next process size will be 45nm. According to Intel, their whole 45nm production process is exactly on schedule, with the first fabs getting ready to output CPUs.
What you can also see from this roadmap is that all 45nm CPUs, codenamed Penryn and Nehalem, will feature four cores (on one die), even in the regular versions.
A non-Extreme Quad Core CPU called Core 2 Quad is scheduled for early 2007, which means around CeBIT time. When asked whether there will be a single-die quad core design on the 65nm process, Intel said that there are no such plans because 45nm production is quite ready.
Some people are worried that the existing 1066 MHz FSB may not be fast enough to feed data to four cores in one package. In performance testing using the "galgel" benchmark, Intel found that even four cores do not saturate the available 8.5 GB/s of the 1066 bus.
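For readers wondering where the 8.5 GB/s figure comes from, here is a quick back-of-the-envelope sketch. It assumes the front side bus is 64 bits wide and performs 1066 million transfers per second; neither number was spelled out in the presentation, and the function name is just made up for this illustration.

```python
def peak_bandwidth_gb_s(transfers_per_second: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (decimal: 1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


# Assumption: 1066 million transfers per second on a 64-bit wide data bus.
fsb_1066 = peak_bandwidth_gb_s(1066e6, 64)
print(f"1066 FSB peak: {fsb_1066:.1f} GB/s")
```

This is a theoretical peak only; real-world throughput is lower because of protocol overhead and bus arbitration.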
According to Intel, the performance increase when going from Dual Core to Quad Core is about 70% when using the SPECint benchmarking suite. As mentioned before, the actual speedup is highly application dependent.
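To get a feeling for how much the speedup depends on the application, here is a small sketch based on Amdahl's law. Intel did not mention Amdahl's law, and it is a simplified model that ignores memory and bus contention, so treat this as a rough illustration under that assumption rather than an explanation of the SPECint number. The function names are invented for the example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def dual_to_quad_gain(parallel_fraction: float) -> float:
    """How many times faster four cores are than two, per Amdahl's law."""
    return amdahl_speedup(parallel_fraction, 4) / amdahl_speedup(parallel_fraction, 2)


# Sweep from fully serial (0.0) to perfectly parallel (1.0) workloads.
for p in (0.0, 0.5, 0.9, 1.0):
    gain_percent = (dual_to_quad_gain(p) - 1.0) * 100
    print(f"parallel fraction {p:.1f}: {gain_percent:.0f}% faster on four cores")
```

In this model a fully serial program gains nothing from the extra two cores, a perfectly parallel one doubles its speed, and a workload that is roughly 90% parallel lands close to the 70% Intel quoted.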
For their Xeon MP platform Intel has a new design coming up in Q3 2007 with the Tigerton/Caneland platform. Today's quad core MP platform "Truland" has two front side buses, which may limit the available bandwidth to the CPUs (but didn't Intel show just two slides ago that this isn't happening?). The solution is to have one FSB going to each CPU, which means there are four independent FSBs on the chipset with a total transfer rate of up to 34 GB/s. This also means you have to improve the memory subsystem to be able to handle that much data.
Intel uses FB-DIMMs here that allow a total throughput of up to 32 GB/s. There was a very technical presentation about the strategies involved to achieve this bandwidth, but I will not bore you with the details.
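The two headline numbers are easy to put side by side. The sketch below assumes each of the four buses delivers the same 8.5 GB/s peak quoted earlier for the 1066 bus, which is my assumption but does line up with the 34 GB/s total; the variable names are made up for the illustration.

```python
# Assumption: every FSB runs at the 8.5 GB/s peak quoted for the 1066 bus.
FSB_PEAK_GB_S = 8.5
NUM_FSBS = 4
MEMORY_PEAK_GB_S = 32.0  # FB-DIMM total throughput quoted by Intel

aggregate_fsb = FSB_PEAK_GB_S * NUM_FSBS
memory_share = MEMORY_PEAK_GB_S / aggregate_fsb

print(f"Aggregate FSB peak: {aggregate_fsb:.0f} GB/s")
print(f"Memory can cover {memory_share:.0%} of that peak")
```

So on paper the memory subsystem nearly keeps up with all four buses running flat out, although these are peak figures on both sides.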
A special feature to keep performance at an acceptable level with that many cores (up to 16) is the "Snoop Filter", which helps with the cache efficiency of the whole system. The problem with the caches is that you somehow have to make sure that changes in the data are synchronized with all the other cores. For a more in-depth read on snoop cache, go here. Even though the article talks about older CPUs, the general concept still applies.
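The basic idea can be sketched in a few lines. This is a toy model and not Intel's implementation: it assumes the filter keeps a table of which CPUs may hold each cache line, so a write only has to notify those CPUs instead of broadcasting to everyone. The class and method names are invented for the illustration.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory that remembers which CPUs may hold each cache line."""

    def __init__(self, num_cpus: int):
        self.num_cpus = num_cpus
        self.holders = defaultdict(set)  # cache line address -> set of CPU ids
        self.snoops_sent = 0
        self.snoops_avoided = 0

    def read(self, cpu: int, line: int) -> None:
        # The reading CPU now holds a copy of the line.
        self.holders[line].add(cpu)

    def write(self, cpu: int, line: int) -> set:
        # Only CPUs that actually hold the line need to be told to invalidate it.
        others = self.holders[line] - {cpu}
        self.snoops_sent += len(others)
        # Without the filter, every other CPU would have been snooped.
        self.snoops_avoided += (self.num_cpus - 1) - len(others)
        self.holders[line] = {cpu}
        return others


sf = SnoopFilter(num_cpus=4)
sf.read(0, 0x100)
sf.read(1, 0x100)
sf.read(2, 0x200)
invalidated = sf.write(0, 0x100)  # only CPU 1 holds a stale copy
print(invalidated, sf.snoops_sent, sf.snoops_avoided)
```

In this example the write by CPU 0 triggers one snoop instead of three, which is the kind of bus traffic the filter is meant to save. Real hardware also has to deal with limited table size and different cache line states, which this sketch leaves out.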