Architecture
NVIDIA's new GT200 GPU is very similar to the G80/G9x architecture, but numerous improvements have been made.
The ten bigger blocks you see in the picture above are now called TPCs. Depending on whether the GPU runs in graphics processing or computing mode, this stands for "Texture Processing Cluster" or "Thread Processing Cluster". Even though the name means something different in each mode, the same silicon is used, just in a different way.
Compared to the G80, the number of clusters has been increased from eight to ten. The number of streaming multiprocessors per cluster (the big green blocks) went from two to three, while the number of streaming processors per multiprocessor (the small green blocks) has remained the same at eight. In total this results in almost double the number of stream processors: 240, vs. 128 on the last generation.
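The totals follow directly from multiplying the three counts given above. As a quick sanity check, here is a minimal sketch; the function and variable names are made up for illustration and are not NVIDIA terminology:

```python
def stream_processors(clusters, multiprocessors_per_cluster, processors_per_multiprocessor=8):
    """Total stream processors = clusters x multiprocessors x processors."""
    return clusters * multiprocessors_per_cluster * processors_per_multiprocessor

# Counts taken from the paragraph above.
g80_total = stream_processors(clusters=8, multiprocessors_per_cluster=2)
gt200_total = stream_processors(clusters=10, multiprocessors_per_cluster=3)

# Ratio between the two generations ("almost double").
increase = gt200_total / g80_total

print(g80_total, gt200_total, increase)
```

The ratio works out to 1.875, which is where "almost double" comes from.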
The render backend has also been beefed up somewhat. Instead of the four partitions (blue blocks) of the G92, the GT200 now uses eight. This effectively doubles the available memory bandwidth, but also requires double the memory chips on the board. Every partition has its own 64-bit wide memory interface, so the total memory bus width of this card is 512-bit. This dramatically increases the PCB cost, since routing 512 signal lines is a lot more complex than routing 256 lines. ATI is using GDDR5 on their future board designs, which offers double the bandwidth at the same bus width, while NVIDIA has remained with GDDR3 for now. I assume this might change in the near future, since GDDR5 has major potential to reduce the total board cost, heat output and power draw (eight memory chips need less power than 16).
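To make the bus width and bandwidth relationship concrete, here is a rough sketch. The 64-bit partition width comes from the text above; the per-pin data rate of 2000 MT/s is a purely hypothetical round number chosen for illustration, not the actual memory clock of any card, and the "GDDR5" case simply assumes double that rate as described above.

```python
PARTITION_WIDTH_BITS = 64  # from the text: each partition has a 64-bit interface

def bus_width_bits(partitions):
    """Total memory bus width for a given number of partitions."""
    return partitions * PARTITION_WIDTH_BITS

def bandwidth_gb_per_s(bus_bits, transfers_mt_per_s):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_bits / 8 * transfers_mt_per_s / 1000

HYPOTHETICAL_GDDR3_RATE = 2000  # MT/s per pin, assumed for illustration only

four_partition_bus = bus_width_bits(4)   # 256-bit
eight_partition_bus = bus_width_bits(8)  # 512-bit

narrow_gddr3 = bandwidth_gb_per_s(four_partition_bus, HYPOTHETICAL_GDDR3_RATE)
wide_gddr3 = bandwidth_gb_per_s(eight_partition_bus, HYPOTHETICAL_GDDR3_RATE)
# Memory with double the per-pin rate on the narrow bus (the GDDR5 argument).
narrow_gddr5 = bandwidth_gb_per_s(four_partition_bus, 2 * HYPOTHETICAL_GDDR3_RATE)

print(eight_partition_bus, narrow_gddr3, wide_gddr3, narrow_gddr5)
```

Under these assumptions, doubling the bus width and doubling the per-pin data rate arrive at the same peak bandwidth, which is why faster memory on a narrower bus is attractive from a board cost point of view.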
Other architectural changes include a larger register file, which basically allows more complex shaders or more efficient shader execution. Imagine you have to do a complex calculation with several variables, where each variable is stored in a register. Since the number of registers is limited, you might run out of them at some point during a complex calculation. In this case you would have to swap out one register to graphics memory, and accesses to graphics memory are much slower than accesses to registers. Once you are done with the complex part of your calculation you have to get that data out of graphics memory again, which is an additional performance hit.
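The effect of running out of registers can be illustrated with a toy cost model. This is only a sketch: the cycle counts are hypothetical placeholders (not measured GT200 latencies), and real hardware and compilers are far more sophisticated about which values get swapped out and when.

```python
def access_cost(live_variables, registers, register_cycles=1, memory_cycles=400):
    """Toy estimate of the cost of touching every variable once.

    Variables that do not fit in registers are "spilled": each one is
    written out to graphics memory once and read back once.
    The default cycle counts are hypothetical, for illustration only.
    """
    spilled = max(0, live_variables - registers)
    in_registers = live_variables - spilled
    return in_registers * register_cycles + spilled * 2 * memory_cycles

# The same 20-variable calculation with a small and a large register budget.
small_file_cost = access_cost(live_variables=20, registers=16)
large_file_cost = access_cost(live_variables=20, registers=32)

print(small_file_cost, large_file_cost)
```

Even with only four variables spilled, the slow memory round trips dominate the total in this model, which is the reason a larger register file helps complex shaders.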
An important feature for general computation on the GPU is the added support for 64-bit double precision floating-point numbers. Single precision floating-point numbers use 32 bits to store data, which might cause them to lose some precision during calculation. If you do a lot of calculations one after another, the error adds up and becomes more and more significant. While double precision is not infinitely precise either, it is as good as what is available in today's CPUs. Please note that application developers will actively need to change their code to benefit from this improvement.
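The accumulating error is easy to reproduce on a CPU. The sketch below adds 0.1 a hundred thousand times, once with every intermediate result rounded to 32-bit single precision and once in plain 64-bit double precision. It emulates single precision with Python's standard `struct` module, so it only illustrates the numerical effect and says nothing about how the GPU itself executes the math; the helper names are made up for this example.

```python
import struct

def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times.

    With single_precision=True, the step and every intermediate total
    are rounded to 32 bits, emulating a single precision accumulator.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total

expected = 10000.0  # 100,000 additions of 0.1
single_error = abs(accumulate(0.1, 100_000, single_precision=True) - expected)
double_error = abs(accumulate(0.1, 100_000, single_precision=False) - expected)

print(single_error, double_error)
```

Neither result is exact, since 0.1 cannot be represented exactly in binary, but the single precision error comes out many orders of magnitude larger than the double precision one.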