What people are calling the tessellator on Direct3D 11 compatible hardware is in fact 3 different things: 1. the Hull Shader, 2. the Tessellator, 3. the Domain Shader.
To achieve the effect of tessellated geometry you have to use all 3 stages in the pipeline. The hardware tessellator in the ATI card is only item number 2, which sits between 2 new programmable shader stages. There is not enough info on the Nvidia card to be sure whether it really has a hardware tessellator or whether that stage is also executed on the programmable cores in software, as was implied until recently by Charlie.
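For anyone less used to how those three stages show up from the application side, here is a minimal C++ sketch (my own illustration, not from any card vendor's material) of binding them on a Direct3D 11 device context; the shader objects are assumed to have been compiled and created elsewhere:

#include <d3d11.h>

void BindTessellationPipeline(ID3D11DeviceContext* ctx,
                              ID3D11VertexShader*  vs,
                              ID3D11HullShader*    hs,   // stage 1: programmable
                              ID3D11DomainShader*  ds,   // stage 3: programmable
                              ID3D11PixelShader*   ps)
{
    // The tessellation stages consume patches, not plain triangles.
    ctx->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_3_CONTROL_POINT_PATCHLIST);

    ctx->VSSetShader(vs, nullptr, 0);
    ctx->HSSetShader(hs, nullptr, 0);   // Hull Shader: decides per-patch tessellation factors
    // Stage 2, the fixed-function tessellator, has no shader object to bind;
    // it runs automatically whenever a hull shader is active.
    ctx->DSSetShader(ds, nullptr, 0);   // Domain Shader: positions the newly generated vertices
    ctx->PSSetShader(ps, nullptr, 0);
}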
Seeing, in those benchmark graphs, a drop in performance when the mesh tessellation level is raised is not incompatible with the idea that the tessellator on the ATI card is working according to spec, and I'll try to explain why.
The main objective of tessellator usage is to avoid having to do heavy vertex interpolation on animated meshes, at joints with the many bones needed for realistic animation. The same applies to vertex interpolation when doing mesh morphing with lots of weights in facial animation. Those are heavy calculations that get multiplied many times over when using finer meshes with lots more fine detail. The growth is not linear: doubling the vertex count in both the U and V directions of a square patch roughly quadruples the number of vertices that need to be processed at later stages. This would become a bottleneck for several reasons: a) when meshes are far away, they don't need all that detail; b) most of the detail needed is on the silhouette of close-up objects. In the majority of triangles/quads facing the viewer, all those vertices generated by the tessellator, or by using finer-detail meshes, would be "lost" as redundant and unnecessary, taxing the geometry processing at the following stages. So something had to be devised, and the tessellator is the solution needed to scale better into the future.
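Just to put rough numbers on that quadratic growth, here is a tiny C++ sketch (my own illustration, not taken from any benchmark) counting the vertices of a regularly subdivided square patch; each doubling of the per-edge subdivision roughly quadruples the vertices the later stages would have to animate:

#include <cstdio>

// Vertices in a regular grid over a square patch with n segments per edge.
unsigned gridVertices(unsigned n) { return (n + 1) * (n + 1); }

int main()
{
    for (unsigned n = 4; n <= 64; n *= 2)
        std::printf("%2u segments per edge -> %5u vertices\n", n, gridVertices(n));
    return 0;
}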
Having said that, this does not necessarily mean that using the "tessellator" (all 3 stages of it) to tessellate coarser geometry in real time, producing the finer detail where it's needed and visible, will be "free" from a performance point of view, even if the stage-2 tessellator (the stage the ATI chip has implemented in hardware, and the only fixed-function stage of the three) is doing its job for "free". The explanation lies in the fact that for it to be active, there are 2 extra programmable stages that will be using the programmable cores of the chip: 1) to select where detail is or is not needed (the Hull Shader), and 2) to do, for instance, displacement mapping to add additional detail where it's really needed and visible (the Domain Shader).
When using tessellation there will be two additional programmable pipeline stages doing calculations AND the fixed-function tessellator in between the two. So, even if the middle one is doing its work "for free", with no performance penalty on the system, the other two (Hull and Domain Shaders) that form part of the tessellation system do have an impact on performance, because they compete for the global unified pool of programmable processing cores. That's not counting the additional bookkeeping or managing of FIFO queues for the newly generated vertices. It needs to be balanced, but nevertheless the cost of using the tessellator will be a lot less than sending a much more detailed, finer mesh and having to animate all those vertices that are irrelevant in most cases. This gives selective detail only when/where needed, and allows developers to ship the same meshes as assets and achieve various degrees of detail depending on the computing resources of each card, from low-end to high-end. Even if using the tessellator meant a drop to 1/4 of the FPS, pushing an equivalently fine pre-tessellated mesh through the pipeline instead would mean dropping to less than 1/16 of the FPS.
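To give an idea of what "selective" means in practice, here is a hypothetical C++ helper showing the kind of per-patch decision a Hull Shader makes: picking a tessellation factor from distance so that far-away patches get little or no extra geometry. The function name and the distance constants are made up for illustration; only the cap of 64 on tessellation factors comes from the D3D11 spec.

#include <algorithm>

// Hypothetical helper: choose a per-patch tessellation factor from the
// patch-to-camera distance. Nearby patches get up to the D3D11 maximum
// factor of 64; patches beyond farDist get factor 1 (no subdivision).
float tessFactorForDistance(float distance,
                            float nearDist  = 5.0f,   // assumed "full detail" distance
                            float farDist   = 100.0f, // assumed "no extra detail" distance
                            float maxFactor = 64.0f)  // D3D11 caps factors at 64
{
    float t = (distance - nearDist) / (farDist - nearDist);
    t = std::clamp(t, 0.0f, 1.0f);                      // requires C++17
    return 1.0f + (1.0f - t) * (maxFactor - 1.0f);      // linear falloff toward 1
}

In a real Hull Shader a decision like this runs per patch on the programmable cores, which is exactly the "not free" part of the argument above.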
From what I could understand of this presentation:
http://www.hardwarecanucks.com/forum...roscope-5.html
It seems that Nvidia's solution to tessellation means they are using a mostly software approach to the 3 stages of tessellation. In the "PolyMorph engine" a "Tessellator" is mentioned, but from what I read it seems to be hardware to improve vertex fetching. They seem to have gone from one vertex fetch per clock on the G200b to eight vertex fetches per clock, using that parallel vertex fetching mechanism. That will surely help with tessellation by keeping vertex fetches from becoming an immediate bottleneck on the system. And they will use that 8x speed-up in vertex fetching to do the intermediate, non-programmable tessellation stage in software, which will use some of the cores to do the work that is done in hardware in ATI's implementation.
To summarize:
ATI : 1 software + 1 hardware + 1 software tessellation stages
Nvidia : 1 software + 1 software + 1 software tessellation stages
Nvidia compensates for the lack of a dedicated hardware tessellator by widening the vertex fetch stage 8x in parallel relative to the previous generation, allowing eight new vertices to be processed per clock.
ATI seems to do it sequentially, one per clock (I might be wrong on this one), but does not have to allocate extra programmable cores to do the fixed-function part, freeing those cores to process pixels or other parts of the programmable pipeline stages.
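As a back-of-envelope illustration of what that 8x difference could mean at the front of the pipeline (the core clock here is an assumption for the sake of the arithmetic, not a published figure):

#include <cstdio>

int main()
{
    const double clockHz = 700e6; // assumed ~700 MHz core clock, illustration only
    const double fps     = 60.0;

    // Peak setup-limited vertex throughput per 60 Hz frame at 1 vs 8 vertices/clock.
    std::printf("1 vertex/clock  : %.1f M vertices per frame\n", clockHz * 1.0 / fps / 1e6);
    std::printf("8 vertices/clock: %.1f M vertices per frame\n", clockHz * 8.0 / fps / 1e6);
    return 0;
}

Whether either chip can actually keep the rest of the pipeline fed at anything near those rates is of course a separate question.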
Only when Fermi is released will comparisons with real usage scenarios be possible. Only time will tell which is the best approach to the problem, and it will all depend on price/performance/wattage, as has been said here.
But with either solution, it would not make sense to expect constant performance levels (FPS) independent of the tessellation level, since the programmable part of tessellation will always be there stealing computing resources from the other processing stages (at least if one wants to do "intelligent" selective tessellation rather than brute force).
In that sense, ATI Radeon tessellation is NOT broken in any way. Part of it is free, the other part is not. I guess that synthetic benchmarks like Unigine, which seem to indicate good performance on the Nvidia card, might be based on cranking up the tessellation load uniformly rather than selectively, i.e., with the Domain and Hull Shaders effectively idle, the way the tessellator on previous ATI Radeon generations used to work. ATI opted for selective tessellation, so a huge vertex processing increase might not be needed. Nvidia is recommending heavier tessellation because of the parallel (8x) vertex processing it implemented, but that might come at the cost of lower performance when doing heavy pixel computations or other work, because of the decrease in remaining computational cores.