A Brief Architectural Overview: The Transition From G80 to GT200
With the G80 GPU, nVidia laid the groundwork for its next-generation GPU architecture. However, this first foray into the new architecture still left some headroom for tweaking and improvement. The GT200 GPU takes the architecture to the next level by changing the ratio between the individual building blocks and by increasing the overall number of "processing units" of every kind. There are two excellent articles on the subject on RealWorldTech and TomsHardwareGuide, so we will only give the abridged version here.
With the GT200 series of GPUs, nVidia introduces an evolution of the classic SIMD (single instruction, multiple data) model, named in this case SIMT for single instruction, multiple threads. The key difference from the SIMD architecture is that the vector being processed has no predefined width. For a better understanding, consider the previous architecture: the rasterizer generated quads, that is, blocks of 2 x 2 pixels, each pixel defined by a four-component single-precision floating-point (RGBA) vector, for a total of 16 values per quad. The ALU then processed the quads in 16-way SIMD mode, that is, the same instruction was applied to all 16 values or, as in the case of the GF6 and GF7 series, a second instruction could be processed simultaneously through a co-issue feature. This type of data alignment is referred to as AoS, short for array of structures, where each quad is a structure.
G80 - GT200 architecture comparison, courtesy of RWT.
In contrast, the GT200 design takes a total of 8 quads or 32 pixels, called a “warp”, and processes them in SoA (structure of arrays) mode. That is, instead of processing RGBA-RGBA-RGBA, the new sequence is RRRRRRRRGGGGGGGG … and so on. The key improvement is higher efficiency of the SIMT units, since efficiency no longer depends on all four components of a pixel receiving the same instruction.
With respect to the “macro structure”, the changes from the G80 to the GT200 are primarily quantitative in nature. The number of texture processor clusters (TPCs) has increased from 8 to 10. In addition, each TPC now contains three streaming multiprocessors (SMs) plus one texture unit, whereas each TPC on the G80 featured only two streaming multiprocessors.
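To make the difference between the AoS and SoA layouts more concrete, here is a minimal Python sketch. It is purely illustrative and says nothing about how the hardware stores data internally: the function names (`make_aos`, `aos_to_soa`, `scale_channel`) and the pixel values are made up for this example, and only the warp size of 32 pixels comes from the text.

```python
# Illustrative only: models the data layouts, not the actual GPU hardware.
WARP_SIZE = 32  # 8 quads of 2 x 2 pixels, as described in the text


def make_aos(num_pixels):
    """Array of structures: one (R, G, B, A) tuple per pixel (made-up values)."""
    return [(float(i), i + 0.25, i + 0.5, 1.0) for i in range(num_pixels)]


def aos_to_soa(pixels):
    """Structure of arrays: one list per color channel (RRRR..., GGGG..., ...)."""
    r, g, b, a = (list(channel) for channel in zip(*pixels))
    return {"R": r, "G": g, "B": b, "A": a}


def scale_channel(soa, channel, factor):
    """Apply a single operation to the same component of every pixel in the warp."""
    return [value * factor for value in soa[channel]]


aos = make_aos(WARP_SIZE)                   # RGBA-RGBA-RGBA ...
soa = aos_to_soa(aos)                       # RRRR... GGGG... BBBB... AAAA...
dimmed_red = scale_channel(soa, "R", 0.5)   # one instruction, 32 red values
```

In the SoA form, an operation that only touches the red channel works on 32 values at once, regardless of what happens to the other three components of each pixel.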

GT200 die shot, courtesy of RWT.
The streaming multiprocessors themselves have also been reworked internally, each now being capable of holding 1024 active threads as opposed to the 768 threads on the G80’s SMs. In terms of warps, this results in an increase from 24 to 32 warps (of 32 threads each) per SM. At the same time, the number of registers has doubled from 8192 to 16384, which increases the number of registers per thread from roughly 10 to 16. In other words, each thread can now use 16 registers simultaneously, an increase of 50 to 60% over the G80 architecture, depending on whether the G80 figure of 10.67 is rounded down to 10. Another improvement in the GT200 architecture is the dual-issue mode, which is really parallel processing across two separate units, in this case the FPU and a special function unit (SFU). The result is reduced latency, achieved by interleaving MUL and MAD operations across the SFU and the FPU, for example as a preamble to running long transcendental instructions.
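The figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is just a back-of-the-envelope calculation: the dictionary keys and the `summarize` helper are made up for this example, and the GT200 register count is taken as 16384, that is, exactly twice the G80's 8192.

```python
# Back-of-the-envelope check of the per-SM figures quoted in the text.
WARP_SIZE = 32  # threads per warp

G80 = {"tpcs": 8, "sms_per_tpc": 2, "threads_per_sm": 768, "registers_per_sm": 8192}
# Assumption: 16384 registers per SM, i.e. double the G80 figure.
GT200 = {"tpcs": 10, "sms_per_tpc": 3, "threads_per_sm": 1024, "registers_per_sm": 16384}


def summarize(gpu):
    """Derive SM count, warps per SM, registers per thread and threads in flight."""
    sms = gpu["tpcs"] * gpu["sms_per_tpc"]
    return {
        "sms": sms,
        "warps_per_sm": gpu["threads_per_sm"] // WARP_SIZE,
        "registers_per_thread": gpu["registers_per_sm"] / gpu["threads_per_sm"],
        "threads_in_flight": sms * gpu["threads_per_sm"],
    }


for name, gpu in (("G80", G80), ("GT200", GT200)):
    print(name, summarize(gpu))
```

Under these assumptions, the G80 works out to 16 SMs with 24 warps each, and the GT200 to 30 SMs with 32 warps each. The register budget per thread grows from about 10.7 to exactly 16.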