Intel i850 chipset components comprise the i82850 MCH and the 82801BA ICH with integrated UATA/100 and network controller.
The main difference is the i82850 MCH, which features a dual Rambus channel interface like the i840 chipset but triples the bandwidth of the chipset-to-CPU interface by using a quad data rate (QDR) protocol at a 100 MHz clock speed, giving an effective data rate of 400 Mbit per second per pin. With a 64-bit wide data path, the quad data rate interface therefore has a peak bandwidth of 3.2 GB/second.
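The peak figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is made up for this example, and the 133 MHz single-data-rate bus used for the i840 comparison is an assumption that the text above does not state.

```python
def peak_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, bus_width_bits):
    """Peak bandwidth = clock * transfers per clock * bus width in bytes."""
    return clock_hz * transfers_per_clock * bus_width_bits // 8


# i850 / P4 bus as described above: 100 MHz clock, quad data rate, 64-bit path.
P4_BUS = peak_bandwidth_bytes_per_s(100_000_000, 4, 64)

# Assumed comparison point (not from the article): a 133 MHz bus moving
# one transfer per clock over the same 64-bit data path.
ASSUMED_OLD_BUS = peak_bandwidth_bytes_per_s(133_000_000, 1, 64)

if __name__ == "__main__":
    print(f"P4 bus peak: {P4_BUS / 1e9:.1f} GB/s")
    print(f"Ratio to assumed 133 MHz bus: {P4_BUS / ASSUMED_OLD_BUS:.2f}x")
```

Per pin, the same numbers give 100 MHz x 4 transfers = 400 Mbit per second, and 64 such pins deliver 3.2 GB/second.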
We all know by now that peak bandwidth is rather irrelevant when it comes to actual performance. More critical are latencies, especially at higher bandwidth. Rambus memory is, unfortunately, stricken with very high latencies and, therefore, either the chipset itself or the CPU has to provide a means of compensating for the performance penalty. Intel is certainly not blind, and while the latency issues may be downplayed in public, they are certainly taken seriously in the background. One way to compensate for access penalties is to prefetch anticipated data and to incorporate a so-called in-order queue (IOQ) into the chipset.
The IOQ
What hides behind the term in-order queue is a small amount of cache within the chipset that serves as a pipeline to buffer outstanding transactions. In other words, we are looking at a prefetch buffer whose efficiency (since the data have to be in order) depends heavily on the locality and order of the arriving data.
The 4-level IOQ in the VIA 694X-based boards was the first instance where performance measurements were possible. Even with a relatively low-latency SDRAM interface, the performance delta between a four-transaction prefetch and no prefetch (IOQ level = 1) was on the order of 10% in graphics applications (Tyan Trinity review).
With the i850 MCH, Intel has taken the IOQ one step further by extending the depth to 8 levels, thereby allowing eight outstanding transactions to be prefetched. This buffer masks initial latencies, since the data have already been prefetched and are therefore available without penalty cycles.
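To make the notion of queue depth concrete, here is a toy model of a bounded in-order queue. It is a sketch for illustration only and does not describe Intel's or VIA's actual hardware; the class and method names are invented for this example.

```python
from collections import deque


class InOrderQueue:
    """Toy model: a bounded FIFO of outstanding transactions."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self._pending = deque()

    def issue(self, address):
        """Accept a transaction if a slot is free; return True on success."""
        if len(self._pending) >= self.depth:
            return False  # queue full, the requester has to wait
        self._pending.append(address)
        return True

    def complete(self):
        """Retire the oldest transaction (strict in-order), or None if empty."""
        return self._pending.popleft() if self._pending else None

    def outstanding(self):
        """Number of transactions currently in flight."""
        return len(self._pending)


if __name__ == "__main__":
    ioq = InOrderQueue(depth=8)  # depth 8 as on the i850; 4 on the VIA 694X
    accepted = [ioq.issue(addr) for addr in range(10)]
    print(accepted.count(True), "accepted,", accepted.count(False), "stalled")
```

With a depth of 1 the model degenerates to one transaction at a time, which corresponds to the non-prefetch case (IOQ level = 1) mentioned above.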
The drawback of an IOQ is that it depends on the locality of the data stored in system memory. Prefetching can be done only for data stored within the same page (or memory row), since it involves adjacent column addresses that can only be captured if the row is already open. On the other hand, since the P4 also has an extremely deep pipeline of 20 stages, the thought of complementing it with another prefetch buffer makes sense. If there is a page miss, the entire scheme is hosed anyway, so it doesn't matter too much if the IOQ needs to be cleared along with the CPU pipelines. Intel, however, claims approximately 80% page hits (of all memory accesses), depending on how well its branch prediction algorithms are working.
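How much the page hit rate matters can be sketched with a simple weighted average. The cycle counts below are purely hypothetical placeholders chosen for illustration, not measured i850 or RDRAM figures; only the 80% hit rate comes from Intel's claim quoted above.

```python
def average_access_cycles(page_hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per access for a given page hit rate."""
    if not 0.0 <= page_hit_rate <= 1.0:
        raise ValueError("page_hit_rate must be between 0 and 1")
    return page_hit_rate * hit_cycles + (1.0 - page_hit_rate) * miss_cycles


if __name__ == "__main__":
    HIT_CYCLES = 2    # hypothetical cost when the row is already open
    MISS_CYCLES = 20  # hypothetical cost of a page miss plus refill
    for rate in (0.8, 0.6, 0.4):
        avg = average_access_cycles(rate, HIT_CYCLES, MISS_CYCLES)
        print(f"hit rate {rate:.0%}: {avg:.1f} cycles on average")
```

Because the miss penalty dominates the sum, even a moderate drop in the hit rate raises the average access cost substantially, which is why the 80% figure deserves scrutiny.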
Whether this number is correct or not largely depends on the application. Unless software is specifically written or optimized for the SSE2 instructions of the P4, real-life numbers are certainly much lower, which is reason enough for some caution about the overall system performance of the P4. In other words, we are looking at a highly optimized system capable of delivering outstanding performance under optimized conditions, but as soon as anything goes wrong, the recovery latencies are quite brutal. There are a few design tricks to ameliorate the problems associated with the hyperpipelined architecture, which Intel refers to as the NetBurst Microarchitecture: Advanced Dynamic Execution (out-of-order execution), the Rapid Execution Engine (integer units running at double clock speed), the 400 MHz data bus and so on.
More articles on this subject from competent contemporaries, such as Paul DeMone's column, can be found on RealWorldTechnologies, so we will leave it at that here. Instead, let's jump right over to the candidate in question: the ASUS P4T.