LOSTCIRCUITS
Intel Pentium4 "Prescott" Strained to the Silicon
(Review by MS, Feb. 1, 2004)
L2 Cache
The L2 cache of the Prescott core has been increased to a unified 1 MB, 8-way set-associative writeback cache organized in 128-byte lines. The larger cache leaves more room for a variety of prefetch mechanisms, which decrease memory latencies by using idle bus periods. Software prefetch instructions can now not only prefetch data but also initiate page table walks and fill the data translation lookaside buffer (DTLB) if the prefetch access targets a page that is not currently cached in the TLB. This way it is possible to prefetch page table entries into the DTLB. The software prefetch instructions are complemented by a hardware prefetching scheme. The advantage of any hardware prefetcher is that it works independently of the software: it analyzes data streams and predicts the next target.
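As a quick worked example, the geometry quoted above (1 MB, 8-way, 128-byte lines) fixes the number of sets and the way an address would split into offset, index and tag bits. The sketch below only does that arithmetic. The function name and the 36-bit address width are assumptions chosen for illustration, not figures taken from this review.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=36):
    """Derive set count and address bit split for a set-associative cache.

    addr_bits is an assumed physical address width, used only to show
    how many bits would be left over for the tag.
    """
    lines = size_bytes // line_bytes              # total cache lines
    sets = lines // ways                          # lines are grouped into sets of `ways`
    offset_bits = line_bytes.bit_length() - 1     # log2, valid for powers of two
    index_bits = sets.bit_length() - 1            # log2, valid for powers of two
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


# Numbers from the text: 1 MB, 8-way, 128-byte lines.
prescott_l2 = cache_geometry(1024 * 1024, ways=8, line_bytes=128)
print(prescott_l2)
```

With these inputs the cache holds 8192 lines in 1024 sets, so 7 address bits select the byte within a line and 10 bits select the set.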
HyperThreading Enhancements
Previous P4 iterations faced a number of constraints. For example, the number of simultaneously outstanding stores was limited to 24; this has now been increased to 32, which in turn made it necessary to increase the write combine buffers from 6 to 8. Likewise, the number of outstanding loads (reads) that are L1 cache misses has increased from 4 to 8. For any single processor system this may hardly matter. With two logical processors enabled via HyperThreading, however, the old limits could cause real performance problems.
HyperThreading also generates an entirely new set of issues related to the early identification of cache hits by a partial address match. What it comes down to is that the two logical processors can and will generate similar sets of partial addresses, which then compete for the same cache location and cause what is called cache contention. The primary issue is that cache content has historically been identified by its address tag, and the more processors are running in the same system, the higher the probability that identical partial address fragments are generated even if the target locations are far away from each other. To avoid this type of cache contention, a context identifier bit has been added. It can be dynamically turned on or off, e.g. to allow data sharing between the two logical processors in case the partial address matches are not so partial after all.
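The aliasing problem can be pictured with a toy model. This is a minimal sketch and not Intel's actual lookup logic: the 16-bit partial tag width, the function names and the sample addresses are all assumptions made up for illustration. It only shows how two far-apart addresses can look identical to a partial comparison, and how including a context bit in the comparison separates them.

```python
PARTIAL_BITS = 16  # assumption: partial tag width chosen for illustration only


def partial_tag(addr):
    """Keep only the low-order bits that the early hit check compares."""
    return addr & ((1 << PARTIAL_BITS) - 1)


def partial_match(addr_a, ctx_a, addr_b, ctx_b, use_context_bit):
    """Early hit check on partial tags.

    With use_context_bit set, entries belonging to different logical
    processors never match. With it cleared, the context is ignored,
    which models the two threads sharing data.
    """
    same_partial = partial_tag(addr_a) == partial_tag(addr_b)
    if use_context_bit:
        return same_partial and ctx_a == ctx_b
    return same_partial


# Two hypothetical addresses far apart that share their low 16 bits.
thread0_addr = 0x10001234
thread1_addr = 0x7FFF1234

false_hit = partial_match(thread0_addr, 0, thread1_addr, 1, use_context_bit=False)
separated = partial_match(thread0_addr, 0, thread1_addr, 1, use_context_bit=True)
print(false_hit, separated)
```

In this model the context bit is simply an extra term in the comparison, so turning it off restores the old behavior in which both logical processors see each other's entries.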
SSE3 Instructions
MMX, SSE, SSE2 and SSE3 instructions have one thing in common: they are instructions designed to reduce the number of instructions. In the case of Prescott, thirteen new instructions have been added that target exactly this goal. Probably the most effective example is the FISTTP or Bobbit instruction, which converts x87 floating point values into integers. Where the conventional conversion consults the floating point control word (FCW) to determine the rounding mode, Bobbit ignores the FCW completely and always uses chop (truncation) as its rounding mode. An extreme example would be:
Code without SSE3:
Code with SSE3:
This may not be the most elegant and precise solution but, for certain, it is one of the fastest imaginable.
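The difference between a conversion that honors the rounding mode and one that always chops can be sketched in a few lines. This is only a model of the arithmetic, not x87 code: the function names are hypothetical, and the first function assumes the FCW is left at the x87 default of round-to-nearest with ties going to even.

```python
import math


def convert_with_default_fcw(x):
    """Model of a conversion that follows the rounding mode in the FCW.

    Assumes the default mode, round to nearest with ties to even,
    which is also what Python's round() does.
    """
    return round(x)


def convert_with_chop(x):
    """Model of a FISTTP-style conversion: always truncate toward zero."""
    return math.trunc(x)


for value in (2.7, -2.7, 2.5, 3.5):
    print(value, convert_with_default_fcw(value), convert_with_chop(value))
```

Languages such as C define float-to-integer casts as truncation, so a compiler targeting plain x87 has to switch the FCW to chop and back around every such conversion. An instruction that always chops removes that detour.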
Graphics Optimizations
Common SSE instructions do not work well with current graphics. The reason lies in the very nature of SSE, which applies the same instruction to different data. Graphics data are usually organized as vector4 structures, meaning that the vertices can be described as Arrays of Structures (AOS) where each structure contains different scalar values (x, y, z, a) that are not subjected to the same calculations. Therefore, SSE or SSE2 instructions cannot play out their full potential in this type of operating environment.
If, however, the vertices are reordered so that the calculations are performed on a "by scalar" instead of a "by vertex" basis, meaning that all "x" values within a given range of vertices are processed by the same instruction, followed by all "y" values and so on, the efficiency of the SSE instructions will increase substantially. This type of data layout is referred to as Structure of Arrays (SOA), as opposed to the AOS scheme outlined above. In reality, instead of reordering the data, a "horizontal floating point addition / subtraction" scheme is easier to implement, and this is another part of the added SSE3 instruction set.
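The two approaches can be compared with a small model that treats a Python list of four floats as one SSE register. It is a sketch under assumptions: the helper functions imitate the lane behavior of the MULPS, ADDPS and HADDPS instructions, while the dot product workload, the sample vertices and the names dot4_aos, dot4_soa and aos_to_soa are made up for illustration.

```python
def mulps(a, b):
    """Lane-wise multiply of two 4-float registers."""
    return [x * y for x, y in zip(a, b)]


def addps(a, b):
    """Lane-wise add of two 4-float registers."""
    return [x + y for x, y in zip(a, b)]


def haddps(dst, src):
    """Horizontal add: sums adjacent pairs within each operand."""
    return [dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3]]


def dot4_aos(vertex, w):
    """Dot product of one (x, y, z, a) vertex kept in AOS layout."""
    p = mulps(vertex, w)
    h = haddps(p, p)      # [p0+p1, p2+p3, p0+p1, p2+p3]
    h = haddps(h, h)      # every lane now holds the full sum
    return h[0]


def aos_to_soa(vertices):
    """Reorder four (x, y, z, a) vertices into four per-component arrays."""
    xs, ys, zs, as_ = (list(c) for c in zip(*vertices))
    return xs, ys, zs, as_


def dot4_soa(xs, ys, zs, as_, w):
    """Four dot products at once on SOA data, using only vertical ops."""
    acc = mulps(xs, [w[0]] * 4)
    acc = addps(acc, mulps(ys, [w[1]] * 4))
    acc = addps(acc, mulps(zs, [w[2]] * 4))
    acc = addps(acc, mulps(as_, [w[3]] * 4))
    return acc


verts = [(1.0, 2.0, 3.0, 4.0), (0.0, 1.0, 0.0, 1.0),
         (2.0, 2.0, 2.0, 2.0), (1.0, 0.0, 0.0, 0.0)]
w = (5.0, 6.0, 7.0, 8.0)

aos_result = [dot4_aos(v, w) for v in verts]
soa_result = dot4_soa(*aos_to_soa(verts), w)
print(aos_result, soa_result)
```

Both paths produce the same four results. The SOA path needs the data reshuffled first but then uses every lane for useful work, while the horizontal add lets the AOS path stay in its original layout at the cost of two extra reduction steps per vertex.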
The above was a simplified, abridged version of the highlights of the Prescott changes that will lead to improved performance in the right applications. The big question, however, is how it will perform in the average garden-variety applications that everybody is using, especially when compared to the Northwood. And where does the Prescott stand with respect to the competition in the form of the phalanx of AMD processors?