 Intel Pentium4 "Prescott"
Strained to the Silicon
(Review by MS, Feb. 1, 2004)
Prescott Enhancements

L2 Cache

The Prescott core's L2 cache has grown to 1 MB. It is a unified, write-back, 8-way set-associative cache organized in 128-byte lines. The larger cache supports several prefetch mechanisms that reduce memory latency by exploiting idle bus periods. Software prefetch instructions can now not only prefetch data but also initiate page table walks and fill the data translation lookaside buffer (DTLB) when the prefetch targets a page whose translation is not currently cached in the TLB. In this way, page table entries themselves can be prefetched into the DTLB. The software prefetch instructions are complemented by a hardware prefetcher, which has the advantage of working independently of software: it analyzes data streams and predicts the next target from them.
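To make the cache geometry concrete, the following sketch derives the number of sets from the figures quoted above (1 MB, 8 ways, 128-byte lines) and splits an address into tag, set index and line offset. It is only the textbook set-associative arithmetic; the function name `split_address` is made up for illustration, and the real hardware may index the cache differently.

```python
# Textbook set-associative arithmetic using the figures from the text.
CACHE_BYTES = 1 << 20   # 1 MB L2
WAYS = 8                # 8-way set associative
LINE_BYTES = 128        # 128-byte lines

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 1024 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 7 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1          # 10 bits select the set


def split_address(addr):
    """Split an address into (tag, set index, line offset).

    Hypothetical helper: assumes the plain modulo indexing found in
    textbooks, which may not match the actual Prescott implementation.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(NUM_SETS, OFFSET_BITS, INDEX_BITS)
    print(split_address(0x12345678))
```

Under this model, two addresses that are exactly 128 KB apart (1024 sets times 128 bytes) land in the same set and differ only in their tag, so up to eight such lines can coexist before one has to be evicted.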


HyperThreading Enhancements

Previous Pentium 4 iterations faced a number of resource constraints. The number of simultaneously outstanding stores, for example, was limited to 24; it has now been increased to 32, which in turn made it necessary to increase the number of write-combining buffers from 6 to 8. Likewise, the number of outstanding loads (reads) that miss the L1 cache has increased from 4 to 8. For any single-processor system this may hardly matter; with two logical processors enabled via HyperThreading, however, these limits could cause real performance problems.
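A toy calculation shows why such a limit can be invisible to one thread yet bite with two. The limits of 4 and 8 outstanding L1 load misses come from the text; the per-thread demand of 3 concurrent misses is a purely hypothetical number chosen for illustration, and the model ignores everything else that happens in the pipeline.

```python
# Toy model: how many L1 load misses can actually be in flight.
OLD_LIMIT = 4   # outstanding L1 load misses before Prescott (from the text)
NEW_LIMIT = 8   # outstanding L1 load misses on Prescott (from the text)

# Hypothetical assumption: each thread could keep 3 misses in flight
# if the hardware imposed no limit at all.
DEMAND_PER_THREAD = 3


def misses_in_flight(threads, limit, demand_per_thread=DEMAND_PER_THREAD):
    """Misses that can overlap, capped by the shared hardware limit."""
    return min(threads * demand_per_thread, limit)


if __name__ == "__main__":
    for threads in (1, 2):
        old = misses_in_flight(threads, OLD_LIMIT)
        new = misses_in_flight(threads, NEW_LIMIT)
        print(f"{threads} logical processor(s): old={old}, new={new}")
```

With these assumed numbers, a single thread never reaches either limit, while two logical processors want six misses in flight and are held to four by the old limit but not by the new one.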

HyperThreading also creates an entirely new set of issues related to the early identification of cache hits by a partial address match. The two logical processors can and will generate similar sets of partial addresses, which then compete for the same cache location and cause what is called cache contention. The underlying problem is that cache contents have historically been identified by the address tag alone, and the more processors run in the same system, the higher the probability that identical partial address fragments are generated even when the target locations are far apart. To avoid this type of contention, a context identifier bit has been added. It can be turned on or off dynamically, for example to allow data sharing between the two logical processors when the partial address matches are not so partial after all.
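The effect of the context identifier bit can be sketched with a small model of the early hit check. The 12-bit width of the partial tag, the function names and the example addresses are all assumptions made for illustration; the article does not state which address bits take part in the partial match.

```python
# Toy model of early hit detection by partial address match.
PARTIAL_TAG_BITS = 12   # hypothetical width, not taken from the article


def partial_tag(address):
    """Low-order address fragment used for the early comparison (assumed)."""
    return address & ((1 << PARTIAL_TAG_BITS) - 1)


def early_hit(addr_a, ctx_a, addr_b, ctx_b, context_bit_enabled):
    """Report an early hit if the partial tags match.

    When the context identifier bit is enabled, accesses from different
    logical processors (ctx 0 and ctx 1) are never matched against each other.
    """
    if partial_tag(addr_a) != partial_tag(addr_b):
        return False
    if context_bit_enabled and ctx_a != ctx_b:
        return False
    return True


if __name__ == "__main__":
    a, b = 0x10000ABC, 0x7FF00ABC   # far apart, same low 12 bits
    print(early_hit(a, 0, b, 1, context_bit_enabled=False))  # false match
    print(early_hit(a, 0, b, 1, context_bit_enabled=True))   # filtered out
    print(early_hit(a, 0, a, 1, context_bit_enabled=False))  # shared data
```

The third call shows why the bit has to be switchable: when both logical processors really do work on the same data, the match is genuine and filtering it out would prevent sharing.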

SSE3 Instructions

MMX, SSE, SSE2 and SSE3 instructions have one thing in common: they are designed to reduce the number of instructions needed for a given task. Prescott adds thirteen new instructions with exactly this goal. Probably the most effective example is FISTTP, also known as the "Bobbit" instruction, which converts x87 floating-point values into integers. The conventional conversion consults the floating-point control word (FCW) to determine the rounding mode; FISTTP ignores the FCW completely and always uses chop (truncation) as its rounding mode. An extreme example would be:

Code without SSE3:

Code with SSE3:

This may not be the most elegant and precise solution but, for certain, it is one of the fastest imaginable.
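The difference between the two conversions can be modeled in a few lines. This is not the listing from the original review but a sketch under stated assumptions: the class `X87Model` and its instruction counter are hypothetical, while the rounding-control field of the FCW (bits 10 and 11, with both bits set meaning chop) and the default control word 0x037F follow the x87 architecture. The "without SSE3" path mirrors the save, modify, convert, restore sequence that truncating conversions typically required.

```python
import math

RC_MASK = 0x0C00      # rounding-control field of the FCW (bits 10-11)
RC_NEAREST = 0x0000   # round to nearest, ties to even (x87 default)
RC_DOWN = 0x0400      # round toward minus infinity
RC_UP = 0x0800        # round toward plus infinity
RC_CHOP = 0x0C00      # truncate toward zero


class X87Model:
    """Hypothetical, heavily simplified x87 model for illustration only."""

    def __init__(self):
        self.fcw = 0x037F   # default control word: round to nearest
        self.ops = 0        # number of modeled instructions executed

    def fnstcw(self):
        self.ops += 1
        return self.fcw

    def fldcw(self, value):
        self.ops += 1
        self.fcw = value

    def fistp(self, x):
        """Convert to integer using the rounding mode selected in the FCW."""
        self.ops += 1
        rc = self.fcw & RC_MASK
        if rc == RC_NEAREST:
            return round(x)        # Python rounds ties to even as well
        if rc == RC_DOWN:
            return math.floor(x)
        if rc == RC_UP:
            return math.ceil(x)
        return math.trunc(x)

    def fisttp(self, x):
        """SSE3: always truncate, regardless of the FCW."""
        self.ops += 1
        return math.trunc(x)


def to_int_without_sse3(fpu, x):
    saved = fpu.fnstcw()           # save the current control word
    fpu.fldcw(saved | RC_CHOP)     # switch the rounding mode to chop
    result = fpu.fistp(x)          # convert
    fpu.fldcw(saved)               # restore the original rounding mode
    return result


def to_int_with_sse3(fpu, x):
    return fpu.fisttp(x)           # one instruction, FCW untouched
```

In this model the conventional path costs four instructions per conversion, two of them control word loads, against a single FISTTP, and both paths return the same truncated value.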

Graphics Optimizations

Common SSE instructions do not work well with current graphics data. The reason lies in the very nature of SSE, which applies the same instruction to multiple data elements. Graphics data are usually organized as vector4 structures: the vertices are stored as Arrays of Structures (AOS), where each structure contains different scalar values (x, y, z, a) that are not subjected to the same calculations. SSE and SSE2 instructions therefore cannot reach their full potential in this kind of environment.

If, however, the vertices are reordered so that the calculations are performed "by scalar" rather than "by vertex", meaning that all "x" values within a given range of vertices are processed by the same instruction, followed by all "y" values and so on, the efficiency of the SSE instructions increases substantially. This layout is referred to as Structure of Arrays (SOA), as opposed to the AOS scheme outlined above. In practice, a "horizontal floating point addition / subtraction" scheme is easier to implement than reordering the data, and it is another part of the added SSE3 instruction set.
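Both approaches can be sketched with plain lists standing in for 4-float SSE registers. The helper names (`aos_to_soa`, `mulps`, `haddps`, `dot4_aos`) and the sample vertices are made up for illustration; `haddps` models the lane arithmetic of the SSE3 HADDPS instruction, which adds adjacent pairs within each source register. A dot product is used here as one assumed example of a per-vertex calculation that needs a sum across the lanes of a single register.

```python
# AOS: one (x, y, z, a) record per vertex, as the data usually arrive.
vertices_aos = [
    (1.0, 2.0, 3.0, 1.0),
    (4.0, 5.0, 6.0, 1.0),
    (7.0, 8.0, 9.0, 1.0),
    (10.0, 11.0, 12.0, 1.0),
]


def aos_to_soa(vertices):
    """Reorder per-vertex records into one array per scalar component."""
    xs, ys, zs, as_ = (list(column) for column in zip(*vertices))
    return {"x": xs, "y": ys, "z": zs, "a": as_}


def mulps(a, b):
    """Model of a packed multiply: four independent lane products."""
    return [p * q for p, q in zip(a, b)]


def haddps(a, b):
    """Model of SSE3 HADDPS: sums of adjacent pairs from a, then from b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def dot4_aos(v, w):
    """Dot product of two vector4 values that stay in AOS layout.

    One packed multiply followed by two horizontal adds leaves the sum
    of all four products in every lane; lane 0 is returned.
    """
    products = mulps(v, w)
    pairs = haddps(products, products)
    return haddps(pairs, pairs)[0]


if __name__ == "__main__":
    soa = aos_to_soa(vertices_aos)
    print(soa["x"])                       # all x values, ready for one packed op
    print(dot4_aos(vertices_aos[0], vertices_aos[1]))
```

In SOA form, a single packed instruction can process the x values of four vertices at once. With the horizontal add, the data can stay in AOS form and the sum across one register's lanes no longer requires shuffling the elements by hand.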

The above was a simplified, abridged version of the highlights of the Prescott changes that will lead to improved performance in the right applications. The big question, however, is how Prescott will perform in the garden-variety applications that everybody uses, especially compared to the Northwood. And where does Prescott stand with respect to the competition, the phalanx of AMD processors?


General disclaimer: This page only reflects the author's personal opinion and assumes no responsibility whatsoever regarding any of the contents or any damages that may occur explicitly or implicitly from reading the contents of this site. All names and trademarks mentioned in this review are the exclusive property of the respective parent companies.
All contents of this site are protected by international copyright laws. Reproduction of the contents even in parts is not allowed except after written permission by the author and referral to this site.
Copyright 2002 - 2008 LostCircuits