LOSTCIRCUITS

 Intel Pentium4 "Prescott"
Strained to the Silicon
(Review by MS, Feb. 1, 2004)
Prescott Enhancements

Pipeline Length

The most obvious difference between Northwood and Prescott is the length of the pipeline that instructions and data traverse on their way through the processor core. Depending on the counting scheme applied, different numbers are floating around, but on a 1:1 comparison basis, if the Northwood pipeline is counted as 20 stages, the equivalent Prescott pipeline has 31. In other words, the number of clock cycles needed for data and instructions to pass through the pipeline has increased from 20 to 31. As long as nothing goes wrong, there is very little difference between a long and a short pipeline. If, however, a prediction turns out to be wrong and the wrong data have been speculatively loaded into the pipeline, these data still have to travel the entire length of the pipeline before they can be evicted at the back end. This causes a bubble, or delay, whose size increases with the length of the pipeline.
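To get a feel for how the bubble grows with pipeline length, here is a rough back-of-the-envelope sketch. It assumes that a wrong prediction costs about one full pipeline length in cycles; the branch fraction, misprediction rate and base cycles-per-instruction value are hypothetical illustration numbers, not measurements of either core.

```python
def effective_cpi(pipeline_stages, branch_fraction, mispredict_rate, base_cpi=1.0):
    """Average cycles per instruction in a simple flush-penalty model.

    Assumption: every misprediction costs roughly one full pipeline length.
    """
    penalty = pipeline_stages
    return base_cpi + branch_fraction * mispredict_rate * penalty


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 5% of them mispredicted.
    for name, stages in (("Northwood", 20), ("Prescott", 31)):
        cpi = effective_cpi(stages, branch_fraction=0.20, mispredict_rate=0.05)
        print(f"{name}: {stages} stages, about {cpi:.2f} cycles per instruction")
```

In this toy model the penalty term scales linearly with the stage count, so the longer pipeline only hurts in proportion to how often the prediction is wrong.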


Force Forwarding

Earlier steppings of the P4 feature a store-to-load forwarding mechanism that uses address comparators to determine whether there is a partial address match between the cached data and the data in the store forwarding buffer (SFB). If the cached data are older, the newer data are pulled directly from the SFB for further processing, after which the L1 cache is updated accordingly. This mechanism does not always work as advertised: not every partial address match refers to the same data, and address misalignments can occur as well. The latest addition to this compare-and-update mechanism is therefore the so-called Force Forwarding, which allows the Memory Ordering Buffer (MOB), through a forwarding-entry-selection multiplexing scheme, to override the SFB selection decision and thus avoid wrong decisions early in the pipeline.

The obvious question is, of course, why use a partial address match at all if it can cause problems. The answer is rather simple: the smaller the chunk of the address needed to identify the correct data in the cache, the earlier it becomes available. Access latencies can therefore be reduced on the basis of a partial "glimpse into the future", especially when it comes to determining whether the data are found within the L1 cache or whether there will be a "cache miss". If the partial address grows in size, the probability of a correct match naturally increases too, but at the price of a somewhat higher latency. In the end, it is like "Wheel Of Fortune": one can always wait until all characters are displayed, but there is also the chance to cut ahead and come up with the right solution based on a partial match and speculation.
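The trade-off between a fast partial match and a slower full check can be illustrated with a toy model. The sketch below is not Intel's actual logic: the buffer layout, the function names and the number of compared address bits are all hypothetical. It only shows how a partial compare can pick the wrong buffer entry and how a later full-address check can override that choice.

```python
def partial_match(load_addr, store_addr, bits):
    """Compare only the low `bits` address bits (the fast, early check)."""
    mask = (1 << bits) - 1
    return (load_addr & mask) == (store_addr & mask)


def early_select(load_addr, sfb, bits):
    """Pick the newest buffer entry whose partial address matches."""
    for addr, data in reversed(sfb):
        if partial_match(load_addr, addr, bits):
            return addr, data
    return None


def verified_select(load_addr, sfb, bits):
    """Confirm the early guess with the full address and override it if wrong."""
    guess = early_select(load_addr, sfb, bits)
    if guess is not None and guess[0] == load_addr:
        return guess[1]  # the early guess was right
    for addr, data in reversed(sfb):  # late override with the full address
        if addr == load_addr:
            return data
    return None  # nothing to forward, the load falls back to the cache


if __name__ == "__main__":
    # Hypothetical buffer contents, oldest entry first.
    sfb = [(0x1040, "A"), (0x2040, "B")]
    print(early_select(0x1040, sfb, bits=12))     # wrong entry: low 12 bits collide
    print(verified_select(0x1040, sfb, bits=12))  # override returns the right data
    print(early_select(0x1040, sfb, bits=16))     # more bits: correct, but later
```

Comparing more bits removes the collision in this example, which mirrors the point made above: a wider compare is more reliable but becomes available later.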

Speaking of the Level 1 cache: the Willamette and Northwood cores, as well as the Gallatin core used in the Extreme Edition, feature a 4-way set-associative Level 1 data cache. In the Prescott core, the L1 data cache has been increased to 16 kB and the associativity has been raised to 8-way.
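What the change in size and associativity means for the cache organization can be worked out with a little arithmetic. The sketch below assumes 64-byte cache lines and an 8 kB cache for the older cores; both values are assumptions made for illustration and are not taken from the figures quoted on this page.

```python
def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, ways and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


if __name__ == "__main__":
    # Assumed configurations: (label, size in bytes, associativity).
    for label, size, ways in (("4-way, 8 kB (assumed)", 8 * 1024, 4),
                              ("8-way, 16 kB", 16 * 1024, 8)):
        sets, offset_bits, index_bits = cache_geometry(size, ways)
        print(f"{label}: {sets} sets, {offset_bits} offset bits, {index_bits} index bits")
```

Under these assumptions, doubling both the size and the associativity leaves the number of sets unchanged at 32, so the same address bits select a set and only the number of candidate lines per set grows.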

Execution Units

The integer execution units are largely unchanged; however, a shifter/rotator block was added to one of the ALUs. In addition, all P4 cores so far had to use the floating point unit to execute integer multiplications. This resulted in extra latency, since the source operands had to be moved from the ALUs to the FP side and the results had to be transferred back to the integer units. This issue has been solved with the addition of a dedicated integer multiplier.


General disclaimer: This page only reflects the author's personal opinion and assumes no responsibility whatsoever regarding any of the contents or any damages that may occur explicitly or implicitly from reading the contents of this site. All names and trademarks mentioned in this review are the exclusive property of the respective parent companies.
All contents of this site are protected by international copyright laws. Reproduction of the contents even in parts is not allowed except after written permission by the author and referral to this site.
Copyright 2002 - 2008 LostCircuits