LOSTCIRCUITS

SHORTCUTS:
The Ultimate Dream Machine
Snoops and Filters
Intel's Snoop Filter Implementation
SkullTrail Extreme D5400SX
System Configurations
Power Play
3D Rendering and Energy Efficiency
Cinebench
DVD-Shrink, MainConcept
VirtualDub / DivX 6.8
3DMark'06
World In Conflict
Crysis
F.E.A.R.
UnrealTournament3
The Final Analysis


 Intel's SkullTrail Extreme Platform
Playground of the Titans
(Author: Michael Schuette, February 10, 2008)

SMP and Cluster-Snooping

To recapitulate: at the heart and soul of the system are two CPUs with independent host buses and two memory controllers talking to two separate branches of FBDIMMs. Especially with four cores on each CPU, parallel processing is not as easy as simply running the two processors in tandem. One reason is that each CPU has its own dedicated high-speed internal memory, also known as cache, for fast data access. The situation is further complicated by the separate memory branches and host buses, convenient as they are for increasing bandwidth.

Cache and memory coherency have been discussed in so many articles that we can skip the details here, except to stress that each physical and logical core needs to know what the others are doing and which data they hold in their caches, in order to avoid duplicating the same workload or working on outdated data.

Cache coherency is usually achieved by a process called snooping. In simple terms, any data request is broadcast to the caches of all processors to see whether any of them already holds the data and, if so, what state the data are in according to the MESI (Modified, Exclusive, Shared, Invalid) or MOESI (Modified, Owned, Exclusive, Shared, Invalid) protocol, used by Intel and AMD, respectively. In a two-way SMP system this is relatively straightforward: one CPU simply snoops the other and proceeds accordingly. In a four-way SMP system, things get a little more complicated in that every CPU has to snoop out three others, and in an eight-way system, seven other CPUs are constantly subjected to snooping. Keep in mind that it is not just one CPU that does the snooping; every CPU has to do it, which results in a regular cluster-snoop. For anyone who has difficulty visualizing this scenario, it is somewhat analogous to dogs sniffing (or snooping) each other out. The main difference between dogs and CPUs is that only one CPU at a time can snoop out the others, a bit like the dogs having to take a number and wait their turn.
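
To make the broadcast concrete, here is a minimal Python sketch of MESI snooping on a read miss. It is an illustration under simplifying assumptions, not Intel's implementation: the class and method names are hypothetical, each cache is a dictionary from line address to MESI state, and writes, invalidations, data transfer and bus timing are left out. Each miss issues one probe per other cache, so in an n-CPU system the snoop counter grows by n-1 for every miss.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class SnoopingBus:
    """Toy model of broadcast snooping with MESI states (hypothetical sketch).

    Each CPU's cache is a dict mapping a cache-line address to its MESI state;
    a missing entry means Invalid. Only read misses are modelled.
    """

    def __init__(self, num_cpus):
        self.caches = [{} for _ in range(num_cpus)]
        self.snoops = 0  # cache probes issued on the shared bus

    def read(self, cpu, line):
        """Return the requesting CPU's MESI state for `line` after the read."""
        cache = self.caches[cpu]
        if cache.get(line, State.INVALID) is not State.INVALID:
            return cache[line]  # cache hit: no snoop needed
        found_elsewhere = False
        # Cache miss: the request is broadcast to every other CPU's cache,
        # one probe per cache.
        for other, other_cache in enumerate(self.caches):
            if other == cpu:
                continue
            self.snoops += 1
            if other_cache.get(line, State.INVALID) is not State.INVALID:
                # Any valid copy (M, E or S) is downgraded to Shared. In real
                # MESI a Modified owner must also write its dirty data back;
                # that write-back is not modelled here.
                other_cache[line] = State.SHARED
                found_elsewhere = True
        # Nobody else holds the line -> Exclusive, otherwise Shared.
        cache[line] = State.SHARED if found_elsewhere else State.EXCLUSIVE
        return cache[line]


if __name__ == "__main__":
    bus = SnoopingBus(4)
    print(bus.read(0, 0x40))  # State.EXCLUSIVE (3 snoops, no other copy)
    print(bus.read(1, 0x40))  # State.SHARED (CPU 0's copy is downgraded too)
    print(bus.snoops)         # 6
```

Even in this stripped-down form, every miss touches every other cache, which is the cost the rest of this page is concerned with.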

In a nutshell, in the aforementioned “cluster-snoop”, where everybody checks out everybody else, snoop traffic grows roughly with the square of the number of CPUs or cores in the cluster. A 4-way system therefore generates about four times the snoop traffic of a 2-way system (4²/2² = 16/4), and an 8-way system builds up a rush hour of 16 times the traffic of a 2-way configuration (8²/2² = 64/4).
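
The arithmetic behind these ratios can be checked with a few lines of Python. The n² rule of thumb reproduces the figures above; for comparison, the sketch also counts snooper/snooped pairs directly as n(n-1), following the earlier observation that each of n CPUs snoops the other n-1. That pair count is our own illustration, not a measured figure, and both numbers are only proportional measures of traffic, not bus cycles.

```python
def snoop_traffic(n_cpus):
    """Return (rule of thumb, pairwise count) of snoop traffic for n CPUs.

    rule of thumb: n**2, as used in the text.
    pairwise: n * (n - 1), each CPU probing every other CPU once.
    """
    return n_cpus ** 2, n_cpus * (n_cpus - 1)


def relative_to_two_way(n_cpus):
    """Snoop traffic of an n-way system relative to a 2-way system."""
    approx, pairwise = snoop_traffic(n_cpus)
    base_approx, base_pairwise = snoop_traffic(2)
    return approx / base_approx, pairwise / base_pairwise


if __name__ == "__main__":
    for n in (2, 4, 8):
        approx_ratio, pairwise_ratio = relative_to_two_way(n)
        print(f"{n}-way: x{approx_ratio:g} (n^2 rule), x{pairwise_ratio:g} (n(n-1) pairs)")
    # 2-way: x1 (n^2 rule), x1 (n(n-1) pairs)
    # 4-way: x4 (n^2 rule), x6 (n(n-1) pairs)
    # 8-way: x16 (n^2 rule), x28 (n(n-1) pairs)
```

Either way the traffic grows quadratically with the number of CPUs, which is the point of the paragraph above.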

Short of using a dedicated snoop sideband, the only way to perform snooping is over the same address bus that interfaces the CPU(s) with the host system, and it does not take a rocket scientist to figure out that, at some point, snoop traffic will become a major bottleneck in overall system communication. Mainframe computers have faced this problem for somewhat longer, so it is no surprise that IBM developed a possible solution a few years ago in the form of what is called a “Snoop Filter”. In short, IBM uses eDRAM (embedded DRAM, integrated on the same die as the logic for fast access) as a snoop filter to reduce traffic between the different bus segments.

In simple terms: if a CPU needs data that are not in its cache, i.e., a cache miss occurs, the CPU issues a snoop on the bus, which includes the snoop filter. If the snoop misses in the snoop filter (the respective data are not catalogued there), the request is routed directly to main memory for a read. If the snoop hits in the snoop filter, indicating that the target cache line may exist in another CPU's cache, the request is propagated to the other bus segments to see whether that processor still has the data cached. If the data are no longer in that cache, the request is rerouted to system memory. In other words, instead of probing every cache of every processor, the snoop filter provides a master table of which data reside in which CPU's cache, so each CPU only needs to query the snoop filter rather than all companion CPUs for the initial status check.
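
The lookup flow can be modelled in a few lines of Python. This is a hypothetical sketch, not Intel's or IBM's design: the class and method names are ours, the filter is an unbounded dictionary keyed by cache-line address, bus segments are simplified to individual CPUs, and MESI states and timing are ignored. A silent eviction leaves a stale table entry, which mirrors the point that a filter hit only means the line may exist in another cache.

```python
class SnoopFilter:
    """Toy snoop filter: a table of which CPUs may hold each cache line.

    Hypothetical sketch; real snoop filters are fixed-size hardware tables
    with their own replacement rules, which are not modelled here.
    """

    def __init__(self, num_cpus):
        self.table = {}  # line address -> set of CPU ids that may cache it
        self.caches = [set() for _ in range(num_cpus)]  # lines really cached
        self.probes = 0        # snoops forwarded to other CPUs
        self.memory_reads = 0  # requests routed to system memory

    def read_miss(self, cpu, line):
        """Handle a cache miss by `cpu`; return the CPUs that still hold `line`."""
        candidates = self.table.get(line, set()) - {cpu}
        holders = set()
        # Filter miss: no candidates, so no other CPU is probed at all.
        # Filter hit: probe only the CPUs listed in the table.
        for other in candidates:
            self.probes += 1
            if line in self.caches[other]:
                holders.add(other)
        if not holders:
            # Filter miss, or the listed CPUs have since dropped the line.
            self.memory_reads += 1
        # Record the requester as a (possible) holder of the line.
        self.caches[cpu].add(line)
        self.table.setdefault(line, set()).add(cpu)
        return holders

    def evict(self, cpu, line):
        """Silent eviction: the cache drops the line but the table entry stays."""
        self.caches[cpu].discard(line)


if __name__ == "__main__":
    sf = SnoopFilter(8)
    print(sf.read_miss(0, 0x100))  # set(): filter miss, straight to memory
    print(sf.read_miss(5, 0x100))  # {0}: one probe instead of seven
    sf.evict(0, 0x100)
    print(sf.read_miss(3, 0x100))  # {5}: CPU 0's entry is stale, CPU 5 still has it
    print(sf.probes, sf.memory_reads)  # 3 1
```

With plain broadcast snooping, each of these three misses in an 8-CPU system would probe all seven other caches (21 probes); the filter issues three.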



General disclaimer: This page only reflects the author's personal opinion and assumes no responsibility whatsoever regarding any of the contents or any damages that may occur explicitly or implicitly from reading the contents of this site. All names and trademarks mentioned in this review are the exclusive property of the respective parent companies.
All contents of this site are protected by international copyright laws. Reproduction of the contents even in parts is not allowed except after written permission by the author and referral to this site.
Copyright 2002 - 2008 LostCircuits