LOSTCIRCUITS
Intel's SkullTrail Extreme Platform: Playground of the Titans
(Author: Michael Schuette, February 10, 2008)
SMP and Cluster-Snooping
To recapitulate, we are up to two CPUs with independent host buses and two memory controllers talking to two different branches of FBDIMMs as the heart and soul of the system. Needless to say, especially with four cores on each CPU, parallel processing is not as easy as just having the two processors run in tandem. One reason is that each CPU has its own dedicated high-speed internal memory, also known as cache, for fast access to data. The situation is further complicated by the different branches of memory and the separate host buses, convenient as they are for increasing bandwidth.
The issues of cache and memory coherency have been discussed in so many articles that we may skip the details here, other than to stress that each physical and logical core needs to know what the others are doing and what data they hold in their caches, in order to avoid duplicating the same workload or working on outdated data.
Cache coherency is usually achieved by a process called snooping, which means in simple terms that any data request is broadcast to all caches of all processors to see whether any of them already holds the data and, if so, what the status flag of those data is according to the MESI (Modified, Exclusive, Shared or Invalid) or MOESI (Modified, Owned, Exclusive, Shared or Invalid) protocols used by Intel and AMD, respectively. In a two-way SMP system, this is relatively straightforward: one CPU simply snoops the other and proceeds accordingly with processing. In a four-way SMP system, things get a little more complicated in that every CPU has to snoop out three others. In an eight-way system, it is seven other CPUs that are constantly subjected to snooping. Keep in mind here that it is not just one CPU that does the snooping; rather, each CPU has to do it, which results in a regular cluster-snoop. In case somebody has difficulties visualizing this scenario, it is kind of analogous to dogs sniffing (or snooping) each other out. The main difference between dogs and CPUs is that, at any time, only one CPU can snoop out the others, kind of like the dogs having to take a number and wait their turns.
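The broadcast described above can be sketched in a few lines of Python. This is a minimal illustration only, not how any real chipset implements snooping: the `Cache` class, the `broadcast_snoop` function and the address `0x1000` are all hypothetical names invented for the example, and only the four MESI states are modeled.

```python
from enum import Enum


class State(Enum):
    """The four MESI states named in the text."""
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Cache:
    """Hypothetical per-CPU cache: maps a line address to its MESI state."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def snoop(self, addr):
        # A line that is absent is reported the same way as an invalid one.
        return self.lines.get(addr, State.INVALID)


def broadcast_snoop(requester, caches, addr):
    """Ask every cache except the requester for its state of addr.

    Returns the per-cache responses and the number of caches probed.
    """
    responses = {}
    for cache in caches:
        if cache is requester:
            continue
        responses[cache.name] = cache.snoop(addr)
    return responses, len(responses)


# Four-way example: cpu2 holds a modified copy of the requested line.
caches = [Cache(f"cpu{i}") for i in range(4)]
caches[2].lines[0x1000] = State.MODIFIED
responses, probes = broadcast_snoop(caches[0], caches, 0x1000)
```

In this four-way example the requester has to probe three other caches for a single request, which is the "every CPU has to snoop out three others" case from the paragraph above.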
In a nutshell, in the aforementioned “cluster-snoop” where everybody checks out everybody else, the snoop traffic can be described as growing roughly with the square of the number of CPUs or cores in the cluster. A 4-way system requires four times the snoop traffic of a 2-way system (16/4 = 4) and an 8-way system builds up a rush hour of 16 times the traffic of a 2-way configuration (64/4 = 16).
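The arithmetic behind those ratios is easy to check. The sketch below uses the rough square-law model from the text; the function names are made up for illustration. For comparison it also counts the exact number of (snooper, target) pairs when each of n CPUs probes the n - 1 others, which is an added assumption about how to count and not a figure from the article.

```python
def snoop_traffic_estimate(n_cpus):
    """Rough model from the text: traffic grows with the square of the CPU count."""
    return n_cpus ** 2


def snoop_pairs(n_cpus):
    """Number of (snooper, target) pairs if every CPU probes every other CPU."""
    return n_cpus * (n_cpus - 1)


def relative_to_two_way(n_cpus, model=snoop_traffic_estimate):
    """Traffic of an n-way system as a multiple of a 2-way system."""
    return model(n_cpus) / model(2)


for n in (2, 4, 8):
    print(n, relative_to_two_way(n), relative_to_two_way(n, snoop_pairs))
```

Both counts grow quadratically; the pair count simply yields somewhat larger multiples of the 2-way baseline than the rounded square law.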
Short of using a dedicated snoop sideband, the only way to perform snooping is by using the same address bus that interfaces the CPU(s) with the host system, and it does not take a rocket scientist to figure out that at some point the snoop traffic will become a major clog in the overall system communication. Mainframe computers have faced this problem for a bit longer, and it is no surprise that a possible solution was developed by IBM a few years ago in the form of what is called a “Snoop Filter”. In short, IBM uses eDRAM (embedded DRAM, that is, DRAM integrated on the same die or package as the logic it serves) for fast access as a snoop filter to reduce traffic between the different bus segments.
In simple terms, if a CPU needs data that are not in its cache, i.e., a cache miss occurs, the CPU puts out a snoop on the bus, which also reaches the snoop filter. If the snoop misses the snoop filter (the respective data are not catalogued in the snoop filter), the request is routed directly to main memory for a read. If the snoop hits the snoop filter, that is, the filter shows that the target cache line may exist in another CPU, the request is propagated to the other bus segments to see whether any other processor still has the data cached. If the data are no longer in that cache, the request is rerouted to system memory for data access. In other words, instead of hitting every cache of each processor, the snoop filter provides a master table for all CPUs of which cache lines are held in which cache, and each CPU only needs to access the snoop filter rather than all companion CPUs for the initial status query.
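The three outcomes described above (filter miss, filter hit with the line still cached, filter hit with a stale entry) can be sketched as follows. This is a toy model under stated assumptions, not IBM's or Intel's implementation: `SnoopFilter`, `handle_cache_miss`, the per-CPU dictionaries and the addresses are hypothetical, and the filter is assumed to track which CPU may hold each line.

```python
class SnoopFilter:
    """Hypothetical master table: line address -> set of CPU ids that may hold it."""

    def __init__(self):
        self.table = {}

    def record(self, addr, cpu_id):
        self.table.setdefault(addr, set()).add(cpu_id)

    def lookup(self, addr, requester):
        # Returns a new set, so the table can be updated while iterating over it.
        return self.table.get(addr, set()) - {requester}


def handle_cache_miss(addr, requester, snoop_filter, caches, memory):
    """Resolve a miss; return (data, source, number of other caches probed)."""
    data, source, probed = None, "memory", 0
    for cpu_id in sorted(snoop_filter.lookup(addr, requester)):
        probed += 1  # filter hit: forward the snoop to this CPU only
        if addr in caches[cpu_id]:
            data, source = caches[cpu_id][addr], f"cpu{cpu_id}"
            break
        # Stale entry: the line was evicted, so drop it from the filter.
        snoop_filter.table[addr].discard(cpu_id)
    if data is None:
        data = memory[addr]  # filter miss or stale hit: read main memory
    caches[requester][addr] = data
    snoop_filter.record(addr, requester)
    return data, source, probed


memory = {0x10: "a", 0x20: "b", 0x30: "c"}
caches = {0: {}, 1: {0x20: "b"}, 2: {}, 3: {}}
snoop_filter = SnoopFilter()
snoop_filter.record(0x20, 1)
snoop_filter.record(0x30, 2)  # stale: cpu2 no longer caches 0x30

filter_miss = handle_cache_miss(0x10, 0, snoop_filter, caches, memory)
filter_hit = handle_cache_miss(0x20, 0, snoop_filter, caches, memory)
stale_hit = handle_cache_miss(0x30, 0, snoop_filter, caches, memory)
```

In the filter-miss case no other cache is probed at all, whereas a plain broadcast in this four-CPU setup would have probed three; that difference is the traffic saving the text describes.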