KTE wrote:justapost wrote:I tried that benchmark at home with my 5050e and 2GB DDR2-1066 under 64-bit Debian Lenny. With gcc 4.3.4 I get ~3MB/s; with the latest icc, ~6MB/s (autopar enabled by default, I assume).
I think you mean GB/s
Yepp.
Here's what I get under Win 7 64b vs. Win XP 32b (K8 X2 at 2.0GHz / DDR2-800 4-3-4-9-16). 256MB gone to IGP, 2GB gone to RAMDisk, default binary.
stream.png
Odd, I expected the results to be bandwidth per core, not the sum over all running threads. I ran single-threaded and thought 3GB/s was the maximum per core, so that two cores (auto-parallelized by icc) could reach 6GB/s.
I will have to figure out how to build an OpenMP version under Linux to compare gcc vs. icc again.
Completely wrong assumption on my side; this one hurts.
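To make the per-core vs. aggregate point concrete, here is a rough sketch of how STREAM arrives at its number: it counts the bytes each kernel moves (two arrays for Copy and Scale, three for Add and Triad, 8 bytes per double) and divides by the best wall-clock time of the whole loop, so with a threaded build the figure is the total for all threads together. The function name, the array size and the timing below are made up for illustration and are not taken from anyone's run in this thread.

```python
# Sketch of STREAM's bandwidth arithmetic (names and inputs are hypothetical).
BYTES_PER_DOUBLE = 8

# Number of arrays each kernel reads or writes per element.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_rate_mb_s(kernel, n_elements, best_seconds):
    """Return the reported rate in MB/s (1 MB = 10**6 bytes).

    best_seconds is the wall-clock time of the whole loop, so a
    multi-threaded run yields the aggregate rate, not a per-core rate.
    """
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_DOUBLE * n_elements
    return bytes_moved / best_seconds / 1e6


if __name__ == "__main__":
    # Hypothetical example: 2,000,000-element arrays, Triad takes 8 ms.
    print(stream_rate_mb_s("triad", 2_000_000, 0.008))
```

If two threads finish the same loop in half the time, the reported rate doubles even though neither core moved more data per second than before.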
KTE wrote:B/W should go up as the number of accessing cores increases, unless B/W hits a bottleneck, and that bottleneck will keep any high-end server workload from scaling as all the cores fight for it. Istanbul's HT Assist does enable higher STREAM B/W ... actually I've just read on the AMD blog that HT Assist is disabled on 2S servers; no wonder they get no benefit.
HT Assist, or the Probe Filter as it is sometimes called, works by using part of the processor's L3 cache as a directory cache. This directory cache tracks all cache lines cached in the system. Instead of generating numerous cache probes when checking a cache line the processor does a Probe Filter Lookup. This helps lower latency for accesses to local DRAM because there is no need to wait for probe responses when accessing local data. This also means there is less queuing delay due to the lower HyperTransport technology traffic. With significantly reduced probe traffic it effectively also increases system bandwidth performance. It also should be noted that the directory cache uses 1MB of the 6MB L3 cache in the case of the Six-Core AMD Opteron processor. As well, HT Assist is only enabled on 4-socket and 8-socket systems, where the performance benefits largely outweigh the small decrease in available L3 data cache. On the other hand, HT Assist is not enabled on 2-socket systems where there is much less cache probe traffic and the full L3 cache is utilized.
We've measured the difference of HT Assist on Six-Core AMD Opteron processors and the results are nothing but stunning. On the same 4-socket system, we measured 42GB/s of memory bandwidth with the STREAM benchmark with HT Assist, while only getting 25.5GB/s when HT Assist is disabled.* For 4-socket and 8-socket Six-Core AMD Opteron processor-based systems, this can translate into a significant performance uplift for applications that depend on cache performance, memory bandwidth, and system scalability.
So on the dual-socket Istanbul setup the cache probes between the two CPUs increase and there is less L3 cache available for each core compared to the dual Shanghai setup. It seems 20-21GB/s is the limit for 2 x dual-channel DDR2-800. Magny-Cours will reach this bandwidth with a single-socket solution and DDR3-1066, and a dual-socket system of those MCM chips will hit 42GB/s, or above 50GB/s with DDR3-1333.
On the other hand, those specs say 85.6GB/s DRAM bandwidth for dual G34 systems.
http://www.semiaccurate.com/2009/08/24/ ... ocket-g34/
But on the Intel side, Nehalem has triple-channel DDR3 for four cores but will only have four DDR3 channels for eight cores.
So if IPC is the limit, Magny-Cours might have an advantage due to the higher core count, and if memory is the limit, Intel's current advantage will shrink (on two- and four-socket systems).
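For comparing those figures, a quick sketch of the theoretical peak may help: transfers per second times 8 bytes per 64-bit channel, times channels per socket, times sockets. The function name and the channel counts passed in are my own assumptions (dual-channel per socket for the DDR2-800 boxes, four channels per G34 socket). For 2 x dual-channel DDR2-800 this gives 25.6GB/s, so the 20-21GB/s seen in STREAM is roughly 80% of peak. For dual G34 with DDR3-1333 it gives about 85.3GB/s, and the quoted 85.6GB/s is what you get when each channel is rounded to 10.7GB/s first.

```python
# Theoretical peak DRAM bandwidth (sketch; names and configs are assumptions).

def peak_bandwidth_gb_s(mega_transfers_per_s, channels_per_socket, sockets,
                        bus_bytes=8):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for 64-bit wide channels."""
    bytes_per_s = (mega_transfers_per_s * 1e6 * bus_bytes
                   * channels_per_socket * sockets)
    return bytes_per_s / 1e9


if __name__ == "__main__":
    configs = [
        ("2S, dual-channel DDR2-800", 800, 2, 2),
        ("1S G34, 4 x DDR3-1066", 1066, 4, 1),
        ("2S G34, 4 x DDR3-1333", 1333, 4, 2),
    ]
    for name, mts, channels, sockets in configs:
        peak = peak_bandwidth_gb_s(mts, channels, sockets)
        print(f"{name}: {peak:.1f} GB/s peak")
```

These are upper bounds only; what STREAM actually sustains depends on the memory controller, timings and probe traffic discussed above.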
Update: Here are my results: 5050e 2.6GHz, 2GB DDR2-1066 5-5-5-15
The first and second used a binary built with gcc-4.3.4 and -O2 -fopenmp, the third one was built with gcc-4.3.4 but with no optimizations, and the last one used icc 11.1 with -O2 -openmp.



