Advanced Clustering

Moderator: M_S

Postby M_S on Fri Aug 28, 2009 5:41 pm

Some interesting articles on Istanbul vs. Nehalem:

http://www.advancedclustering.com/compa ... intel.html

http://www.advancedclustering.com/compa ... rking.html

http://www.advancedclustering.com/compa ... -2400.html

I don't know how real these guys are, but it is definitely a slightly different picture from what we see painted all over the place.
M_S
 
Posts: 1456
Joined: Mon Nov 03, 2008 12:25 pm

Re: Advanced Clustering

Postby KTE on Sat Aug 29, 2009 2:09 am

I linked to one of those studies in the UCLK thread because it's at least some better tests than what enthusiast reviewers can do. I don't really know why desktop hardware junkies benchmark server CPUs when they have shown they don't have a clue about the server world apart from workstations (Johan seems to be the only one of them who understands it). These are the only independent HPC tests I've seen of Istanbul/Nehalem by a known player in the field selling both systems. Before this we only had workstation tests or a few 2P tests, which are entirely different segments. I've followed these guys for a while. I don't see anything wrong with their data; it looks much more legitimate than what else is around.

For HPC, I actually haven't understood it any other way since Istanbul was released. HPC is one area where Istanbul would make decent gains, one chief reason being that AMD's Shanghai is heavily RAM and HT bottlenecked, which is killing its core-to-core and CPU-to-CPU scaling. It was even worse with Barcelona. But what they don't show is that Istanbul only scales that well when there's lower DRAM bandwidth contention. Beyond about 8 threads accessing memory, the bandwidth starts to fall rapidly, to even below Shanghai level. The HT Assist B/W benefits disappear.

I wonder what they'll need for 16 cores of Bulldozer then. DRAM B/W will get cut down even more per core. Quad-channel DDR3-1600?
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am

Re: Advanced Clustering

Postby Aussie FX on Sat Aug 29, 2009 5:22 am

KTE wrote:
I wonder what they'll need for 16 cores of Bulldozer then. DRAM B/W will get cut down even more per core. Quad-channel DDR3-1600?

That would sound reasonable, as it will fit into socket G34, which is quad channel. Well, the Magny-Cours motherboards are, so that's a bit of an assumption. ;)
Aussie FX
 
Posts: 510
Joined: Sun Nov 09, 2008 3:49 am

Re: Advanced Clustering

Postby M_S on Sat Aug 29, 2009 6:52 am

KTE wrote:I linked to one of those studies in the UCLK thread because it's at least some better tests than what enthusiast reviewers can do.

Yes, I remember now, never got time to read the whole thing, this week's been a tad more than crazy.

Vincent made some interesting points about Intel's [Mul] functions and some possible code optimizations but he never posts here and does everything via IM.. where he typically loses me :P
M_S
 
Posts: 1456
Joined: Mon Nov 03, 2008 12:25 pm

Re: Advanced Clustering

Postby justapost on Sat Aug 29, 2009 12:23 pm

M_S wrote:Some interesting articles on Istanbul vs. Nehalem:

http://www.advancedclustering.com/compa ... intel.html

http://www.advancedclustering.com/compa ... rking.html

http://www.advancedclustering.com/compa ... -2400.html

I don't know how real these guys are, but it is definitely a slightly different picture from what we see painted all over the place.

The Linpack comparison was discussed on XS a few weeks ago, but looking at all three tests gives a better picture.
Looking at the STREAM benchmark, the Xeon system gets a 48% boost going from 800MHz to 1333MHz. I guess it's not so easy to estimate what Magny-Cours will get with DDR3-1333. Assuming DDR2-800 equals DDR3-1066, this benchmark should score around 25GB/s with dual-channel DDR3 or around 30GB/s with DDR3-1600.
Looking at Istanbul vs. Shanghai here, there is only a small drop in memory bandwidth even with DDR2-800. Doesn't that mean six cores is the maximum before there is a memory bottleneck with dual-channel DDR2-800?
Assuming DDR2-800 roughly equals DDR3-1066, it would take ~DDR3-1420 before there's a memory bottleneck with eight vs. six cores. As I understand Magny-Cours and Interlagos, Magny-Cours will have two IMCs with dual-channel DDR3-1333 (2x128bit or 4x64bit). Interlagos will fit in the same socket and may have something like 1x256bit in addition, but not 1x512bit, 2x256bit or 4x128bit, or would that be possible in the same socket?
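
For anyone who wants to redo the back-of-the-envelope numbers above, here is a minimal Python sketch. It assumes bandwidth scales linearly with the memory data rate and that bandwidth per core should stay constant. The 20 GB/s baseline is an assumption (roughly the 2S DDR2-800 STREAM figure discussed later in this thread), and the function names are made up for illustration.

```python
def scale_bandwidth(base_gbs, base_rate_mts, new_rate_mts):
    """Scale a measured bandwidth linearly with the memory data rate."""
    return base_gbs * new_rate_mts / base_rate_mts


def data_rate_for_cores(base_rate_mts, base_cores, new_cores):
    """Data rate needed to keep the bandwidth per core unchanged."""
    return base_rate_mts * new_cores / base_cores


# Assumption: ~20 GB/s measured with DDR2-800, treated as equal to DDR3-1066.
BASE_GBS = 20.0

print(round(scale_bandwidth(BASE_GBS, 1066, 1333), 1))  # 25.0
print(round(scale_bandwidth(BASE_GBS, 1066, 1600), 1))  # 30.0
print(round(data_rate_for_cores(1066, 6, 8)))           # 1421
```

Real scaling will be less than linear (the Xeon only gained 48% from a 67% higher data rate), so these are upper bounds rather than predictions.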

I tried that benchmark at home with my 5050e and 2GB DDR2-1066 under 64-bit Debian Lenny. With gcc 4.3.4 I get ~3MB/s, with the latest icc ~6MB/s (autopar enabled by default, I assume).
justapost
 
Posts: 129
Joined: Sat Feb 21, 2009 6:39 am

Re: Advanced Clustering

Postby KTE on Sat Aug 29, 2009 1:34 pm

justapost wrote:I tried that benchmark at home with my 5050e and 2GB DDR2 1066 under 64bit debian lenny. With gcc 4.3.4 i get ~3MB/s with the latest icc ~6MB/s (autopar enabled by default I assume).

I think you mean GB/s :P
Here's what I get under Win 7 64b vs. Win XP 32b (K8 X2 at 2.0GHz / DDR2-800 4-3-4-9-16). 256MB gone to IGP, 2GB gone to RAMDisk, default binary.

stream.png


B/W should go up as the number of accessing cores increases, unless B/W reaches a bottleneck, and that bottleneck will hamper any high-end server workload from scaling as all the cores fight for it. Istanbul's HT Assist does enable higher STREAM B/W ... actually, I've just read on the AMD blog that HT Assist is disabled on 2S servers, no wonder they get no benefit.
HT Assist, or the Probe Filter as it is sometimes called, works by using part of the processor's L3 cache as a directory cache. This directory cache tracks all cache lines cached in the system. Instead of generating numerous cache probes when checking a cache line the processor does a Probe Filter Lookup. This helps lower latency for accesses to local DRAM because there is no need to wait for probe responses when accessing local data. This also means there is less queuing delay due to the lower HyperTransport technology traffic. With significantly reduced probe traffic it effectively also increases system bandwidth performance. It also should be noted that the directory cache uses 1MB of the 6MB L3 cache in the case of the Six-Core AMD Opteron processor. As well, HT Assist is only enabled on 4-socket and 8-socket systems, where the performance benefits largely outweigh the small decrease in available L3 data cache. On the other hand, HT Assist is not enabled on 2-socket systems where there is much less cache probe traffic and the full L3 cache is utilized.

We've measured the difference of HT Assist on Six-Core AMD Opteron processors and the results are nothing but stunning. On the same 4-socket system, we measured 42GB/s of memory bandwidth with the STREAM benchmark with HT Assist, while only getting 25.5GB/s when HT Assist is disabled.* For 4-socket and 8-socket Six-Core AMD Opteron processor-based systems, this can translate into a significant performance uplift for applications that depend on cache performance, memory bandwidth, and system scalability.
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am

Re: Advanced Clustering

Postby justapost on Sat Aug 29, 2009 3:19 pm

KTE wrote:
justapost wrote:I tried that benchmark at home with my 5050e and 2GB DDR2 1066 under 64bit debian lenny. With gcc 4.3.4 i get ~3MB/s with the latest icc ~6MB/s (autopar enabled by default I assume).

I think you mean GB/s :P

Yepp.
KTE wrote:Here's what I get under Win 7 64b vs. Win XP 32b (K8 X2 at 2.0GHz / DDR2-800 4-3-4-9-16). 256MB gone to IGP, 2GB gone to RAMDisk, default binary.
stream.png

Odd, I expected the results to be bandwidth per core and not the sum over all running threads. I ran single-threaded and thought 3GB/s was the max per core, so two cores (auto-parallelized by icc) could reach 6GB/s.
Will have to figure out how to make an OpenMP version under Linux to compare gcc vs. icc again.
Completely wrong assumption on my side, this one hurts. :?
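
To make the per-core vs. total point concrete, here is a small Python sketch of how STREAM accounts for bandwidth: it counts the bytes moved over the whole arrays (two arrays for Copy and Scale, three for Add and Triad, 8 bytes per double) and divides by the wall-clock time of the loop, however many threads shared the work. The array size and timings below are hypothetical, picked only to mirror the ~3 and ~6 GB/s figures.

```python
BYTES_PER_ELEMENT = 8  # STREAM uses double-precision arrays

# Number of arrays each kernel reads or writes per iteration.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mbs(kernel, n_elements, seconds):
    """Reported rate in MB/s: total bytes moved divided by wall-clock time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * n_elements
    return 1.0e-6 * bytes_moved / seconds


N = 2_000_000  # illustrative array length

# Hypothetical timings: one thread vs. two threads splitting the same arrays.
print(round(stream_rate_mbs("Triad", N, 0.016)))  # 3000
print(round(stream_rate_mbs("Triad", N, 0.008)))  # 6000
```

The second number is the sum over both threads, not a per-core figure, because the byte count stays the same while the wall-clock time halves.
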
KTE wrote:B/W should go up as accessing core number increases, unless B/W reaches a bottleneck and that bottleneck will hamper any high-end server workload from scaling as all the cores fight for it. Istanbul's HT Assist does enable higher STREAM B/W ... actually I've just read AMD Blog that HT Assist is disabled on 2S servers, no wonder they get no benefit.
HT Assist, or the Probe Filter as it is sometimes called, works by using part of the processor's L3 cache as a directory cache. This directory cache tracks all cache lines cached in the system. Instead of generating numerous cache probes when checking a cache line the processor does a Probe Filter Lookup. This helps lower latency for accesses to local DRAM because there is no need to wait for probe responses when accessing local data. This also means there is less queuing delay due to the lower HyperTransport technology traffic. With significantly reduced probe traffic it effectively also increases system bandwidth performance. It also should be noted that the directory cache uses 1MB of the 6MB L3 cache in the case of the Six-Core AMD Opteron processor. As well, HT Assist is only enabled on 4-socket and 8-socket systems, where the performance benefits largely outweigh the small decrease in available L3 data cache. On the other hand, HT Assist is not enabled on 2-socket systems where there is much less cache probe traffic and the full L3 cache is utilized.

We've measured the difference of HT Assist on Six-Core AMD Opteron processors and the results are nothing but stunning. On the same 4-socket system, we measured 42GB/s of memory bandwidth with the STREAM benchmark with HT Assist, while only getting 25.5GB/s when HT Assist is disabled.* For 4-socket and 8-socket Six-Core AMD Opteron processor-based systems, this can translate into a significant performance uplift for applications that depend on cache performance, memory bandwidth, and system scalability.

So on the dual-socket Istanbul setup the cache probes between the two CPUs increase and there is less L3 cache available per core compared to the dual Shanghai setup. It seems 20-21GB/s is the limit for 2 x dual-channel DDR2-800. Magny-Cours will reach this bandwidth with a single-socket solution and DDR3-1066, and a dual-socket system of those MCM chips will hit 42GB/s, or above 50GB/s with DDR3-1333.

On the other hand, those specs say 85.6GB/s DRAM bandwidth for dual G34 systems.
http://www.semiaccurate.com/2009/08/24/ ... ocket-g34/
But on the Intel side, Nehalem has triple-channel DDR3 for four cores but will only have four DDR3 channels for eight cores.

So if IPC is the limit, Magny-Cours might have an advantage due to the higher core count, and if memory is the limit, Intel's current advantage will shrink (on two- and four-socket systems).
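
As a sanity check on the spec-sheet number, here is a rough Python sketch of theoretical peak DRAM bandwidth: channels x data rate x 8 bytes per transfer on a 64-bit channel. The function name is made up, and the channel counts for the 2S systems are assumptions based on what is discussed in this thread.

```python
def peak_bandwidth_gbs(channels, data_rate_mts, bus_bytes=8):
    """Theoretical peak in GB/s for 64-bit (8-byte) memory channels."""
    return channels * data_rate_mts * bus_bytes / 1000.0


# 2S Istanbul/Shanghai: 2 sockets x 2 channels of DDR2-800
print(peak_bandwidth_gbs(4, 800))             # 25.6

# 2S G34: assumed 2 sockets x 4 channels of DDR3-1333
print(round(peak_bandwidth_gbs(8, 1333), 1))  # 85.3
```

So the 85.6GB/s in the specs looks like the plain theoretical peak for eight DDR3-1333 channels (8 x 10.7GB/s after rounding), while the measured 20-21GB/s on the DDR2-800 boxes is roughly 80% of their 25.6GB/s peak.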

Update: Here are my results: 5050e at 2.6GHz, 2GB DDR2-1066 5-5-5-15.
The first and second used a binary built with gcc-4.3.4 and -O2 -fopenmp, the third one was built with gcc-4.3.4 but with no optimizations, and the last one used icc 11.1 with -O2 -openmp.

[Four images: STREAM results for the four builds listed above]
justapost
 
Posts: 129
Joined: Sat Feb 21, 2009 6:39 am

Re: Advanced Clustering

Postby KTE on Sun Aug 30, 2009 10:25 pm

That's like saying STREAM vector desktop B/W is 17GB/s. We get about 60% of it at DDR3-1066. 85GB/s is just AMD's theoretical max with 8 channels and the HT Assist gain added up. AMD's 1P STREAM DDR3-1066 efficiency is around 60%, which is low, and with DDR3-1333 it is even lower, unless you clock the IMC higher, which isn't going to happen on such a huge die on the same process. With 4 channels in 2S, they could only register 20GB/s without HT Assist. With Magny-Cours in 2S, 8 channels, 40GB/s without HT Assist is the max. I'd say they'd get slightly lower due to ccHT latency.

Beckton has 4 channels and SMBs in between to make things very fast. Without the SMBs (buffers), with 6 DDR3-1333 channels in 2S, they can already do 37GB/s. Imagine another 2 channels. Intel has no worries about putting out 170W dies, they already did with Dunnington, so they could very well clock the IMC higher for more B/W gain, since B/W is crucial in this segment.

And yeah, ICC is generally faster for most apps, which is why it gets used most often. STREAM runs single- and multi-threaded workloads if you ask it to. It can really show what kind of access a CPU is optimized for.
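
The numbers in this post can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using figures quoted in this thread; the helper names are made up for illustration, and the doubling from 4 to 8 channels assumes bandwidth scales linearly with channel count.

```python
def sustained_gbs(peak_gbs, efficiency):
    """Sustained bandwidth as a fraction of the theoretical peak."""
    return peak_gbs * efficiency


def scale_channels(measured_gbs, measured_channels, target_channels):
    """Assume bandwidth scales linearly with the number of channels."""
    return measured_gbs * target_channels / measured_channels


def uplift_percent(with_feature_gbs, without_feature_gbs):
    """Relative gain in percent."""
    return (with_feature_gbs / without_feature_gbs - 1.0) * 100.0


print(round(sustained_gbs(17.0, 0.60), 1))     # 10.2 GB/s on the desktop
print(scale_channels(20.0, 4, 8))              # 40.0 GB/s for 8 channels in 2S
print(round(uplift_percent(42.0, 25.5), 1))    # 64.7% from HT Assist on 4S
```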
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am

Re: Advanced Clustering

Postby justapost on Mon Aug 31, 2009 9:26 am

KTE wrote:That's like saying STREAM vector desktop B/W is 17GB/s. We get about 60% of it at DDR3-1066. 85GB/s is just AMD's theoretical max with 8 channels and HT Assist gain added up. AMD's 1P STREAM DDR3-1066 efficiency is around 60%, low, and with DDR3-1333 it is even lower, unless you clock the IMC higher which isn't going to happen on such a huge die with the same process. With 4 channels in 2S, they could only register 20GB/s without HT Assist. With Magny-Cours in 2S, 8 channels, 40GB/s without HT Assist is max. I'd say they'd get slightly lower due to ccHT latency.

I'd expect Magny-Cours 2S to behave like Istanbul 4S but with DDR3 memory, so the STREAM B/W limit will be at ~50GB/s. As far as I understand Magny-Cours, that chip has two dual-channel memory controllers, so it's four and not eight channels per chip.
KTE wrote:Beckton has 4 channels and SMBs in between to make things very fast. Without the SMBs (buffers), 6 DDR3-1333 channels in 2S, they can already do 37GB/s. Imagine another 2 channels. Intel has no worry about putting out 170W dies, they already did with Dunnington, so they could very well clock the IMC higher for more B/W gain since B/W is crucial in this segment.

The only info about Beckton I have is in this article.
http://www.semiaccurate.com/2009/08/25/ ... s-and-all/
At the beginning they say the chip has four memory channels; at the end it's eight. :?
Like Magny-Cours, the chip has two memory controllers, each of which has two of those SMBs and two channels.
To me a channel is a 64-bit-wide interface to the DIMMs. If 37GB/s was the limit with three channels, two channels will have a limit at ~25GB/s. So two of those controllers with two channels each should end up at ~50GB/s, like Magny-Cours, which will have 50% more cores that are 33% less efficient.
All of the above assumes the bandwidth efficiency will not change.
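
Spelled out as a quick Python sketch, taking the numbers in this post at face value (37GB/s read as a three-channel figure, linear scaling with channel count, and "50% more, 33% less efficient cores" read as a simple product); the function name is made up for illustration:

```python
def scale_channels(measured_gbs, measured_channels, target_channels):
    """Assume bandwidth scales linearly with the number of channels."""
    return measured_gbs * target_channels / measured_channels


per_controller = scale_channels(37.0, 3, 2)  # two channels per controller
per_chip = 2 * per_controller                # two controllers per chip

print(round(per_controller, 1))  # 24.7
print(round(per_chip, 1))        # 49.3

# 50% more cores, each 33% less efficient: about the same total throughput.
relative_throughput = 1.5 * (1.0 - 0.33)
print(f"{relative_throughput:.2f}")
```
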
KTE wrote:And yeah, ICC is generally faster for most apps hence why they get used most often. STREAM runs single and multi-threaded workloads if you ask it to. It can really show what access a CPU is optimized for.

The ICC results are roughly equal to the Sandra memory bandwidth test; I assume it's the same algorithm here.
justapost
 
Posts: 129
Joined: Sat Feb 21, 2009 6:39 am

Re: Advanced Clustering

Postby KTE on Mon Aug 31, 2009 1:41 pm

justapost wrote:
KTE wrote:That's like saying STREAM vector desktop B/W is 17GB/s. We get about 60% of it at DDR3-1066. 85GB/s is just AMD's theoretical max with 8 channels and HT Assist gain added up. AMD's 1P STREAM DDR3-1066 efficiency is around 60%, low, and with DDR3-1333 it is even lower, unless you clock the IMC higher which isn't going to happen on such a huge die with the same process. With 4 channels in 2S, they could only register 20GB/s without HT Assist. With Magny-Cours in 2S, 8 channels, 40GB/s without HT Assist is max. I'd say they'd get slightly lower due to ccHT latency.

I'd expect Magny-Cours 2S to behave like Istanbul 4S but with DDR3 memory, so the STREAM B/W limit will be at ~50GB/s. As far as I understand Magny-Cours, that chip has two dual-channel memory controllers, so it's four and not eight channels per chip.

I mentioned 2S, it is 8 channels. You don't have 1S with high-end.
KTE wrote:Beckton has 4 channels and SMBs in between to make things very fast. Without the SMBs (buffers), 6 DDR3-1333 channels in 2S, they can already do 37GB/s. Imagine another 2 channels. Intel has no worry about putting out 170W dies, they already did with Dunnington, so they could very well clock the IMC higher for more B/W gain since B/W is crucial in this segment.

The only info about Beckton I have is in this article.
http://www.semiaccurate.com/2009/08/25/ ... s-and-all/
At the beginning they say the chip has four memory channels; at the end it's eight. :?
Like Magny-Cours, the chip has two memory controllers, each of which has two of those SMBs and two channels.

Search for Nehalem EX, there is plenty of info around. Charlie also explained it right there: when he says 4 channels he is talking about 1 chip; when he says 8 channels he means 2S, if you read the rest of it. Beckton has been known for months to be a quad-memory-channel chip, just like Magny-Cours, so 8 channels in 2S, etc. The above STREAM benchmarks are 2S setups, which is why it's 4 channels for Istanbul and 6 channels for Gainestown.
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am
