[RWT] Intel's Sandy Bridge Microarchitecture

Moderator: M_S

[RWT] Intel's Sandy Bridge Microarchitecture

Postby KTE on Mon Sep 27, 2010 7:55 am

D. Kanter put out this article about a week back: http://www.realworldtech.com/page.cfm?A ... 191937&p=1

Haven't read it yet but I'm hoping for an interesting read.
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby M_S on Mon Sep 27, 2010 8:09 am

You beat me to the punch :-)
M_S
 
Posts: 1456
Joined: Mon Nov 03, 2008 12:25 pm

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby DavidC1 on Mon Sep 27, 2010 11:33 am

Sandy Bridge is looking more and more impressive every day. The changes brought to that chip are more significant than the ones Core Duo received to become Core 2.

The only question mark now is the graphics. I have high hopes for this one.
DavidC1
 
Posts: 58
Joined: Wed Feb 03, 2010 7:08 am

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby M_S on Mon Sep 27, 2010 11:42 am

I can't wait to get my hands on one of these CPUs... but then, I can't wait for Bulldozer and Bobcat either. Next year is going to be very interesting.
M_S
 
Posts: 1456
Joined: Mon Nov 03, 2008 12:25 pm

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby KTE on Sun Oct 31, 2010 5:51 pm

I finished reading about 3 weeks back... there were too many hyperboles resting on unknowns (nothing but Intel's private word, and a dependence on the assumption that they get all the implementations working well) that kept throwing me off at the beginning of the article. Not the sort of piece I expected, but where it was coming from made more sense after reading the whole article.

The first obvious thing is how much more information David has, from Intel, about Intel's new arch, its reasoning and expectations. In contrast, there is very little on Bulldozer and almost nothing about expectations and reasoning from AMD, except a few macro-level tidbits that have been public for a long time. That means we can only really be sure about one side.

The second thing I had to say when I finished was: wow. They are really going in the direction I think is best, on many fronts. From a high level, SNB initially looked like nothing more than NHM with a few changes; I wasn't interested except for the GPU interface. The performance previews showed 10-20% CPU gain at best, which seemed to back up the high-level CPU diagrams. However, that is not what is going on beneath: nearly everything has been re-arranged, overhauled, beefed up or replaced for one thing, power efficiency. Every other major performance/power factor has been updated too. You keep thinking Intel might slip into apathy with such a big lead - which I'd hate - but it clearly isn't so. They're still pushing hard in every feasible direction, as they should be.

From early 2005 (at the very least), Intel started running studies focused on maximizing the balance between extractable IPC and power, since each segment is limited to a different power band, with 130-150W being the absolute peak. They obviously already knew that everything in design is a tradeoff, but the performance-efficiency curve has a middle peak where the best IPC, power and frequency domains can be exploited. So they realized that frequency can be pushed semi-high in a "few-core" design if power is controlled within these limits, and chip complexity can also increase for higher extractable IPC on average workloads, IF they keep improving the Energy per Instruction (or Instructions per Joule) and the Instructions per Second executed and retired with each generation, within economically feasible die sizes, because that is what determines end leadership: performance and power efficiency. For instance, they found Cedar Mill to have many times the EPI of the i486, with high absolute power and frequency. In contrast, they found the PIII and P6 cores to be very close to the i486 in EPI while being, in absolute terms, far faster in workloads, at mediocre frequencies and low absolute power (I sketch the rough EPI arithmetic further down, after the notes).

So Tejas - which would ship at 6GHz with ease on today's 32nm process - was scrapped due to its high EPI and the power required for its target frequencies, and the P6-based mobile designs started receiving the attention, for gradual improvements while keeping the low-EPI trend going. From Penryn onwards, I could see that most of the designs were P6 enhanced with the best 90's additions: P4, POWER and Alpha features and techniques slowly being added one bit at a time, as they deem best. With SNB, that cycle basically comes to completion, and now they're surpassing other architectures with a level of integration and complexity that has not yet been attempted on some fronts in the client/server market. They are piecing together the best of many approaches around and focusing, implicitly, on exactly the factors IBM focused on to gain huge success with POWER7 in its $13 billion UNIX segment: high per-core performance, high per-chip performance with SMT, high per-system performance with scalability from the interconnects, memory subsystem and cache, semi-high frequency, high throughput, high bandwidth, and on-chip, on-demand power and frequency monitoring and control. On paper, it looks like a job very well done. Some notes follow (I'm keeping them as short as I can):

- POWER has TPMD cards that are far more granular, complex and powerful, as part of Intelligent Energy, to monitor and control the clocks, cache, IO, voltages and power of a chip, module and system through the AEM software, based on user demand. You could have ~109% of the default frequency if required, if your task needed fewer cores/threads (with up to 256-core systems) or more cache per core, for instance. Part of that is TurboCore and MaxTurbo - concepts that Intel built upon and enhanced for a more client-suited purpose in NHM/SNB. SNB sees even more fine-grained improvement on this concept, which is very good for consumers wanting to maximize performance and power efficiency. However, Intel doesn't by default allow you to remove the pre-set TDC, TDP and clock caps on mainstream CPUs (I mean, re-configure them for your application), which I'm hoping will change, or at least that the default boundaries will be extended for higher possible maximum frequencies during the 25 second "burst" periods, based on the thermal profile.

- Clock and voltage domain switching latency can hurt performance, as we've seen before with multi-threaded CPUs under OS schedulers that persistently spread the load around. NHM and F10h power efficiency greatly underperformed their design intent due to these factors; tens of instructions are wasted just on these latencies. The schedulers kill power efficiency by juggling the workload across every core, and that cross-migration, plus the clock/voltage locking from/to sleep, imposes enough latency to impact the performance of lightly threaded workloads even more substantially. I wonder how the 25 second frequency boost will circumvent these issues to improve realized performance on top of the normal Turbo Boost.

- The Uncore containing the L3$, IMC, PLLs and PCU on NHM wasn't clock or power gated in any case, nor could the power and thermal characteristics of these segments be measured and controlled by the PCU, according to all available data, so they weren't taken into account for any chip-level power and performance management. Hence the PCU in NHM was a rudimentary implementation, and the TDP/TDC and thermal values it reported were only applicable to the 12V-supplied Core section it was monitoring. An artifact of this was the misinformed hype about very low-power sleep modes due to power gating and sleep transistors, which was applicable only at the Core level and false at the MCM level. At the Core level, the sleep power was very low, but even then only about on par with what AMD's total IMC + PLLs + L3 + HT + cores combined were drawing, without any such deep-sleep power gating or HKMG (which was odd). Now that the System Agent (I really don't like these constant changes in terminology with every other new release between MFGs) houses the PCU, I wonder whether any of the ring bus, L3, PLLs and IMC are actually thermally and power monitored and gated?

- What I don't yet know is: if the L3 is clocked at core frequency, what happens (in terms of frequency) when the Core section is put into C6? Or when 3 of 4 cores are set to C6, so the one awake and executing is running at the pre-set Turbo Boost frequencies or the additional "25s" frequency boost... does Turbo Boost also increase the L3 clocks?

- Initially I was impressed by the very low L3$ latencies being thrown around bereft of context, but they don't look improved at all now. Each core only gets a 2MB slice, akin to POWER7 where each core gets a 4MB local L3 region, and both have 256KB of L2 per core. The L3 clock is about 1.6x higher on mainstream SNB, so in terms of load-to-use time from the 2MB slice it's lower, due to the different approach with its inherent tradeoffs.

- With one core executing, or two, does the L3 re-partition itself dynamically between them (a la POWER7)?

- The new display and media engine should really boost uptake in multimedia, mainstream and low-end setups.

- Despite many parts being beefed up, including the instruction window, most of the uarch still seems limited to 4 uops at many stages.

- A slight slip where he states the scheduler has nearly twice the capacity in SNB, when it is about 1.5x bigger.

- There seem to be some really nifty 256b workarounds at the memory and execution stages.

- Intel's GPU driver, which now also manages L3 usage... enough said, I think. They can really bork things up through this. GPU drivers have never been their strong suit.

- Shared L3 with the graphics? The GPU cores can make use of any of the 4x 2MB slices, right? It'll be interesting to see how this works out, as the bandwidth that graphics cores love is there, but cache pollution and capacity are problematic compared to even low-end discrete cards, which have a dedicated 12-21GB/s to 512MB-1GB of onboard memory available solely to the GPU cores, without any sharing - especially in the latest multi-threaded games using a lot of AI and physics, which require heavy CPU processing. Take a low-end HD 5450, cut that local dedicated memory down from 512-1024MB, and watch it stutter and fall to a slide show.

- Discrete GPU unit shipments are going down, and 70% or so of all shipments have been shown to be sub-$100-150... It's true that SNB, if it delivers on the integrated EGP (Embedded Graphics Processor) and media processing front, can kill that 40-50% low-end AMD/Nvidia market completely if AMD/NV cannot outclass it immediately, since the graphics and media processing will ship intimately with every processor at no additional cost/space; i.e. after 18 months we could be looking at a graphics market of: Intel 40%, AMD 31%, Nvidia 29%. Intel has the power to push their platform pervasively, everywhere.

- Modern MPU performance is critically limited by, and hence dictated by, cache miss/hit rates, speculation, branch prediction and Ld-St performance. Memory subsystem latencies also play a major, critical part. There are major Ld-St, cache, branch prediction and memory subsystem improvements in SNB overall, and half the ops being Ld-St, at roughly a 2:1 load-to-store ratio, is very much the norm. Combined, I expect re-compiled software to gain more performance than the early previews portrayed, but I don't think the full potential of these "upgrades" is going to be realized with SNB. I see them as laying the groundwork and expanding the Core for future generations, where these updates will really be able to crank out increased performance and efficiency once some more complementary changes arrive. Intel tends to work like that, building future-proofing into its designs.

- When I heard about Atom and SNB, I basically thought: what Cyrix did with the MediaGX and SiS did with the SiS55x more than a decade back is now the focal race in this industry - albeit with block-level modularity and flexibility in mind, so one can scale up/down as required to fit the various market segments.

- The philosophies have become vastly different at Intel and AMD, each primarily focusing on different aspects. Barcelona and SNB both have a massive list of changes, but they only result in mediocre performance increases at the core level. A decade back, a few changes netted massive general application performance gains; nowadays, a great list of changes only nets low to mediocre general performance gains, with the updates instead primarily boosting performance in niche applications. I suppose the '2x chip logic complexity equates to 1.4x IPC gain' rule still seems to be sailing smoothly (a rough check is sketched after these notes), which in turn doesn't make me fond of new processors where most of the additional transistors go to cache, control and buffers.

- AMD held its highest market share in 2006, back when it made $7.5 billion in revenue, which it has never neared since. My take has always been: in any game, if you're in the lead, your competitor can only out-class you where and when you falter. So in many ways, your competitor's success is a reflection of your failings, and vice versa. AMD did that with Opteron/K8 when Intel made a series of major uncalled-for blunders. Ever since 2006, though, Intel hasn't made a slip, let alone one at that scale, and has rigorously increased innovation and performance leadership, creating more markets for their processors. SNB strongly continues and builds on that trend, which leaves very little room for AMD to infiltrate or gain market momentum. Due to the now vastly different processor topologies, instruction set optimizations and latencies between SNB and BD, software will need to be changed more radically to perform optimally on both architectures. With all of this in view, the likely big loser in all this is AMD's BD, as everyone will recompile and optimize for SNB, but the same can't be said for the latter. Software has now become even more critical to how these processors stand.
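
To put rough numbers on the EPI point from earlier - all figures below are made-up placeholders, purely to illustrate the arithmetic, not measured data:

Code:
# Rough EPI (energy per instruction) arithmetic, illustrating the idea only.
# All figures below are hypothetical placeholders, not measured data.

def epi_joules(avg_power_watts, ipc, freq_ghz):
    """EPI = average power / instruction throughput (instructions per second)."""
    instructions_per_second = ipc * freq_ghz * 1e9
    return avg_power_watts / instructions_per_second

# Hypothetical chips, showing how a deep, high-frequency design can burn far
# more energy per instruction than a modest-frequency, higher-IPC one.
deep_pipeline = epi_joules(avg_power_watts=115, ipc=0.6, freq_ghz=3.6)  # "Cedar Mill-like"
wide_moderate = epi_joules(avg_power_watts=30, ipc=1.0, freq_ghz=1.6)   # "P6/PIII-like"

print(f"deep pipeline : {deep_pipeline * 1e9:.2f} nJ/instruction")
print(f"wide moderate : {wide_moderate * 1e9:.2f} nJ/instruction")
# The deep-pipeline example comes out several times higher in EPI even though
# it clocks far higher, which is the tradeoff described above.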
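
And a quick back-of-the-envelope on the '2x logic for ~1.4x' rule of thumb - again purely illustrative, no real transistor counts involved:

Code:
# Rule-of-thumb check: if single-thread performance scales roughly with the
# square root of logic complexity, 2x the logic buys ~1.4x the performance.
# The ratios below are illustrative, not real transistor counts.
import math

def expected_gain(logic_ratio):
    return math.sqrt(logic_ratio)

for ratio in (1.2, 1.5, 2.0):
    print(f"{ratio:.1f}x logic -> ~{expected_gain(ratio):.2f}x single-thread performance")
# Modest per-generation logic growth in the core therefore implies the modest
# per-clock gains we keep seeing, with the rest of the transistors going to
# cache, buffers and control.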

- I sat down to write this more than 2 weeks back but never got back to it until now, after returning from hospital, to actually finish it. I wanted to read 12-15 papers before posting but didn't get to read any, and all the points I had in mind when I finished reading are now lost. So, yep :/
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby M_S on Mon Nov 01, 2010 8:55 am

Wow... quite a post! 8-)

You bring up a few very interesting points, one being the lack of slips at Intel in the last 5 years, which is largely true, though there were a few exceptions like FBDIMM. This gets me back to Craig Barrett's keynote at IDF (2002???) about bringing diversification to Intel and not letting a single technology or idea run the entire company, as had been done in the past. Needless to say, in a situation like that there will be lemons - by definition - but they will be in the noise compared to the overall technology progress, which is achieved not only "before the fact", that is, at the concept level, but is actually validated "after the fact", that is, in real-world scenarios. I am not sure how much of that was to the liking of Paul Otellini - after all, he is a bean counter, at least by training - but in the end, Intel has come out with a bunch of great achievements in pretty much every field they have been playing in.

Arguably, there were a few "dinosaurs" - for lack of a better word - thrown into the mix; in a way, Nehalem and even Lynnfield fall into that category, purely with respect to their power efficiency, but they were important milestones for performance, which still defines the image of a company like nothing else.

The other interesting question is regarding the L3 cache. SRAMs are power hogs - there is no other way to describe it - but there is no substitute. The problems start when you power-gate the system or clock-throttle some of the cores. Keep in mind that the uncore typically runs at a defined ratio to the memory bus, and that's where it gets really ugly, because you have long traces and you need to constantly recalibrate flight times and adjust for latencies.

In other words, DRAM in its most general definition is extremely intolerant of frequency changes, on the fly and in general. Things may have changed a bit over the past few years, but when I was still more intimately involved in some of the design processes, there was a range of frequencies that could be achieved; above and below that, you were prone to errors. Add the fact that certain latencies like CAS-DL have to be calculated at startup using ns delays that are then converted into a number of cycles, which is then programmed into the mode register settings. I have yet to see a DRAM that would reliably allow changing the CAS-DL on the fly or, if you lower the frequency, would not lose the data in the various pipeline stages because of leakage or field effects. Keep in mind that DRAM is "commodity in its extreme form"; it is not that these things may not be technically possible, but rather that at the current price point nobody would even bother.
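
To make that ns-to-cycles step concrete, here is a minimal sketch of the idea - example DDR3 timings, not any vendor's actual init code:

Code:
# Converting an analog timing spec (in ns) into the cycle count that gets
# programmed into the DRAM mode registers at startup. Minimal sketch only;
# real memory-controller init code handles many more timings and corner cases.
import math

def ns_to_cycles(t_ns, clock_mhz):
    """Round a nanosecond delay up to a whole number of clock cycles."""
    t_ck_ns = 1000.0 / clock_mhz          # one clock period in ns
    return math.ceil(t_ns / t_ck_ns)

# Example: a part rated tAA (CAS access time) = 13.125 ns
for clock_mhz in (533, 667, 800):          # nominal DDR3-1066/1333/1600 I/O clocks
    cl = ns_to_cycles(13.125, clock_mhz)
    print(f"{clock_mhz} MHz clock -> program CL = {cl}")
# Change the clock on the fly and the programmed CL no longer matches the
# analog delay, which is why DRAM frequency changes aren't done casually.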

This ties back into the uncore, which is timing-dependent on the DRAM, so you can't do very much with respect to changing frequencies. In this context, I assume that running at "core speed" refers to a non-turbo frequency as a "benchmark" setting, and that the actual CPU cores can go above and below with relative ease, because you only need some registers to buffer the data in between rather than dealing with traces on a motherboard.

The question now is about sleep. The L3 doubles as a snoop filter, meaning it has to stay awake as long as any of the cores is active, and it has to run at the default frequency, at least in the portion containing the critical data for cache coherency. On the other hand, Intel has done a lot of work on selective clock gating of parts of the array, starting with Banias, so realistically you don't need frequency changes if you can suspend the clocks to, say, 60-80% of the SRAM. It only takes a few cycles to come back up after asserting CKE, which is totally negligible compared to even the monitor coming back to life. So it is all about trade-offs, and it can be programmed to best match one or the other scenario. At least that's my somewhat uneducated bird's eye view :-)
M_S
 
Posts: 1456
Joined: Mon Nov 03, 2008 12:25 pm

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby KTE on Mon Nov 01, 2010 7:07 pm

M_S wrote:You bring up a few very interesting points, one being the lack of slips at Intel in the last 5 years, which is largely true, though there were a few exceptions like FBDIMM. This gets me back to Craig Barrett's keynote at IDF (2002???) about bringing diversification to Intel and not letting a single technology or idea run the entire company, as had been done in the past. Needless to say, in a situation like that there will be lemons - by definition - but they will be in the noise compared to the overall technology progress, which is achieved not only "before the fact", that is, at the concept level, but is actually validated "after the fact", that is, in real-world scenarios.

Yes, they made many errors, but as you understood, I basically meant blunders that are show-stoppers in terms of CPU frequency/power/performance progress, where the competition can take advantage and gain the performance/power/frequency lead.

M_S wrote:The other interesting question is regarding the L3 cache. SRAMs are power hogs - there is no other way to describe it - but there is no substitute. The problems start when you power-gate the system or clock-throttle some of the cores. Keep in mind that the uncore typically runs at a defined ratio to the memory bus, and that's where it gets really ugly, because you have long traces and you need to constantly recalibrate flight times and adjust for latencies.

In other words, DRAM in its most general definition is extremely intolerant of frequency changes, on the fly and in general. Things may have changed a bit over the past few years, but when I was still more intimately involved in some of the design processes, there was a range of frequencies that could be achieved; above and below that, you were prone to errors. Add the fact that certain latencies like CAS-DL have to be calculated at startup using ns delays that are then converted into a number of cycles, which is then programmed into the mode register settings. I have yet to see a DRAM that would reliably allow changing the CAS-DL on the fly or, if you lower the frequency, would not lose the data in the various pipeline stages because of leakage or field effects. Keep in mind that DRAM is "commodity in its extreme form"; it is not that these things may not be technically possible, but rather that at the current price point nobody would even bother.

That's interesting. Similar principles actually still apply to the CPU logic too. If you leave the frequency/voltage window large, it means a lot of leakage power, and there's no way around that. Ironically, that means the tighter they keep the range of voltage/frequency changes, the better they can optimize for power in that band and control leakage. That's why Turbo Boost and similar tech doesn't scale chip power as the traditional formula predicts; there's extra leakage power to account for. Then you have the power/clock differences caused by static and domino logic, for the SRAM vs. the rest.
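
Roughly what I mean, with the "traditional formula" being dynamic power ~ C*V^2*f - all coefficients below are arbitrary placeholders, not silicon data:

Code:
# Toy model: dynamic power follows C*V^2*f, while leakage depends strongly on
# voltage (temperature effects ignored here). A turbo step that raises V and f
# therefore costs more than scaling the dynamic term alone would suggest.
# All constants are arbitrary placeholders, not measurements.
import math

def dynamic_power(c_eff, volts, freq_ghz):
    return c_eff * volts**2 * freq_ghz

def leakage_power(volts, i0=5.0, v_ref=1.0, v_slope=0.12):
    # Crude exponential voltage dependence for the leakage current.
    return volts * i0 * math.exp((volts - v_ref) / v_slope)

base  = dynamic_power(10.0, 1.00, 2.8) + leakage_power(1.00)
turbo = dynamic_power(10.0, 1.15, 3.4) + leakage_power(1.15)
dyn_only_ratio = dynamic_power(10.0, 1.15, 3.4) / dynamic_power(10.0, 1.00, 2.8)

print(f"total power ratio  : {turbo / base:.2f}x")
print(f"C*V^2*f ratio only : {dyn_only_ratio:.2f}x")
# The total grows faster than the C*V^2*f ratio because the leakage term
# blows up at the higher turbo voltage.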

M_S wrote:At least that's my somewhat uneducated bird's eye view :-)

I'll take your "somewhat uneducated bird's eye view" over my armchair caveman view any day... :D
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby DavidC1 on Tue Nov 02, 2010 11:07 pm

Great post by KTE. Mainstream Nehalem-class CPUs (Clarksfield/Lynnfield) have Uncore power gating.

KTE wrote:With one core executing, or two, does the L3 re-partition itself dynamically between them (a la POWER7)?


Isn't that the point of the Ring Bus? From David Kanter's article, the latency is 26-31 cycles, a range because farther cache blocks take extra cycles to access. For POWER7, IBM quotes two numbers for L3 latency, one for the dedicated block and another for the whole cache, but I did ask Charlie Demerjian about this, and he says that on Sandy Bridge the latency numbers are for accessing the whole cache in general.

KTE wrote:I suppose the '2x chip logic complexity equates to 1.4x IPC gain' rule still seems to be sailing smoothly, which in turn doesn't make me fond of new processors where most of the additional transistors go to cache, control and buffers.


I did some die size estimations, and the Sandy Bridge cores aren't much bigger, if at all, than Westmere's. I also think you are being pessimistic about the performance gains, as even Core 2's gains were only 15-20% over Core Duo.
DavidC1
 
Posts: 58
Joined: Wed Feb 03, 2010 7:08 am

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby M_S on Wed Nov 03, 2010 8:02 am

DavidC1 wrote:I did some die size estimations, and the Sandy Bridge cores aren't much bigger, if at all, than Westmere's. I also think you are being pessimistic about the performance gains, as even Core 2's gains were only 15-20% over Core Duo.


I think this refers more to the "Katmai New Instructions" and "3DNow" / "SSE.x" era than to the more recent advances. On the other hand, I agree that even a 10% increase nowadays at the system level is huge, but the days of doubling performance from one design to the next are past... or maybe not?

DavidC1 wrote:Isn't that the point of the Ring Bus? From David Kanter's article, the latency is 26-31 cycles, a range because farther cache blocks take extra cycles to access. For POWER7, IBM quotes two numbers for L3 latency, one for the dedicated block and another for the whole cache, but I did ask Charlie Demerjian about this, and he says that on Sandy Bridge the latency numbers are for accessing the whole cache in general.


Thanks for mentioning this. I am always scratching my head when I look at discussions over a 1 or 2 cycle difference in cache access latency, especially when it goes hand in hand with an increase in cache size. Most of these arguments are of the same quality as saying that an HDD with a longer stroke is slower than one with a shorter stroke, when in reality comparable transfers are identical and you only add an extension (in this case at the ID) that by definition is slower. Then HDTach and the "average" sequential transfer number are used as a basis to marvel at what the hell they did to make the thing slower. :roll:

With larger caches and increased block sizes, this kind of consideration is becoming increasingly important: there are near blocks (preferred) and there is the entire cache, and the quoted latency is an average over whatever (non-standardized) test pattern is marching across the array at whatever stride length.
M_S
 
Posts: 1456
Joined: Mon Nov 03, 2008 12:25 pm

Re: [RWT] Intel's Sandy Bridge Microarchitecture

Postby KTE on Wed Nov 03, 2010 2:00 pm

DavidC1 wrote:Mainstream Nehalem-class CPUs (Clarksfield/Lynnfield) have Uncore power gating.

Really? What exactly in the Uncore is power gated then, and when?
I have seen no information on this (forum rumors aside). The Uncore is fed by a few different voltage supplies, in contrast to the Core.

DavidC1 wrote:Isn't that the point of the Ring Bus? From David Kanter's article, the latency is 26-31 cycles, a range because farther cache blocks take extra cycles to access. For POWER7, IBM quotes two numbers for L3 latency, one for the dedicated block and another for the whole cache, but I did ask Charlie Demerjian about this, and he says that on Sandy Bridge the latency numbers are for accessing the whole cache in general.

I don't quite trust IBM's cache access PR figures, based on historical trends. There are many ways to make them appear much better than realistic code-load latencies. That also goes for STREAM figures and process Lg and Idsat PR figures.

In the article, D. Kanter doesn't mention the whole cache being accessible to any core - instead, dedicated, partitioned slices - so I'm unsure about those latencies. Yonah introduced Smart Cache, which was kept in successive generations and allows any core to use the whole cache when another doesn't require it. I haven't seen any mention of this for SNB.

The L3$ runs at core clock now. AFAIK that clock is generated by the PCH, like with AMD CPUs (or whatever they call their IO chip now). The L3 cycle figures quoted around usually differ based on the test access pattern, size and locality. 26-31 cycles would be excellent if that's full cache access and not just a dedicated slice.
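
Quick conversion of those cycle counts into time, since a cycle count only means something relative to the clock it's counted against - the clock values below are assumed round numbers, not official specs:

Code:
# Convert L3 load-to-use latency from cycles to nanoseconds. The clock values
# are assumed round numbers for illustration, not official specifications.

def load_to_use_ns(cycles, clock_ghz):
    return cycles / clock_ghz

for cycles in (26, 31):
    print(f"{cycles} cycles @ 3.4 GHz -> {load_to_use_ns(cycles, 3.4):.1f} ns")

# The same 26 cycles counted against a cache clock ~1.6x slower (~2.1 GHz)
# would be ~12.4 ns, which is why raw cycle counts can't be compared across
# designs with different L3 clocks without this conversion.
print(f"26 cycles @ 2.1 GHz -> {load_to_use_ns(26, 2.1):.1f} ns")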

I did some die size estimations and the Sandy Bridge cores aren't much bigger, if at all than Westmere. I also think you are being pessimistic on the performance gains as even the Core 2 gains were only 15-20% over Core Duo.

Indeed, core size isn't much different but chip complexity is conventionally measured by logic transistor count.

Do we have SNB Core transistor count details?

I don't believe 10-20% is all we'll see in gains. Remember, this is just clock for clock; customers don't see that, and what customers get is really what matters in the end. Turbo Boost will likely add another 5-15% depending on the CPU and code, so to the customer it'll look like a 15-35% increase. But these are mass-market desktop/server CPUs I'm talking about. Yonah was an unknown entity in these segments before Conroe, hence mass consumers experienced the NetBurst > Conroe transition rather than Core Duo > Core 2 Duo. And quite frankly, that gain was massive, at equivalent prices, with vastly lower power at far lower frequencies. Win, win, win.
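
The rough arithmetic there - the two gains multiply rather than add (same guessed percentages as above, nothing measured):

Code:
# Combining a per-clock (IPC) gain with a Turbo Boost clock gain: the factors
# multiply. The percentage ranges are the guesses from the post, not data.
for ipc_gain, clock_gain in [(0.10, 0.05), (0.20, 0.15)]:
    total = (1 + ipc_gain) * (1 + clock_gain) - 1
    print(f"{ipc_gain:.0%} IPC x {clock_gain:.0%} clock -> ~{total:.0%} overall")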

Per clock, even the Dothan > Yonah transition was 10-60% faster at about 15% lower power. Performance wise, NHM > SNB looks far more comparable to this than NetBurst > Conroe. I think the bigger focus with SNB is integration, modularity, flexibility and power efficiency.
KTE
 
Posts: 1011
Joined: Tue Nov 04, 2008 2:33 am
