by KTE on Sun Oct 31, 2010 5:51 pm
I finished reading about 3 weeks back... there were too many hyperboles built on unknowns (beyond Intel's private word, and depending on the expectation that they get all the implementations working well) that kept throwing me off at the beginning of the article. Not the sort I expected, but after reading the whole article it sort of made sense where it was coming from.
First thing that's obvious is how massively more information David has on Intel's new arch, its reasonings and expectations, from Intel. In contrast, there is very little on Bulldozer and almost nothing on expectations and reasonings from AMD except a few macro level tidbits that have been public for a long time. That lets us be sure of only one side.
Second thing I had to say on finishing is: wow. They are really going in the direction I think is best, on many fronts. From a high level, SNB initially looked like nothing more than NHM with a few changes. I wasn't interested except for the GPU interface. The performance preview showed 10-20% CPU gain at best, which seemed to back up the high level CPU diagrams. However, that is totally not what is going on beneath, as nearly everything has been re-arranged, overhauled, beefed up or replaced for one thing: power efficiency. Every other major performance/power factor has been updated. You keep thinking Intel may slip into apathy from such a big lead - which I'd hate - but it clearly isn't so. They're still pushing on fast in every direction feasible, as they should be doing.
From early in 2005 (at the very least) Intel started running studies on maximizing the balance between extractable IPC and power, since each segment was limited to a different power band with 130-150W being the absolute peak. They obviously already knew that everything in design is a tradeoff, but the performance efficiency curve has a middle peak that exploits the best of the IPC, power and frequency domains. So they realized that frequencies can be pushed semi-high in a "few-core" design if power is controlled within these limits, and that chip complexity can also increase for higher extractable IPC on average workloads, IF each gen keeps improving the Energy per Instruction (or Instructions per Joule) and the Instructions per Second executed and retired within economically feasible die sizes, because that is what will determine end leadership: performance and power efficiency. For instance, they found Cedar Mill to have many times higher EPI than the i486, along with high absolute power and frequency. In contrast, they found the PIII and P6 cores to be very close to the EPI of the i486 while being, in absolute terms, far faster in workloads, at mediocre frequencies and low absolute power too. So Tejas - which would be shipping at 6GHz with ease on today's 32nm process - was scrapped due to high EPI and too high power for the required frequencies, and the P6 based mobile designs started receiving the attention for gradual improvements while keeping the low EPI trend going. From Penryn, I could see that most of the designs were P6 being enhanced with the best 90's additions: P4, Power and Alpha features and techniques slowly being added one bit at a time as they deem best. With SNB, that cycle basically comes to completion, and now they're surpassing other architectures with a level of integration and complexity which has not yet been done on some fronts in the Client/Server market.
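To make the EPI metric above concrete, here is a minimal sketch (mine, not from the article or from Intel's studies). It assumes average power is simply spread over retired instructions, so EPI = power / (IPC x frequency), and Instructions per Joule is the reciprocal. The function names and the two example chips are made up for illustration; the numbers are not measurements of any real CPU.

```python
def energy_per_instruction(power_watts, ipc, freq_hz):
    """Joules spent per retired instruction (EPI).

    Assumes average power divided by instructions retired per second,
    where instructions per second = IPC * clock frequency.
    """
    instructions_per_second = ipc * freq_hz
    return power_watts / instructions_per_second


def instructions_per_joule(power_watts, ipc, freq_hz):
    """Reciprocal of EPI: instructions retired per joule."""
    return 1.0 / energy_per_instruction(power_watts, ipc, freq_hz)


# Hypothetical designs, NOT real measurements:
# a high-clock, high-power part vs. a wider, slower, cooler part.
designs = {
    "high-clock": {"power_watts": 100.0, "ipc": 1.0, "freq_hz": 4.0e9},
    "wide-slow": {"power_watts": 30.0, "ipc": 1.5, "freq_hz": 2.0e9},
}

for name, d in designs.items():
    epi_nj = energy_per_instruction(**d) * 1e9  # joules -> nanojoules
    ips = d["ipc"] * d["freq_hz"]
    print(f"{name}: {epi_nj:.2f} nJ/instruction, {ips:.2e} instructions/s")
```

With made-up numbers like these, the slower part retires fewer instructions per second but spends far less energy on each one, which is the kind of tradeoff the EPI comparison is meant to expose.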
They are trying to piece together the best of many approaches around and focusing implicitly on the exact factors POWER focused on to gain huge success with POWER7 in its $13 billion UNIX segment: high per core performance, high per chip performance with SMT, high per system performance with scalability due to interconnects, memory subsystem and cache, semi-high frequency, high throughput, high bandwidth, on-chip on-demand power and frequency monitoring and control. On paper, it looks like a job very well done. Some notes follow (I'm leaving them as short as I can):
- POWER has TPDM cards that are far more granular, complex and powerful as part of Intelligent Energy to monitor and control clocks, cache, IO, voltages and power of a chip, module and system through the AEM software based on user demand. You could have ~109% of the default frequency if your task needed fewer cores/threads (with up to 256 core systems) or more cache per core, for instance. Part of that is TurboCore and MaxTurbo - which are concepts that Intel built upon and enhanced for a more Client suited purpose in NHM/SNB. SNB sees even more fine grained improvement on this concept, which is very good for consumers to maximize performance and power efficiency. However, Intel doesn't by default allow you to remove pre-set TDC, TDP and clock caps on mainstream CPUs (I mean, re-configure for your application) which I'm hoping will change, or at least that the default boundaries be extended for higher possible maximum frequencies during the 25 second "burst" periods, based on the thermal profile.
- Clock and voltage domain switching latency can break performance, as we've seen previously with multi-threaded CPUs on OS schedulers that persistently spread the load around. NHM and F10h power efficiency in practice fell well short of the design largely due to these factors. Tens of instructions are wasted just due to these latencies. The schedulers kill the power efficiency by juggling the workload across every core, and this cross movement and clock/voltage lock from/to sleep imposes enough latency to substantially impact performance of low threaded workloads even more. I wonder how the 25 second frequency boost will circumvent these issues to enhance realized performance in addition to the normal Turbo Boost.
- The Uncore containing the L3$, IMC, PLLs and PCU on NHM wasn't clock or power gated in any case, nor could the power and thermal characteristics of these segments be measured and controlled by the PCU according to all available data, so they weren't taken into account for any chip level power and performance management. Hence, the PCU in NHM was a rudimentary implementation and the TDP/TDC and thermal values it reported were only applicable for the 12V supplied Core section it was monitoring. An artifact of this was the misinformed hype of very low power sleep modes due to power gating and sleep transistors, which was applicable only at a Core level and false at an MCM level. At a Core level, the power in sleep was very low but even then, only about even with what AMD's total IMC + PLLs + L3 + HT + Cores combined were drawing without any such deep sleep power gating or HKMG (which was odd). Now that the System Agent (I really do not like these constant changes in terminology every other new release between MFGs) houses the PCU, I wonder if any of the ring bus, L3, PLLs and IMC are actually thermal and power monitored and gated?
- What I don't yet know is, if the L3 is clocked at Core, then what happens when the Core section is put into C6 (in terms of frequency)? Or when 3 of 4 cores are set to C6, so the 1 awake executing is running at pre-set Turbo Boost frequencies or the additional "25s" frequency boost... does Turbo Boost also increase the L3 clocks?
- Initially I was impressed with the vastly low latencies for L3$ being thrown around bereft of context, but they don't look improved at all now. Each core only gets a 2MB slice, akin to POWER7 where each core gets a 4MB L3 slice, and both have 256KB of L2 per core. L3 is clocked about 1.6x higher on mainstream SNB, so the load-to-use time from the 2MB slice is lower because of a different approach with its own inherent tradeoffs.
- With one core executing, or two, does the L3 re-partition itself dynamically between them (a la POWER7)?
- The new display and media engine should really boost uptake in multimedia, mainstream and low-end setups.
- Despite many parts being beefed up including the instruction window, most of the uarch seems to be limited by 4 uops at many levels.
- A slight slip where he states the Scheduler has nearly twice the capacity in SNB, while it is 1.5x bigger.
- There seem to be some really nifty 256b workarounds at the memory and execution stages.
- Intel's GPU driver, which now also manages the L3 usage... enough said I think. They can really bork things up through this. GPU drivers have never been their strong suit.
- Shared L3 with the graphics? The GPU cores can make use of any of the 4x 2MB slices, right? It'll be interesting to see how this works out: the bandwidth which graphics cores love is there, but cache pollution and capacity are problematic compared to even low-end discrete cards, which have 512MB-1GB of dedicated onboard memory at 12-21GB/s available solely to the GPU cores without any sharing. That is especially so in the latest multi-threaded games with heavy AI and Physics, which require heavy CPU processing. Take a low-end HD 5450, decrease its local dedicated memory from 512-1024MB, and see it stutter and fall to a slide show.
- Discrete GPU unit shipments are going down and 70% or so of all shipments have been shown to be sub $100-150... It's true: SNB, if it delivers on the integrated EGP (Embedded Graphics Processor) and media processing front, can kill that 40-50% low-end AMD/Nvidia market completely if AMD/Nv cannot outclass it immediately, since the graphics and media processing will ship with every processor without additional cost/space, i.e. after 18 months we could be looking at this graphics market: Intel 40%, AMD 31%, Nvidia 29%. Intel has the power to pervasively push their platform everywhere.
- Modern MPU performance is critically limited, and hence dictated, by cache miss/hit rates, speculation, branch prediction and Ld-St performance. Memory subsystem latencies also play a major, critical part. There are major Ld-St, cache, branch prediction and memory subsystem improvements in SNB overall. 1/2 the ops being Ld-St at a 2:1 ratio is very much the norm. Combined, I expect re-compiled software to boost performance more than what the early previews portrayed, but I don't think the full potential of these "upgrades" is going to be realized with SNB. I see them as laying the groundwork and expanding the Core for future generations, where these updates will really be able to crank out increased performance and efficiency with some more complementary changes. Intel tends to work as such, laying in future proofing within the designs.
- When I heard about Atom and SNB, I basically thought that what Cyrix did with MediaGX and SiS did with SiS55x more than a decade back is now the focal race in this industry, albeit with block level modularity and flexibility in mind so one could scale up/down as required to fit the various market segments.
- The philosophies have become vastly different at Intel and AMD, each primarily focusing on different aspects. Barcelona and SNB both have a massive list of changes but they only result in mediocre performance increases at the Core level. A decade back, a few changes were netting massive general application performance gains. Nowadays, a great list of changes nets only low to mediocre general performance gains; instead, the updates seem to be primarily boosting performance in niche applications. I suppose the '2x chip logic complexity equates to 1.4x IPC gain' Rule still seems to be sailing smooth, which in turn doesn't make me fond of new processors where most of the additional transistors are for cache, control and buffers.
- AMD held the highest market share in 2006, back when they made $7.5 billion in revenue, which they've never neared since. My take has always been that in any game, if you're in the lead, your competitor can only out-class you where and when you falter. So in many ways, your competitor's success is a reflection of your failings, and vice versa. AMD did that with Opteron/K8 when Intel made a series of major uncalled-for blunders. Ever since 2006 though, Intel hasn't made a slip, let alone at that scale, and has rigorously increased innovation and performance leadership, creating more markets for their processors. SNB strongly continues and builds on that trend, which leaves very little room for AMD to infiltrate or gain market momentum. Due to vastly different processor topologies, instruction set optimizations and latencies between SNB and BD now, software will need to be more radically changed to perform optimally on both architectures. With all of this in view, the big potential, and likely, loser in all this is AMD's BD, as everyone will recompile and optimize for SNB but the same can't be said for BD. Software has now become even more critical to how these processors stand.
- I sat down to write this more than 2 weeks back but never sat back down till now, getting back from hospital to actually finish it. I wanted to read 12-15 papers before posting but didn't get to read any. All the points I had in mind when I finished reading are now lost. So, yep :/
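A footnote on the '2x chip logic complexity equates to 1.4x IPC gain' rule from the notes above: it is usually stated as performance scaling with roughly the square root of logic complexity (often called Pollack's rule), and sqrt(2) is about 1.41. A minimal sketch, assuming that square-root relation holds exactly; the function name is made up, and real designs only follow this loosely.

```python
import math


def expected_speedup(complexity_ratio):
    """Rule-of-thumb speedup for a given increase in core logic complexity.

    Assumes performance ~ sqrt(complexity); this is a rough empirical
    rule, not a law, so treat the result as a ballpark only.
    """
    if complexity_ratio <= 0:
        raise ValueError("complexity_ratio must be positive")
    return math.sqrt(complexity_ratio)


for ratio in (1.5, 2.0, 4.0):
    print(f"{ratio:.1f}x logic -> ~{expected_speedup(ratio):.2f}x performance")
```

Under that assumption the returns keep diminishing: going by the rule, quadrupling the logic only doubles the expected performance.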