Intel's Sandy Bridge II: HD Gfx and AVX
Written by Michael Schuette   
Jan 16, 2011 at 06:38 PM



Advanced Vector Extensions

Advanced Vector Extensions (AVX) are an extension to the x86 instruction set architecture, originally proposed by Intel and geared at increasing floating point performance. Briefly, AVX belongs to the same family of extensions as 3DNow! and the various generations of SSE and falls under the umbrella of single instruction/multiple data (SIMD) extensions. AVX increases the width of the SIMD register file from 128 bits to 256 bits, and the register set is renamed from XMM0-15 to YMM0-15. Legacy SSE instructions operate on the lower half of each YMM register, that is, bits 0-127.
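
To make the relationship between the two register widths concrete, here is a minimal sketch in C using compiler intrinsics (a hypothetical example, not taken from any of the benchmarks discussed here; compile with e.g. gcc -mavx). The SSE add handles four single-precision floats per instruction, while the AVX add handles eight across the full YMM width:

    /* Minimal sketch: a 128-bit SSE add versus its 256-bit AVX
     * counterpart. Hypothetical example, compile with e.g. gcc -mavx. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float r[8];

        /* Legacy SSE: operates on XMM, i.e. bits 0-127 of the YMM register */
        __m128 x = _mm_loadu_ps(a);
        __m128 y = _mm_loadu_ps(b);
        _mm_storeu_ps(r, _mm_add_ps(x, y));       /* four floats per add  */

        /* AVX: one instruction covers the full 256-bit YMM register */
        __m256 u = _mm256_loadu_ps(a);
        __m256 v = _mm256_loadu_ps(b);
        _mm256_storeu_ps(r, _mm256_add_ps(u, v)); /* eight floats per add */

        printf("%.0f ... %.0f\n", r[0], r[7]);
        return 0;
    }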

AVX is supposed to dramatically increase floating point performance by doubling the register width and, with it, the floating point throughput. So far, no final release of any Microsoft OS supports the AVX instruction set; however, there is a silver lining on the horizon in that the latest beta and RCs of Windows 7 SP1 finally implement AVX support. Given the essentially non-existent OS support (which will, of course, change), it is not surprising that there are currently no applications that take advantage of the doubled floating point throughput. Notable exceptions are SiSoft Sandra and AIDA64, both of which provide at least theoretical benchmarks.
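
The reason OS support is required at all is that the operating system has to save and restore the wider YMM state on every context switch. An application therefore has to verify both CPU and OS support before issuing AVX instructions. The sketch below shows the usual CPUID/XGETBV check using GCC/Clang built-ins; it is an illustrative example under those assumptions (compile with e.g. gcc -mavx -mxsave), not code from any of the benchmarks:

    /* Minimal sketch of the runtime check needed before issuing AVX
     * instructions: CPUID must report AVX and OSXSAVE, and XGETBV must
     * show that the OS saves XMM/YMM state on context switches - which
     * is exactly what Windows 7 SP1 adds. */
    #include <cpuid.h>      /* __get_cpuid */
    #include <immintrin.h>  /* _xgetbv     */
    #include <stdio.h>

    static int avx_usable(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        /* ECX bit 28: AVX supported; bit 27: OS uses XSAVE/XRSTOR */
        if (!(ecx & (1u << 28)) || !(ecx & (1u << 27)))
            return 0;
        /* XCR0 bits 1 and 2: OS preserves XMM and YMM state */
        return (_xgetbv(0) & 0x6) == 0x6;
    }

    int main(void)
    {
        printf("AVX usable: %s\n", avx_usable() ? "yes" : "no");
        return 0;
    }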

We used SiSoft Sandra's multimedia processor benchmark to see whether there was any truth to the claims:



The top portion shows the Core i7-2600K with AVX enabled in the benchmark; the bottom part shows the same system configuration (Bearup Lake) running with AVX disabled. Keep in mind that in the latter case the upper half of the register file is not used, in other words, only half of the floating point path is exercised. The first line shows the aggregate value of integer and floating point performance, followed by the integer throughput. Please note that there is only a marginal difference in integer performance, which is consistent with the parameters tested, in that integer instructions on YMM registers may only be supported by future implementations of AVX (tentatively AVX3). The third and fourth lines show the floating point throughput using either SSE or AVX instructions. The benchmarks suggest that there is indeed a doubling of FPU performance when the full register width is used.

In real-life applications, the situation is somewhat different. We were lucky enough to hook up with Inartis to run the latest version of Kribibench using Kribiplayer, a browser-based 3D modeling/rendering program built on the Kribi 3.0 engine. As a word of caution, the benchmark was developed completely blind and is at this moment neither optimized for AVX nor tuned to make the best use of it, so please regard these benchmarks as work in progress.

We ran all scenes, but the relative results were essentially the same regardless of which scene we used. Needless to say, "legacy processors" such as the Phenom II or the Gulftown cannot take advantage of AVX instructions, nor can the Shanghai-based Opterons (all of which were run as controls and for reference).



The graph shows the average fps for a single run at 1024x768 with AVX enabled (top bar) and disabled (lower bar). It is quite obvious that there is an approximately 11% speed-up on both the 2500K and the 2600K, whereas the control CPUs show no effect from enabling/disabling AVX. As mentioned above, this version of Kribibench is still work in progress; it is very likely that future versions will show a greater speed-up than the 11% achieved here. However, there are also a number of caveats regarding real-world applications as opposed to synthetic workloads.

  • 1) The access path of the L2 cache in SB is 128 bits wide, so any application relying on the "mid-level" cache will face the issue that AVX operands have to be split into two parts and then reassembled to the full 256 bits. This means that SB inherently favors legacy SSE instructions over AVX, even if the latter are officially supported.
    Update: This information from Intel's software developer forum turned out to be incorrect; the access path is 32 bytes or 256 bits wide. This is consistent with cache bandwidth measurements using Sandra that show 22 bytes transferred per cycle (which would be impossible to achieve with a 16-byte wide access path).
  • 2) The instruction cache, with its optimizations for decoded µops, will in its present form disfavor AVX, simply because of size restrictions.
    Update: After reading up a bit more on Intel's Software Forum, this argument does not seem to hold.
  • 3) Even if the L2 interconnect path is increased to 256 bits, only fully aligned vectors can pass in a single cycle, whereas everything else needs to be split into an upper and a lower segment (see the alignment sketch after this list).
    Update: The problem with this argument is that only full cache lines are transferred anyway, meaning that, inherently, there is at least some kind of alignment.
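
As an illustration of the alignment point raised in item 3, the following hypothetical C sketch keeps all 256-bit vectors on 32-byte boundaries, so that each load or store stays within a single 64-byte cache line (again a minimal example, compiled with e.g. gcc -mavx, not code from any benchmark shown here):

    /* Minimal sketch of keeping 256-bit vectors fully aligned: allocate
     * on 32-byte boundaries and use the aligned load/store intrinsics so
     * no vector straddles a cache line. */
    #include <immintrin.h>
    #include <stdlib.h>

    int main(void)
    {
        /* _mm_malloc guarantees the requested 32-byte alignment */
        float *buf = _mm_malloc(1024 * sizeof(float), 32);
        if (!buf) return 1;

        for (int i = 0; i < 1024; i++)
            buf[i] = (float)i;

        for (int i = 0; i < 1024; i += 8) {
            /* aligned load: the whole YMM vector sits in one cache line */
            __m256 v = _mm256_load_ps(buf + i);
            v = _mm256_mul_ps(v, v);     /* square each element */
            _mm256_store_ps(buf + i, v);
        }

        _mm_free(buf);
        return 0;
    }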

Given all these limitations and restrictions, it is reasonable to assume a 25-30% performance increase in AVX-enabled applications, rather than a doubling in performance as suggested by theoretical benchmarks.



