FLOPS Becoming Almost Free With Intel’s AVX

HPC and other hyperscale customers should be excited about Intel’s launch today of the Intel® Xeon® Processor E5-2600 Product Family. The Register has a nice article detailing some of the features of the new processor, including this diagram of the chip.

Of particular interest to HPC customers are Intel’s new AVX units on the cores, which can execute twice the floating point operations per cycle of the existing Xeon X5600 processors. So while the FLOPS aren’t exactly free, you could say half the FLOPS are free compared to current Xeon processors. Not bad.

Look closer at the above chip diagram, though. While the functional blocks are not drawn exactly to scale, you will still notice that the execution units, which include the AVX logic, take up a relatively small part of the processor. Much of the die is occupied by L1 and L2 cache and associated logic, out-of-order scheduling, and other advanced features that give the processor its overall performance and allow AVX to actually deliver double the FLOPS. On the Xeon E5-2600 HPC cluster that HP delivered to Purdue University back in October 2011, AVX boosted Linpack performance from 149 GFLOPS/node (measured with AVX turned off) to 294 GFLOPS/node with AVX turned on and Intel’s MKL math library. That is pretty darn close to double.
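As a back-of-the-envelope check, the doubling falls straight out of the peak-FLOPS arithmetic. (The 2.6 GHz, 8-core figures below are my illustrative assumptions for the Carter nodes, not confirmed specs.)

```python
# Back-of-the-envelope peak FLOPS per 2-socket node; the 2.6 GHz
# and 8-core figures are assumptions for illustration only.
sockets = 2
cores_per_socket = 8
ghz = 2.6

flops_per_cycle_sse = 4   # 128-bit SSE: 2 DP adds + 2 DP muls per cycle
flops_per_cycle_avx = 8   # 256-bit AVX: twice the width, twice the FLOPS

peak_sse = sockets * cores_per_socket * ghz * flops_per_cycle_sse  # GFLOPS
peak_avx = sockets * cores_per_socket * ghz * flops_per_cycle_avx  # GFLOPS

print(peak_sse)                  # theoretical peak without AVX, in GFLOPS
print(peak_avx)                  # theoretical peak with AVX, in GFLOPS
print(round(294 / peak_avx, 2))  # HPL efficiency implied by the measured 294
```

Under those assumed clock and core counts, the measured 294 GFLOPS/node would work out to roughly 88% of theoretical peak, which is in the range you would expect from a well-tuned HPL run.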

The Purdue “Carter” system’s Top500 HPL score of 186.9 TF, listed at #54 on the November 2011 Top500 list, used only 257 kW of power, which at the time was a record for a non-accelerated x86 system. Of course, even better performance/watt is possible using acceleration technology such as NVIDIA’s Tesla GPUs and the upcoming Intel MIC and AMD Fusion APU technologies. So while a lot of hardware and software effort will continue to go into increasing the FLOPS/watt of a processor, as you can imagine by looking at the block diagram above, an increasing amount of power in future general-purpose processors like x86 will go to cache and other functions, which I would group into “data movement” operations as opposed to FLOPS operations.
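To put that record in concrete terms, the performance-per-watt figure follows directly from the Top500 numbers quoted above:

```python
# Performance/watt from the quoted Carter Top500 figures.
hpl_gflops = 186.9 * 1000   # 186.9 TF HPL score
power_watts = 257 * 1000    # 257 kW measured power draw

mflops_per_watt = hpl_gflops * 1000 / power_watts
print(round(mflops_per_watt))  # roughly 727 MFLOPS/W
```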

Intel’s MIC compilers and technologies like OpenACC from NVIDIA are likely to continue to improve the FLOPS you can get out of your floating point hardware, but even the best compilers can’t extract parallelism from code if the underlying algorithm is serial. The challenge for software architects is thus rapidly changing from worrying about FLOPS to worrying about minimizing data movement (what all those other parts of the processor are mostly doing), which inherently requires you to think about the parallelism of your algorithm.
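A toy illustration of the point (my own sketch, not tied to any particular compiler): the first loop below carries a dependence from one iteration to the next, so as written no compiler can spread its adds across AVX lanes, while the second loop is trivially parallel even though both do comparable work per element.

```python
# Two loops with similar FLOP counts but very different parallelism.

def running_sum(xs):
    # Serial as written: each iteration depends on the previous
    # result, so these adds cannot be vectorized automatically.
    out, acc = [], 0.0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def scaled(xs):
    # Parallel: every element is independent, so these multiplies
    # map directly onto wide vector units.
    return [2.0 * x for x in xs]

print(running_sum([1.0, 2.0, 3.0]))  # [1.0, 3.0, 6.0]
print(scaled([1.0, 2.0, 3.0]))       # [2.0, 4.0, 6.0]
```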

But for today, congratulations to Intel on their launch. With AVX and a host of other new HPC improvements, the E5-2600 is going to be a great processor for HPC. And of course, when coupled with HP’s new ProLiant Gen8 servers and HPC networking, storage, and management software from HP, you have one of the world’s most self-sufficient and powerful HPC solutions. While HP announced a whole range of new Gen8 servers powered by the E5-2600 processor today, two of the servers designed from the ground up for HPC are the SL230s Gen8 and SL250s Gen8. A good place to learn more about the SL230s and SL250s is to click on “learn more” under the “See the Portfolio” banner on the ProLiant Gen8 launch page. HP has already shipped thousands of SL200 series servers as part of Intel’s “early ship” program, including the 648 SL230s systems in the Purdue Carter cluster.

About Marc Hamilton

Marc Hamilton – Vice President, Solutions Architecture and Engineering, NVIDIA. At NVIDIA, the Visual Computing Company, Marc leads the worldwide Solutions Architecture and Engineering team, responsible for working with NVIDIA’s customers and partners to deliver the world’s best end to end solutions for professional visualization and design, high performance computing, and big data analytics. Prior to NVIDIA, Marc worked in the Hyperscale Business Unit within HP’s Enterprise Group where he led the HPC team for the Americas region. Marc spent 16 years at Sun Microsystems in HPC and other sales and marketing executive management roles. Marc also worked at TRW developing HPC applications for the US aerospace and defense industry. He has published a number of technical articles and is the author of the book, “Software Development, Building Reliable Systems”. Marc holds a BS degree in Math and Computer Science from UCLA, an MS degree in Electrical Engineering from USC, and is a graduate of the UCLA Executive Management program.

5 Responses to FLOPS Becoming Almost Free With Intel’s AVX

  1. Pingback: Marc Hamilton: FLOPS Almost Free With Intel’s AVX | insideHPC.com

  2. Wolfgang says:

    I’m interested in the new HP SL230s Gen8 server and I’m trying to compare price/performance to the prior generation HP SL390s G7. Unfortunately there seems to be no way to determine the SL230 price with 2 x 10 GbE and 64 GB RAM, etc. All HP DL and SL models used to have a “customize configurable model” web link that would allow all configuration options to be selected, showing the detailed prices and totals. With the new website this configuration ability has disappeared (except for being able to select the CPU frequency), and the resulting webapp is a step back: confusing, with non-transparent pricing. Any chance the Gen8 servers could retain the same “customize your model” web functionality? That would be great.

  3. Hi Wolfgang,
    Sorry you are having problems comparing models. We certainly are not trying to be non-transparent. Here are a couple of suggestions:
    – Use your favorite search engine and search for “HP SL230s Gen8 Models”; this will return several useful links, including one that compares a number of G7 and Gen8 models.
    In an attempt to simplify and standardize SL naming, the SL200 line is now used for so-called “hyperscale” models addressing the HPC and scale-out web market. This has resulted in some naming changes from the G7 line, with the SL390s 1U G7 being replaced by the SL230s Gen8 and the SL390s 2U G7 being replaced by the SL250s Gen8. I know this isn’t super intuitive, but in the long run the new SL200 series naming will be a simplification.
    A few other things to remember, especially for HPC use:
    – The G7 servers based on Xeon Westmere support 3 memory channels per socket, so common memory configurations include 6 and 12 memory DIMMs, i.e. 6x4GB or 12x4GB.
    – The Gen8 servers based on Xeon Sandy Bridge support 4 memory channels per socket, so common memory configurations include 8 and 16 memory DIMMs, i.e. 8x4GB or 16x4GB.
    – For best performance you always want at least 1 memory DIMM per channel, so while 48 GB and 96 GB were common G7 configurations, these wouldn’t be optimal Gen8 configurations.
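The one-DIMM-per-channel rule is easy to sanity-check. Here is a hypothetical helper (the names are my own for illustration, not an HP tool), assuming the 4 memory channels per socket of the Sandy Bridge E5:

```python
# Hypothetical config check: Sandy Bridge E5 has 4 memory channels
# per socket, and performance suffers if any channel is left empty
# or populated unevenly.
CHANNELS_PER_SOCKET = 4

def balanced(dimms_per_socket):
    # Every channel populated, and populated evenly.
    return (dimms_per_socket >= CHANNELS_PER_SOCKET
            and dimms_per_socket % CHANNELS_PER_SOCKET == 0)

print(balanced(8))  # True: 2 DIMMs on each of the 4 channels
print(balanced(6))  # False: a 48 GB-style G7 layout populates unevenly
```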

    I hope this helps. Please let me know if you have any additional questions about HP’s Gen8 servers or if you would like to discuss different configurations directly with one of our HPC technical experts.


  4. Wolfgang says:

    Hi Marc,

    Thanks for the help, especially around more memory bandwidth via the additional memory channel. Much appreciated. For the Gen8 I’m considering 8x4GB or 8x8GB per socket, all with DDR3-1600.

    I finally managed to find the product customization web page I was looking for, although through Google rather than browsing around the HP website. Here it is: http://h71016.www7.hp.com/dstore/ctoBases.asp?ProductLineId=431&FamilyId=3537

    Note that this page cannot be reached from the main page you cited above, neither directly nor indirectly. Perhaps something for the HP website team to look into. By the way, the same website problem exists for other Gen8 server product lines as well.


    • Hi Wolfgang,
      Yes, our web page can sometimes be a bit challenging to navigate and something that is a priority to improve. On the memory, current US web site list pricing for a 4GB DDR3 4GB DIMM is $135 vs $219 for an 8GB DIMM, so from price/capacity point of view, 8 GB DIMMs are lower cost. In some cases you may get slightly better performance from 2x4GB DIMMs than from 1x8GB DIMM, but 2x4GB DIMM will also use more power. Another way to look at memory is that 8x4GB (32 GB) with two 8-core CPUs is giving you 2 GB/core and when you use hyperthreading and run 2 threads/core that drops you down to 1GB/thread which is not enough for many apps. With the new Sandybridge servers, from my somewhat random sampling of customers, it does seem like most customers are moving to at least 64 GB of memory.
