Jason D. Clinton's blog: Utility Monsters - GCC Profile-guided optimization

For the past few work days, I have been running benchmarks on some scientific code built five ways: GCC + FFTW3, GCC + FFTW3 + Profiling, Intel Compiler (ICC) + FFTW3, ICC + MKLFFTW3, and ICC + MKLFFTW3 + Profiling. After compiling, the binaries are run on an AMD Shanghai and an Intel Stoakley Harpertown, the top-of-the-line chips from each maker. Obviously, with the higher clock rate, there is a higher level of performance from the Intel chip. But does the percentage increase in cost buy an equal percentage increase in performance? Across all compilers? The answer, surprisingly, is no, not by a long shot.

The first surprise was that GCC 4.1 (which is ancient at this point) with profile-guided optimization just barely beat (by less than 1%) the Intel compiler's profile-guided optimization on this code. Even with its binaries specifically tuned for the Core 2 architecture, Intel could not overtake the generic code produced by GCC with profile-guided optimization.
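The exact build commands are not shown here, so the following is only a sketch of the usual GCC profile-guided workflow: build with `-fprofile-generate`, run the instrumented binary on representative input so it writes its profile data, then rebuild the same source with `-fprofile-use`. The file names (`solver.c`, `solver`, `training-input.dat`), the `-O2` level, and the link flags are assumptions for illustration, not the settings used in these benchmarks.

```python
def pgo_commands(source, binary, training_args,
                 opt_level="-O2", libs=("-lfftw3", "-lm")):
    """Return the three command lines of a GCC profile-guided build.

    All file names and the optimization level are hypothetical placeholders.
    """
    # Step 1: instrumented build that records execution counts at run time.
    instrumented = ["gcc", opt_level, "-fprofile-generate",
                    source, "-o", binary, *libs]
    # Step 2: training run on representative input; this writes .gcda files.
    training_run = [f"./{binary}", *training_args]
    # Step 3: rebuild the same source with the same options, now using the profile.
    optimized = ["gcc", opt_level, "-fprofile-use",
                 source, "-o", binary, *libs]
    return [instrumented, training_run, optimized]


for cmd in pgo_commands("solver.c", "solver", ["training-input.dat"]):
    print(" ".join(cmd))
```

The quality of the result depends on the training input: the profile only helps if the training run exercises the same hot paths as the real workload.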

The second surprise was the vast difference in power versus performance versus cost. Yes, the AMD chips are slower. However, they consume, on the whole, drastically less power. And when comparing the AMD part at "the sweet spot" (where the price has not yet started climbing steeply with clock rate) against the Intel part at its own sweet spot, the AMD platform wins hands down.

The conclusion: if you are buying a lot of hardware to fill a fixed power budget, you get far more machines per unit of available power with AMD, for a little bit more money. Though those additional machines cost a little more, the aggregate increase in performance makes up for it three times over.
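The shape of that argument can be sketched as a small calculation: fix a power budget, fill it with as many machines as it will carry, and compare total cost against aggregate throughput. Every figure below is an invented placeholder chosen only to show the arithmetic; none of them are measurements from these benchmarks, and the `Node` type and `fill_power_budget` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    watts: float       # power draw per machine under load
    cost: float        # purchase price per machine
    throughput: float  # benchmark work units per hour per machine


def fill_power_budget(node, budget_watts):
    """Return (machine count, total cost, aggregate throughput) for a budget."""
    count = int(budget_watts // node.watts)
    return count, count * node.cost, count * node.throughput


# Placeholder figures, invented for illustration only.
slow_cool = Node("slower, lower-power part", watts=250.0, cost=3000.0, throughput=80.0)
fast_hot = Node("faster, higher-power part", watts=400.0, cost=4000.0, throughput=100.0)

BUDGET_WATTS = 10_000.0
for node in (slow_cool, fast_hot):
    count, cost, total = fill_power_budget(node, BUDGET_WATTS)
    print(f"{node.name}: {count} machines, cost {cost:.0f}, throughput {total:.0f}")
```

With these made-up inputs the lower-power part fits more machines into the same budget, costs more in total, and delivers more aggregate throughput, which is the pattern described above; real figures would have to come from measured power draw and benchmark results.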

However, I suspect, based on the public benchmarks of Core i7, that all of these conclusions will be completely null and void as soon as the Core i7 architecture makes it to a server platform.

You may have noticed that I did not give any solid numbers above. That is intentional.

All these compiler optimizations made me wonder how much of an effect this would have on Mesa software rendering for GNOME Shell. Based on my experiences here, I would guess that a good profiling run would increase Mesa software rendering performance by ~20%.