Re: [AMBER-Developers] Grouping calculations together... is that what makes things go fast?

From: Duke, Robert E Jr <rduke.email.unc.edu>
Date: Fri, 21 Oct 2011 05:15:56 +0000

Hi Dave,
Ross pretty much hit the high points. Branch prediction works very well on the Intel architecture (non-Itanium); these chips are indeed very heavily pipelined, but nothing like the Itanium. A branch misprediction on the Itanium is murder. Fortunately, the Itanium is about gone from the HPC scene, as it optimizes differently from everything else. The metric I have read for the Intel chips that are now everywhere is that if you get branch prediction right 95% or more of the time, the branching has essentially nil impact. The other factor, as Ross says, is that these chips are significantly faster if they can utilize vector instructions.

So I think I pioneered the scheme he mentions of batching up data in vector caches (in Amber, at any rate). I believe what happens is that the mispredictions during the data movement are relatively cheap; you then loop over the math with no mispredictions, and the data is laid out in such a manner that the compiler can take full advantage of the vector instruction set. Re-use of data space on the stack is typically useful too - it cuts down on memory access times. Lots of other things can be said about tricks for non-parallel performance, but compilers these days mostly cut the workload for the programmer; kind of takes all the fun out. Early on in pmemd development I figured out a big performance win that came from code changes that significantly reduced cache misses; this sort of thing is less important now that chip cache sizes have gotten pretty big, but it is still something to think about when you deal with lots of data.

When pmemd was first written, the chip sets Amber ran on were much more diverse than now, so I opted to write a range of implementations and then tested each chip set on all of them. We had four basic groups of chips - Intel architecture (including AMD Opteron), Itanium, various RISC chips with subtle issues, and the IBM SPx chips. The IBM chips tended to optimize a lot like the Intel architecture, but the Itanium and RISC chips would throw you a curve. So I basically had something like 12 different ways you could compile the most performance-critical code, and I tried them all on the different machines. Things should be significantly simpler now. I would say that even with the multiple architectures, the serial optimization was on the order of 10-20% of the work; far more time was spent on parallel issues.
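In rough terms, the batching pattern looks something like the sketch below - a branchy gather pass that packs the surviving neighbors into contiguous scratch arrays, followed by a branch-free math loop over the packed data. The names and the single Lennard-Jones type are illustrative only, and it is written in C just for illustration (pmemd itself is Fortran); this is not pmemd source.

/* Two-pass "vector cache" pattern (illustrative sketch, not pmemd code). */
static void lj_forces_batched(int nneigh, const int *restrict nbr,
                              const double *restrict x,
                              const double *restrict y,
                              const double *restrict z,
                              double xi, double yi, double zi,
                              double cut2, double aij, double bij,
                              double *restrict scratch, double fi[3])
{
  double *pdx = scratch;
  double *pdy = scratch + nneigh;
  double *pdz = scratch + 2*nneigh;
  double *pr2 = scratch + 3*nneigh;
  int npack = 0;

  /* Pass 1: branchy gather.  A misprediction here only costs data movement. */
  for (int j = 0; j < nneigh; j++) {
    double dx = xi - x[nbr[j]];
    double dy = yi - y[nbr[j]];
    double dz = zi - z[nbr[j]];
    double r2 = dx*dx + dy*dy + dz*dz;
    if (r2 < cut2) {
      pdx[npack] = dx;
      pdy[npack] = dy;
      pdz[npack] = dz;
      pr2[npack] = r2;
      npack++;
    }
  }

  /* Pass 2: branch-free math over packed, contiguous data; the compiler is
   * free to unroll and vectorize this loop.                                */
  for (int k = 0; k < npack; k++) {
    double invr2 = 1.0 / pr2[k];
    double invr6 = invr2 * invr2 * invr2;
    double fmag  = (12.0*aij*invr6 - 6.0*bij) * invr6 * invr2;
    fi[0] += fmag * pdx[k];
    fi[1] += fmag * pdy[k];
    fi[2] += fmag * pdz[k];
  }
}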
Regards - Bob

________________________________________
From: dcerutti.rci.rutgers.edu [dcerutti.rci.rutgers.edu]
Sent: Thursday, October 20, 2011 10:15 AM
To: amber-developers.ambermd.org
Subject: [AMBER-Developers] Grouping calculations together... is that what makes things go fast?

Hello,

Yesterday I spent a bit of time re-implementing some things that I had
taken out of mdgx to make the code cleaner. Turns out, they don't
interfere with any of the new functionality I've put in the code since
May, so bringing them back has been a real performance win. mdgx is now
on the heels of the CPU pmemd in terms of serial performance. I'm
convinced that the parallel performance is a matter of load-balancing at
this point.

What I did was re-introduce some "double steps" in the inner loop--I compute r2 for two
inter-atom distances and then perform the two interaction calculations simultaneously with
different local variables, to keep the arithmetic units saturated on each clock cycle. Then
I looked at what pmemd does: it computes all of the r2 values for one atom interacting with
its neighbors and stores the results, then applies them in subsequent loops. So I tried an
experiment with the attached C code and found that double-stepping in that case didn't help.
Is it the case that if you just have a long list of arithmetic operations *without
conditionals*, a good compiler will unroll the loop on its own as far as it is able?
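
For concreteness, the two patterns I was comparing look roughly like this (a toy r^-6 sum
with made-up names, not the attached code):

/* Version A: hand "double step" - unroll by two with independent locals so
 * the two chains of floating-point operations can overlap in the pipeline. */
static double sum_pairs_unrolled2(int n, const double *restrict r2)
{
  double acc0 = 0.0;
  double acc1 = 0.0;
  int i;
  for (i = 0; i < n - 1; i += 2) {
    double inv0 = 1.0 / r2[i];
    double inv1 = 1.0 / r2[i + 1];
    acc0 += inv0 * inv0 * inv0;   /* r^-6 term, chain 0 */
    acc1 += inv1 * inv1 * inv1;   /* r^-6 term, chain 1 */
  }
  if (i < n) {                    /* leftover element when n is odd */
    double inv = 1.0 / r2[i];
    acc0 += inv * inv * inv;
  }
  return acc0 + acc1;
}

/* Version B: the plain loop.  With no conditional in the body, an
 * optimizing compiler is generally free to unroll and vectorize it
 * on its own.                                                       */
static double sum_pairs_plain(int n, const double *restrict r2)
{
  double acc = 0.0;
  for (int i = 0; i < n; i++) {
    double inv = 1.0 / r2[i];
    acc += inv * inv * inv;
  }
  return acc;
}

That the compiler can already do this to version B would explain why the hand double-step
bought nothing in that experiment.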

It would seem, then, that the real difference between mdgx and pmemd
performance is that I should be grouping similar calculations into as many
unbranched loops as I can and thereafter culling as many conditionals as
possible--is that roughly correct?
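
(As one example of what I mean by culling conditionals: the cutoff test can be turned into
an arithmetic mask so the loop body stays straight-line. This is just a sketch with made-up
names, and whether it actually wins depends on how many pairs survive the cutoff.)

/* Branchless cutoff: the ternary typically compiles to a compare/select
 * rather than a branch.  Assumes every r2 in the list is positive.       */
static double energy_branchless(int n, const double *restrict r2, double cut2)
{
  double e = 0.0;
  for (int i = 0; i < n; i++) {
    double mask  = (r2[i] < cut2) ? 1.0 : 0.0;
    double invr2 = 1.0 / r2[i];
    double invr6 = invr2 * invr2 * invr2;
    e += mask * invr6;
  }
  return e;
}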

Dave
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Thu Oct 20 2011 - 22:30:02 PDT