Hello,
Yesterday I spent a bit of time re-implementing some things that I had
taken out of mdgx to make the code cleaner. It turns out they don't
interfere with any of the new functionality I've put into the code since
May, so bringing them back has been a real performance win. mdgx is now
on the heels of the CPU pmemd in terms of serial performance. I'm
convinced that the remaining parallel gap is a matter of load balancing
at this point.
What I did was re-introduce some "double steps" in the inner loop--I
calculate r2 for two inter-atom distances and then perform the two
interaction calculations simultaneously, with different local variables,
to saturate the arithmetic units on each clock cycle. Then I looked at
what pmemd does: it computes all r2 values for one atom interacting with
its neighbors and stores the results, then applies them in subsequent
loops. So I tried an experiment with the attached C code and found that
double-stepping didn't help in that case. Is it the case that, if you
have a long list of arithmetic operations *without conditionals*, a good
compiler will unroll the loop on its own as far as it is able?
It would seem, then, that the real difference between mdgx and pmemd
performance is that I should be grouping similar calculations into as
many unbranched loops as I can, and thereafter culling as many
conditionals as possible--is that roughly correct?
Dave
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers@ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Thu Oct 20 2011 - 10:30:02 PDT