Re: [AMBER-Developers] Grouping calculations together... is that what makes things go fast? from dcerutti.rci.rutgers.edu on 2011-10-23 (Amber Developers Archive Oct 2011)

From: <dcerutti.rci.rutgers.edu>
Date: Sun, 23 Oct 2011 14:11:12 -0400 (EDT)

Many thanks to all who have responded here; I spent a little more time
yesterday trying to get all this figured out. The caching of r2 values
and displacements for successful pairs that Bob introduced is now
implemented; surprisingly, it doesn't always beat my earlier
implementation despite all the branching that that thing did! However, it
does make the code somewhat more understandable, and the new
implementation supports what I want to do in some other ways so I will
keep this one.
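
As a rough illustration of the caching idea described above (a sketch
only, not the actual mdgx or pmemd code), the work can be split into two
passes: the first tests every candidate against the cutoff and stores the
displacement and r2 of each successful pair, and the second evaluates the
interaction over the cached pairs with no cutoff branch inside the loop.
The names CachedPair, cache_pairs and lj_from_cache are hypothetical, and
the Lennard-Jones form in reduced units (epsilon = sigma = 1) is a toy
stand-in for the real nonbonded terms.

```c
#include <assert.h>

/* Hypothetical record for one pair that passed the cutoff test. */
typedef struct {
  int    idx;          /* index of the partner atom j            */
  double dx, dy, dz;   /* displacement from atom i to atom j     */
  double r2;           /* squared distance between i and j       */
} CachedPair;

/* Pass 1: test all n candidates against the squared cutoff and cache
   the survivors. The cache must have room for n entries.
   Returns the number of cached pairs. */
static int cache_pairs(double xi, double yi, double zi,
                       const double *x, const double *y, const double *z,
                       int n, double cut2, CachedPair *cache)
{
  int j, nc = 0;

  assert(n >= 0);
  for (j = 0; j < n; j++) {
    double dx = x[j] - xi;
    double dy = y[j] - yi;
    double dz = z[j] - zi;
    double r2 = dx*dx + dy*dy + dz*dz;
    if (r2 < cut2) {
      cache[nc].idx = j;
      cache[nc].dx = dx;
      cache[nc].dy = dy;
      cache[nc].dz = dz;
      cache[nc].r2 = r2;
      nc++;
    }
  }
  return nc;
}

/* Pass 2: loop over cached pairs only; no cutoff test in this loop.
   Toy Lennard-Jones energy with epsilon = sigma = 1 (an assumption for
   illustration). Accumulates the force on atom i into fi[0..2] and
   returns the energy. */
static double lj_from_cache(const CachedPair *cache, int nc, double *fi)
{
  int k;
  double e = 0.0;

  for (k = 0; k < nc; k++) {
    double inv_r2 = 1.0 / cache[k].r2;
    double r6i    = inv_r2 * inv_r2 * inv_r2;
    double r12i   = r6i * r6i;
    double fmag   = 24.0 * (2.0*r12i - r6i) * inv_r2;
    e += 4.0 * (r12i - r6i);
    /* fmag > 0 is repulsive: atom i is pushed away from atom j */
    fi[0] -= fmag * cache[k].dx;
    fi[1] -= fmag * cache[k].dy;
    fi[2] -= fmag * cache[k].dz;
  }
  return e;
}
```

Whether this beats a single loop with the cutoff test inline depends on
the compiler and the hit rate of the cutoff test, which fits the mixed
timing results reported above.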

I STILL had to manually replicate a lot of code to make this stuff go
faster: I got about a 10% speedup by quad-stepping (unrolling four-fold)
the loop in the r2 calculation, and then some more by double-stepping the
loops for the various types of interactions. The flow of information is
still readable, but it invites a familiar complaint: "why did they go and
manually unroll 1000 lines of code?" Double-stepping the r2 calculation
pre-loop doesn't make things look too bad; someone might want to give
that a shot in pmemd.
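
For readers unfamiliar with the term, quad-stepping an r2 pre-loop might
look roughly like the sketch below: four candidates are handled per
iteration, and a short remainder loop picks up the leftovers. This is an
illustration under assumed names (r2_quad_step and its arguments are
hypothetical), not the code in mdgx.

```c
#include <assert.h>

/* Fill r2[j] with the squared distance from (xi, yi, zi) to atom j for
   j in [0, n), processing four atoms per iteration of the main loop. */
static void r2_quad_step(double xi, double yi, double zi,
                         const double *x, const double *y, const double *z,
                         int n, double *r2)
{
  int j;
  int n4;

  assert(n >= 0);
  n4 = n - (n % 4);   /* largest multiple of 4 not exceeding n */

  /* Main loop: four independent r2 evaluations per pass */
  for (j = 0; j < n4; j += 4) {
    double dx0 = x[j]   - xi, dy0 = y[j]   - yi, dz0 = z[j]   - zi;
    double dx1 = x[j+1] - xi, dy1 = y[j+1] - yi, dz1 = z[j+1] - zi;
    double dx2 = x[j+2] - xi, dy2 = y[j+2] - yi, dz2 = z[j+2] - zi;
    double dx3 = x[j+3] - xi, dy3 = y[j+3] - yi, dz3 = z[j+3] - zi;
    r2[j]   = dx0*dx0 + dy0*dy0 + dz0*dz0;
    r2[j+1] = dx1*dx1 + dy1*dy1 + dz1*dz1;
    r2[j+2] = dx2*dx2 + dy2*dy2 + dz2*dz2;
    r2[j+3] = dx3*dx3 + dy3*dy3 + dz3*dz3;
  }

  /* Remainder loop: at most three atoms left over */
  for (; j < n; j++) {
    double dx = x[j] - xi, dy = y[j] - yi, dz = z[j] - zi;
    r2[j] = dx*dx + dy*dy + dz*dz;
  }
}
```

The four evaluations in the main loop have no dependencies on one
another, which gives the compiler more freedom to schedule them; how
much that helps in practice will vary with compiler and hardware.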

One thing that I didn't see helping at all was SSE... currently "make
install" on mdgx will use -ip -O3 only... I tried adding -xW -w -tpp7 -SSE
to those flags and saw zero change. -vec-report output from the Intel
compiler shows vectorization happening in some places but none at all in
the nonbonded routines where I've been working lately.

Finally, I tried the JAC benchmark; the typical serial PMEMD installation
takes 1300 seconds, the serial mdgx 1646. So, PMEMD is still about 27%
faster on something like that (1646/1300 is roughly 1.27). The difference
would be more pronounced if the time step were 1 fs: PMEMD would rebuild
its pairlist less frequently, whereas the time step size doesn't affect
mdgx at all. Compiling the serial PMEMD with fftw would give it another
5-7% speed advantage. It seems that my on-the-fly pairlist building scheme
is still sensitive to cases where the subdomains do not tightly match the
size of the cutoff; I tried some cubic water boxes (TIP3P and TIP4P) which
were sized so that the subdomains were almost exactly the length of the
cutoff, and the latest mdgx is neck-and-neck with pmemd in those cases.
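
To make the subdomain-versus-cutoff point concrete, here is a small
sketch under a stated assumption: each subdomain edge must be at least
as long as the cutoff, so the number of subdomains along a box edge is
the box length divided by the cutoff, rounded down. The function
fit_cells, the CellFit struct, and the box sizes in the note below are
hypothetical and are not taken from mdgx or from the benchmarks above.

```c
#include <assert.h>

/* Hypothetical summary of how subdomains fit along one box edge. */
typedef struct {
  int    ncell;    /* number of subdomains along the edge           */
  double edge;     /* resulting subdomain edge length               */
  double excess;   /* (edge / cutoff)^3: subdomain volume relative
                      to a cube whose edge equals the cutoff        */
} CellFit;

/* Assumes a cubic box and subdomains no shorter than the cutoff. */
static CellFit fit_cells(double box, double cutoff)
{
  CellFit cf;
  double s;

  assert(cutoff > 0.0 && box >= cutoff);
  cf.ncell  = (int)(box / cutoff);   /* round down; values are positive */
  cf.edge   = box / cf.ncell;
  s         = cf.edge / cutoff;
  cf.excess = s * s * s;
  return cf;
}
```

With a 9 A cutoff, a 36 A box gives four subdomains of exactly 9 A and
an excess of 1.0, while a 40 A box gives four subdomains of 10 A and an
excess of about 1.37, meaning noticeably more candidate pairs to test
for the same number of pairs inside the cutoff.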

Enough of optimization, now to make mdgx do original things.

Dave

Received on Sun Oct 23 2011 - 11:30:02 PDT