Re: [AMBER-Developers] Grouping calculations together... is that what makes things go fast?

From: Scott Brozell <>
Date: Sun, 23 Oct 2011 15:47:30 -0400


On Sun, Oct 23, 2011 at 02:11:12PM -0400, wrote:
> Many thanks to all who have responded here; I spent a little more time
> yesterday trying to get all this figured out. The caching of r2 values
> and displacements for successful pairs that Bob introduced is now
> implemented; surprisingly, it doesn't always beat my earlier
> implementation, despite all the branching the old code did! However, it
> does make the code somewhat more understandable, and the new
> implementation supports what I want to do in other ways, so I will
> keep this one.
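For readers following along, the two-pass pattern described above might look roughly like this in C. This is a minimal sketch with invented names, not the actual mdgx code: one pass filters candidate pairs against the cutoff and caches the displacements and r2, so the later interaction loops can run over the survivors branch-free.

```c
#include <assert.h>

/* Hypothetical sketch of "cache r2 and displacements for successful
   pairs": the filtering pass keeps the only branch; downstream loops
   consume the cached survivors without testing the cutoff again. */
typedef struct {
  double dx, dy, dz, r2;
} PairCache;

/* Filter n candidate pairs against cut2; returns the number cached. */
int cache_pairs(const double *xi, const double *yi, const double *zi,
                const double *xj, const double *yj, const double *zj,
                int n, double cut2, PairCache *out)
{
  int ncache = 0;
  for (int k = 0; k < n; k++) {
    double dx = xj[k] - xi[k];
    double dy = yj[k] - yi[k];
    double dz = zj[k] - zi[k];
    double r2 = dx*dx + dy*dy + dz*dz;
    if (r2 < cut2) {                 /* the only branch in this pass */
      out[ncache].dx = dx;
      out[ncache].dy = dy;
      out[ncache].dz = dz;
      out[ncache].r2 = r2;
      ncache++;
    }
  }
  return ncache;
}
```

The payoff is that the expensive interaction code sees a dense, predictable stream of work items instead of a branchy scan.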

The lack of impact of branching on performance is probably evidence
that the x86 architectures have made substantial improvements in
out-of-order execution, branch prediction, pipeline stage reduction,
etc.

> I STILL had to manually replicate a lot of code to make this stuff go
> faster--I got about a 10% speedup by quad-stepping the loop in the r2
> calculation, and then some more by double-stepping the loops for various
> types of interactions. The flow of information is readable, but it invites
> a familiar complaint: "why did they go and manually unroll 1000 lines of
> code?" Double-stepping the r2 pre-loop doesn't make things look too bad;
> someone might want to give that a shot in pmemd.
> One thing that didn't help at all was SSE... currently "make install"
> on mdgx uses only -ip -O3. I tried adding -xW -w -tpp7 -SSE to those
> flags and saw zero change. -vec-report output from the Intel compiler
> shows vectorization happening in some places, but none at all in the
> nonbonded routines where I've been working lately.
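The quad-stepping described above is, in spirit, something like the following. This is a hypothetical sketch; the names and data layout are invented, not mdgx's actual routines. Unrolling four-wide exposes four independent dependency chains per iteration and gives the compiler an easier vectorization target.

```c
#include <assert.h>

/* Hypothetical 4x-unrolled ("quad-stepped") r2 pre-loop over
   precomputed displacement components, with a scalar remainder. */
void r2_quad(const double *dx, const double *dy, const double *dz,
             double *r2, int n)
{
  int k = 0;
  for (; k + 4 <= n; k += 4) {
    r2[k]   = dx[k]*dx[k]     + dy[k]*dy[k]     + dz[k]*dz[k];
    r2[k+1] = dx[k+1]*dx[k+1] + dy[k+1]*dy[k+1] + dz[k+1]*dz[k+1];
    r2[k+2] = dx[k+2]*dx[k+2] + dy[k+2]*dy[k+2] + dz[k+2]*dz[k+2];
    r2[k+3] = dx[k+3]*dx[k+3] + dy[k+3]*dy[k+3] + dz[k+3]*dz[k+3];
  }
  for (; k < n; k++)          /* remainder loop for n not divisible by 4 */
    r2[k] = dx[k]*dx[k] + dy[k]*dy[k] + dz[k]*dz[k];
}
```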

Your compiler options are dated. For example, the ifort 10.0 man page
says that -tpp7 should be -mtune=pentium4; a quick Google finds
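As a hedged illustration only (the flag names and the file name are assumptions and should be checked against your compiler's own man page), the dated flags might be replaced along these lines:

```shell
# -xHost       : target the instruction set of the build host
#                (roughly replaces the old -xW / -tpp7 pair)
# -vec-report2 : report which loops vectorized and why others did not
icc -O3 -ip -xHost -vec-report2 -c mdgx_nonbond.c
```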

BTW, what platform(s) are you testing on?


> Finally, I tried the JAC benchmark; the typical serial PMEMD installation
> gets 1300 seconds, the serial mdgx 1646. So, PMEMD is still about 30%
> faster on something like that. The difference would be more pronounced
> if the time step were 1 fs, since PMEMD pairlist rebuilding would then
> happen less frequently while the time step size would not affect mdgx
> at all, and also if the serial PMEMD were compiled with fftw, which
> would give it another 5-7% speed advantage. It seems that my on-the-fly
> pairlist building scheme is still sensitive to cases where the
> subdomains do not tightly match the size of the cutoff; I tried some
> cubic water boxes (TIP3P and TIP4P) sized so that the subdomains were
> almost exactly the length of the cutoff, and the latest mdgx is
> neck-and-neck with pmemd in those cases.
> Enough of optimization--now to make mdgx do original things.

AMBER-Developers mailing list
Received on Sun Oct 23 2011 - 13:00:03 PDT