Re: [AMBER-Developers] Anyone use restricted pointers in C?

From: <dcerutti.rci.rutgers.edu>
Date: Mon, 29 Aug 2011 21:35:30 -0400 (EDT)

Yes, thanks for the feedback. I have talked at length with Bob on some of
these issues, and while I have not done a direct profiling of PMEMD and
mdgx, I did compare the codes early in the mdgx development. A lot of the
time-consuming routines have not changed significantly; mostly I've added
support for extra points and taken on the extra burden of doing the complete
direct space sum in a way that will be efficient if each sub-block of
the domain decomposition occurs on a separate processor. The
optimizations I made today were largely in mitigating the overhead that
the more stringent domain decomposition imposed, but I also found some
places in the particle<-->mesh routines that could be optimized. I think
I've improved that part of the code by 20-30%, which is good because in
the future I am thinking of adding a feature whereby all charges are
Gaussians of a constant width which map directly to the mesh. This shoves
the entire electrostatics calculation into reciprocal space and allows
elaborate virtual site constructions without adding significantly to the
computational cost.
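
To give a rough idea of what that would look like: a constant-width Gaussian
can be spread onto the mesh by directly evaluating its density at the nearby
grid points. The sketch below is only an illustration (made-up names, a
simple cubic non-periodic mesh, a 4-sigma spreading cutoff), not actual mdgx
code:

#include <math.h>

/* Spread a charge q at position crd as a normalized Gaussian of width
 * sigma onto a cubic mesh of ng x ng x ng points with spacing h.  The
 * density is accumulated into rho; grid indices outside the mesh are
 * clamped rather than wrapped (no periodic boundaries in this sketch). */
void SpreadGaussianCharge(double q, const double crd[3], double sigma,
                          double h, int ng, double *rho)
{
  const double pi = 3.14159265358979323846;
  const double anorm = q / pow(2.0*pi*sigma*sigma, 1.5);
  const double cut = 4.0*sigma;
  int imin[3], imax[3], i, j, k, m;

  for (m = 0; m < 3; m++) {
    imin[m] = (int)ceil((crd[m] - cut)/h);
    imax[m] = (int)floor((crd[m] + cut)/h);
    if (imin[m] < 0) imin[m] = 0;
    if (imax[m] > ng-1) imax[m] = ng-1;
  }
  for (i = imin[0]; i <= imax[0]; i++) {
    double dx = i*h - crd[0];
    for (j = imin[1]; j <= imax[1]; j++) {
      double dy = j*h - crd[1];
      for (k = imin[2]; k <= imax[2]; k++) {
        double dz = k*h - crd[2];
        double r2 = dx*dx + dy*dy + dz*dz;
        rho[(i*ng + j)*ng + k] += anorm*exp(-0.5*r2/(sigma*sigma));
      }
    }
  }
}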

But, as for your point about profiling the codes head-to-head, I think
that, since the functions are different, only the equivalent parts of the
code can really be compared. As I said, I did this early in the
development and the result was that, except in places where I know I was
doing things radically differently from pmemd, mdgx was uniformly 25-30%
slower. In particular, even my particle-mesh routines (before today's
optimizations) were lagging pmemd by 25-30% despite the fact that the
algorithms were similar (and even then I had counted the number of
operations and found sneaky ways of streamlining things in mdgx). One
thing that helped a little (though not enough to totally bridge the gap)
was performing two direct-space interactions using different local
variables in each iteration of the inner loop (the final iteration taking
the last interaction if there was an odd number of them). The speedup was
about 5-7% overall, but it still wasn't beating pmemd in the equivalent
direct space computations and I removed it in favor of simpler code. I
save a little time on pairlist generation, and trade it back when I have
to compute a few more atom:atom distances than pmemd. In the long run my
efficiency is the same at any time step, whereas pmemd drops some
efficiency as the time step increases. Other than that, there's not much
I've been able to do that'll beat what Bob did.
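
For anyone curious, here is roughly what the two-interactions-per-iteration
arrangement I mentioned above (and later took out) looked like, written for
a bare Coulomb loop over a pair list. The flat coordinate/charge/force
arrays and the names are just for illustration, not the actual mdgx data
structures, and units and cutoff handling are omitted:

#include <math.h>

/* Accumulate bare Coulomb forces for npair pairs, two pairs per loop
 * iteration with independent local variables so the two interactions can
 * overlap in the pipeline.  crd and frc hold x,y,z triplets per atom. */
void CoulombPairs(const double *crd, const double *q, double *frc,
                  const int *pairi, const int *pairj, int npair)
{
  int m;

  for (m = 0; m < npair - 1; m += 2) {
    int ia = 3*pairi[m],   ja = 3*pairj[m];
    int ib = 3*pairi[m+1], jb = 3*pairj[m+1];
    double dxa = crd[ja]   - crd[ia],   dya = crd[ja+1] - crd[ia+1],
           dza = crd[ja+2] - crd[ia+2];
    double dxb = crd[jb]   - crd[ib],   dyb = crd[jb+1] - crd[ib+1],
           dzb = crd[jb+2] - crd[ib+2];
    double r2a = dxa*dxa + dya*dya + dza*dza;
    double r2b = dxb*dxb + dyb*dyb + dzb*dzb;
    double fa = q[pairi[m]]*q[pairj[m]]/(r2a*sqrt(r2a));
    double fb = q[pairi[m+1]]*q[pairj[m+1]]/(r2b*sqrt(r2b));
    frc[ia] -= fa*dxa;  frc[ia+1] -= fa*dya;  frc[ia+2] -= fa*dza;
    frc[ja] += fa*dxa;  frc[ja+1] += fa*dya;  frc[ja+2] += fa*dza;
    frc[ib] -= fb*dxb;  frc[ib+1] -= fb*dyb;  frc[ib+2] -= fb*dzb;
    frc[jb] += fb*dxb;  frc[jb+1] += fb*dyb;  frc[jb+2] += fb*dzb;
  }

  /* The final iteration takes the last interaction if npair is odd. */
  if (m < npair) {
    int ia = 3*pairi[m], ja = 3*pairj[m];
    double dxa = crd[ja]   - crd[ia],   dya = crd[ja+1] - crd[ia+1],
           dza = crd[ja+2] - crd[ia+2];
    double r2a = dxa*dxa + dya*dya + dza*dza;
    double fa = q[pairi[m]]*q[pairj[m]]/(r2a*sqrt(r2a));
    frc[ia] -= fa*dxa;  frc[ia+1] -= fa*dya;  frc[ia+2] -= fa*dza;
    frc[ja] += fa*dxa;  frc[ja+1] += fa*dya;  frc[ja+2] += fa*dza;
  }
}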

I've tried to set up the memory in smart ways. I expect that the spatial
domain decomposition itself ensures that local regions of memory get
worked on one at a time. I also designed the "atom" structs that feed
into the direct space loop to take up 64 bytes of memory each: three doubles
for atom coordinates, three doubles for accumulating forces, one double
for charge, one integer for indexing the atom into an authoritative
topology, and one integer for its Lennard-Jones type (the charge and type
are stored in the struct so they're readily accessible). So, if you think that this level
of memory management is not enough, and the only way to match what Bob did
is a lot of trial and error, that's not for me. I've put in a day's work
optimizing the code after a few months of heavy development, which is not
a bad investment while the cluster is offline. But, I don't plan on much
more unless someone can just glance through the code and offer a pearl of
wisdom on something general that I'm missing.
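
For concreteness, here is essentially how that 64-byte atom struct is laid
out (the field names here are just my shorthand, not necessarily the
identifiers in the code):

typedef struct {
  double x, y, z;      /* coordinates                           (24 bytes) */
  double fx, fy, fz;   /* force accumulators                    (24 bytes) */
  double q;            /* charge, kept local for quick access    (8 bytes) */
  int id;              /* index into the authoritative topology  (4 bytes) */
  int ljt;             /* Lennard-Jones type                     (4 bytes) */
} atomc;               /* 7 doubles + 2 ints = 64 bytes with no padding,
                        * i.e. one struct per cache line on typical hardware */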

Dave

> Hi,
>
> Unsurprising results regarding restrict.
> I'm relatively clueless about exactly what you are doing, but :)
> Isn't the likely difference due to the use of the memory hierarchy?
> Bob spent a lot of time on cache performance tuning.
> Even before pmemd, differences between charmm and sander
> had a lot to do with the different memory layouts (although
> sander was doing some more math which may have explained its
> slower performance on some hardware).
>
> An interesting experiment would be to profile, using the
> hardware counters, the two codes on the same machine with
> the same compiler family. This should in principle classify
> the source(s) of the performance differences. However, going
> from that to realized code improvement could be a long project.
>
> scott


_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Mon Aug 29 2011 - 19:00:02 PDT