Re: [AMBER-Developers] Anyone use restricted pointers in C?

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Mon, 29 Aug 2011 23:00:00 -0400

Hi,

Ok, I agree that you've tried to lay out the data structures in
memory in smart ways (although they could still be suboptimal).

Yes, the experiment I suggest must be a head-to-head comparison of
equivalent code only. Note that hardware counter profiling gives
information on L1 cache misses, prefetching, issued loads, branch
prediction, etc. Thus, it is much different from gprof profiling.
The idea would be to compare counter outputs, find a counter that is
significantly different between the codes, and read the codes to look for
an obvious cause. Getting up to speed on hardware profiling
is already more than a day's work. And in practice this simple
scenario may not materialize; e.g., for charmm and amber I did this,
but there were too many differences to untangle a simple cause
and effect. Sadly, I don't have time to volunteer.
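
For what it's worth, wrapping a single routine takes only a few lines
once a counter library is installed; PAPI is just the one I happen to
know, and perf or a vendor tool reads the same counters. A bare-bones
sketch, with the loop of interest left as a placeholder:

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counts[3];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n");
            return 1;
        }
        PAPI_create_eventset(&evset);
        /* Not every chip can count all three at once; trim as needed. */
        PAPI_add_event(evset, PAPI_L1_DCM);   /* L1 data cache misses  */
        PAPI_add_event(evset, PAPI_LD_INS);   /* load instructions     */
        PAPI_add_event(evset, PAPI_BR_MSP);   /* branch mispredictions */

        PAPI_start(evset);
        /* ... the inner loop being compared goes here ... */
        PAPI_stop(evset, counts);

        printf("L1D misses    %lld\n", counts[0]);
        printf("loads issued  %lld\n", counts[1]);
        printf("branch misses %lld\n", counts[2]);
        return 0;
    }

Run the same wrapper around the equivalent loops in both codes and
diff the numbers.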

One minor point on restrict that you probably already know:
for a zeroth-order test to determine whether restrict could help,
be very generous with restrict (even at the expense of code correctness);
it is sometimes very easy to miss a dependency that is preventing
an optimization. Maybe there are tools available to help, but I'm
too stale to know.
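
For concreteness, here is a sketch of what I mean by generous: every
pointer argument qualified, whether or not the real code can justify
it. The function name and the arithmetic are made up, not from mdgx.

    /* C99: promise the compiler that none of these arrays alias.
     * For the zeroth-order test that promise may even be false; the
     * point is only to see whether the optimizer can use it.        */
    void scale_into_forces(double * restrict fx, double * restrict fy,
                           double * restrict fz,
                           const double * restrict crd,
                           const double * restrict q, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            fx[i] += q[i] * crd[3*i];      /* dummy arithmetic that    */
            fy[i] += q[i] * crd[3*i + 1];  /* stands in for a real     */
            fz[i] += q[i] * crd[3*i + 2];  /* kernel                   */
        }
    }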

Oh yeah, valgrind has a cache profiler (cachegrind). I haven't used it,
but valgrind is great software, so it might be a quick way to wade into
the hardware profiling waters.

scott

On Mon, Aug 29, 2011 at 09:35:30PM -0400, dcerutti.rci.rutgers.edu wrote:
> Yes, thanks for the feedback. I have talked at length with Bob on some of
> these issues, and while I have not done a direct profiling of PMEMD and
> mdgx, I did compare the codes early in the mdgx development. A lot of the
> time-consuming routines have not changed significantly; mostly I've added
> support for extra points and taken on the extra burden of doing the direct
> space sum completely in a way that will be efficient if each sub-block of
> the domain decomposition occurs on a separate processor. The
> optimizations I made today were largely in mitigating the overhead that
> the more stringent domain decomposition imposed, but I also found some
> places in the particle<-->mesh routines which were optimizable. I think
> I've improved that part of the code by 20-30%, which is good because in
> the future I am thinking of adding a feature whereby all charges are
> Gaussians of a constant width which map directly to the mesh. This shoves
> the entire electrostatics calculation into reciprocal space and allows
> elaborate virtual site constructions without adding significantly to the
> computational cost.
>
> But, as for your point about profiling the codes head-to-head, I think
> that since the functions are different, only the equivalent parts of the
> code can really be compared. As I said, I did this early in the
> development and the result was that, except in places where I know I was
> doing things radically differently from pmemd, mdgx was uniformly 25-30%
> slower. In particular, even my particle-mesh routines (before today's
> optimizations) were lagging pmemd by 25-30% despite the fact that the
> algorithms were similar (and even then I had counted the number of
> operations and found sneaky ways of streamlining things in mdgx). One
> thing that helped a little (though not enough to totally bridge the gap)
> was performing two direct-space interactions using different local
> variables in each iteration of the inner loop (the final iteration taking
> the last interaction if there was an odd number of them). The speedup was
> about 5-7% overall, but it still wasn't beating pmemd in the equivalent
> direct space computations, and I removed it in favor of simpler code. I
> save a little time on pairlist generation, and trade it back when I have
> to compute a few more atom:atom distances than pmemd. In the long run I
> have identical efficiencies with any time step, whereas pmemd drops some
> efficiency as the time step increases. Other than that, there's not much
> I've been able to do that'll beat what Bob did.
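
[If I read the two-at-a-time trick right, it amounts to something like
the sketch below: two interactions per pass through the inner loop,
each with its own locals, and the leftover pair handled after the loop.
The struct, the names, and the bare Coulomb force law are invented for
illustration, not taken from mdgx.]

    #include <math.h>

    /* Minimal stand-in for the per-atom record (names invented). */
    typedef struct { double x, y, z, fx, fy, fz, q; } patom;

    /* Bare Coulomb forces for np pairs listed in pr[][2]. */
    void direct_space(patom *at, const int (*pr)[2], int np)
    {
        int k;
        for (k = 0; k + 1 < np; k += 2) {
            /* Two interactions with disjoint sets of local variables. */
            patom *a0 = &at[pr[k][0]],   *b0 = &at[pr[k][1]];
            patom *a1 = &at[pr[k+1][0]], *b1 = &at[pr[k+1][1]];
            double dx0 = b0->x - a0->x, dy0 = b0->y - a0->y,
                   dz0 = b0->z - a0->z;
            double dx1 = b1->x - a1->x, dy1 = b1->y - a1->y,
                   dz1 = b1->z - a1->z;
            double r20 = dx0*dx0 + dy0*dy0 + dz0*dz0;
            double r21 = dx1*dx1 + dy1*dy1 + dz1*dz1;
            double f0 = a0->q * b0->q / (r20 * sqrt(r20));
            double f1 = a1->q * b1->q / (r21 * sqrt(r21));
            a0->fx -= f0*dx0;  a0->fy -= f0*dy0;  a0->fz -= f0*dz0;
            b0->fx += f0*dx0;  b0->fy += f0*dy0;  b0->fz += f0*dz0;
            a1->fx -= f1*dx1;  a1->fy -= f1*dy1;  a1->fz -= f1*dz1;
            b1->fx += f1*dx1;  b1->fy += f1*dy1;  b1->fz += f1*dz1;
        }
        if (k < np) {   /* odd interaction left over */
            patom *a = &at[pr[k][0]], *b = &at[pr[k][1]];
            double dx = b->x - a->x, dy = b->y - a->y, dz = b->z - a->z;
            double r2 = dx*dx + dy*dy + dz*dz;
            double f  = a->q * b->q / (r2 * sqrt(r2));
            a->fx -= f*dx;  a->fy -= f*dy;  a->fz -= f*dz;
            b->fx += f*dx;  b->fy += f*dy;  b->fz += f*dz;
        }
    }
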
>
> I've tried to set up the memory in smart ways. I expect that the spatial
> domain decomposition itself ensures that local regions of memory get
> worked on one at a time. I also designed the "atom" structs that feed
> into the direct space loop to take up 64 bytes of memory each: three doubles
> for atom coordinates, three doubles for accumulating forces, one double
> for charge (to make that readily accessible), one integer for indexing the
> atom into an authoritative topology, and one integer for its Lennard-Jones
> type (to make that readily accessible). So, if you think that this level
> of memory management is not enough, and the only way to match what Bob did
> is a lot of trial and error, that's not for me. I've put in a day's work
> optimizing the code after a few months of heavy development, which is not
> a bad investment while the cluster is offline. But, I don't plan on much
> more unless someone can just glance through the code and offer a pearl of
> wisdom on something general that I'm missing.
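
[For the record, that layout would look roughly like the struct below.
The field names are invented; only the sizes come from the description
above. Since 7*8 + 2*4 = 64 there is no padding, and one record fills
a single cache line on typical x86 hardware.]

    /* Hypothetical 64-byte direct-space atom record: coordinates,
     * force accumulators, charge, topology index, and LJ type.     */
    typedef struct {
        double crd[3];   /* coordinates                      24 bytes */
        double frc[3];   /* accumulated force                24 bytes */
        double q;        /* charge                            8 bytes */
        int    id;       /* index into the master topology    4 bytes */
        int    ljt;      /* Lennard-Jones type index          4 bytes */
    } datom;

    /* Compile-time check that padding has not pushed it past 64. */
    typedef char datom_is_64_bytes[(sizeof(datom) == 64) ? 1 : -1];
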
>
> Dave
>
> > Unsurprising results regarding restrict.
> > I'm relatively clueless about exactly what you are doing, but :)
> > Isn't the likely difference due to the use of the memory hierarchy?
> > Bob spent a lot of time on cache performance tuning.
> > Even before pmemd, differences between charmm and sander
> > had a lot to do with the different memory layouts (although
> > sander was doing some more math which may have explained its
> > slower performance on some hardware).
> >
> > An interesting experiment would be to profile, using the
> > hardware counters, the two codes on the same machine with
> > the same compiler family. This should in principle classify
> > the source(s) of the performance differences. However, going
> > from that to realized code improvement could be a long project.

_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Mon Aug 29 2011 - 20:30:02 PDT