Re: [AMBER-Developers] Grouping calculations together... is that what makes things go fast? from Ross Walker on 2011-10-20 (Amber Developers Archive Oct 2011)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 20 Oct 2011 10:29:32 -0700

Hi Dave,

> case didn't help. Is it the case that if you just have a long list of
> arithmetic operations *without conditionals* then a good compiler will
> just unroll the loop as it is able?

In my experience, yes. Intel and AMD went a long way with super long
pipelines and branch prediction that negated a lot of the issues with having
if statements in loops as long as those if statements were not deeply nested
and the code followed one path 90% or more of the time. Then the if
statement doesn't hurt too much. In reality though most of the performance
for floating point on current chips comes from the SSE registers. The chips
mostly look like vector chips of old (albeit without the memory bandwidth)
and so in my experience you can benefit a lot by writing vector code.
Sometimes it can help to block things up in terms of cache sizes allowing
the prefetch units to hide some of the memory latency. This can be a pain in
the butt though since you have to tune it for each different chip and cache
size. Ultimately, though, unless you want to avoid a lot of computational
effort by using things like lookup tables, the simplest approach is to just
write the code as a bunch of small, extremely simple loops.

Often it can help to put a required if statement, such as one needed for a
cutoff, in a 'pre-loop' that does nothing but check what passes the if
statement and pull the relevant data into a cache array. Then you have
vectored loops that essentially run over the entire cached array. You can
then use things like MKL to do vectored inverse square roots etc. to speed
things up, although it makes little difference if you use the Intel
compilers, since with the -x / -SSE options they just use their own internal
vector library (probably identical to MKL under the hood) to auto vectorise
your loops if they can.

> It would seem, then, that the real difference between mdgx and pmemd
> performance is that I should be grouping similar calculations into as
> many unbranched loops as I can and thereafter culling as many
> conditionals as possible--is that roughly correct?

Yes. Bob might be able to comment in more detail, but my suggestion would be
to put all the conditions you cannot possibly get rid of into a loop that
does no real computation and just fills 1D vector caches. Then do all the
computation in subsequent loops.

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

Received on Thu Oct 20 2011 - 10:30:03 PDT