Re: [AMBER-Developers] Grouping calculations together... is that what makes things go fast?

From: Scott Brozell <>
Date: Thu, 20 Oct 2011 17:16:36 -0400


Executive summary: Ross and I are in agreement!

On Thu, Oct 20, 2011 at 10:29:32AM -0700, Ross Walker wrote:
> Hi Dave,
> > case didn't help. Is it the case that if you just have a long list of
> > arithmetic operations *without conditionals* then a good compiler will
> > just unroll the loop as it is able?
> In my experience, yes. Intel and AMD went a long way with super-long
> pipelines and branch prediction, which negated a lot of the issues with
> having if statements in loops, as long as those if statements were not
> deeply nested and the code followed one path 90% or more of the time.
> Then the if statement doesn't hurt too much. In reality, though, most of
> the floating-point performance on current chips comes from the SSE
> registers. The chips mostly look like the vector chips of old (albeit
> without the memory bandwidth), and so in my experience you can benefit a
> lot by writing vector code. Sometimes it can help to block things up in
> terms of cache sizes, allowing the prefetch units to hide some of the
> memory latency. This can be a pain in the butt, though, since you have to
> tune it for each different chip and cache size. Ultimately, though, the
> simplest approach, unless you want to try to avoid a lot of computational
> effort by using things like lookup tables, is to just write the code as a
> bunch of small, extremely simple loops.

I think it's fair to say now that all processors are both
superscalar, executing multiple instructions per cycle, and
superpipelined, having many stages in the execution of one instruction.
Maybe 4-issue superscalar with a dozen stages in each pipe,
although I don't follow this regularly and just Googled
to get these numbers.
This means that the primary source of the difference in opinion
between Ross and me has disappeared:
5-10 years ago the Pentium processors were superpipelined
and the MIPS processors were superscalar.
So one if-statement could wreak havoc on the Pentiums by disrupting
a very long 20+ stage pipe, but that same if-statement may have had no
measurable effect on a MIPS with its shorter ~6-stage pipes, better
instruction scheduling, software pipelining, etc.
All current processors use techniques, like out-of-order
execution, to hide disruptions in the pipes.

> Often it can help to put a required if statement, such as one needed for
> a cutoff, in a 'pre-loop' that does nothing but check what passes the if
> statement and pull the relevant data into a cache array. Then you have
> vectored loops that essentially run over the entire cached array. You can
> then use things like MKL to do vectored inverse square roots, etc., to
> speed things up. Although it makes little difference if you use the Intel
> compilers, since with the -x / -SSE options they just use their own
> internal vector library (probably identical to MKL under the hood) to
> auto-vectorise your loops if they can.

> > It would seem, then, that the real difference between mdgx and pmemd
> > performance is that I should be grouping similar calculations into as
> > many
> > unbranched loops as I can and thereafter culling as many conditionals
> > as
> > possible--is that roughly correct?

Yes. If my analysis above is correct, then current processors should
be less sensitive to if-statements than Pentium 4s and more able
to crunch data in parallel. So, as Ross points out above,
floating-point performance is key: keeping all those superscalar
(vector-like) pipes humming.


> Yes. Bob might be able to comment in more detail, but my suggestion would
> be to put all the conditions you cannot possibly get rid of into a loop
> that does no real computation; it just fills 1D vector caches. Then do
> all the computation in subsequent loops.

AMBER-Developers mailing list
Received on Thu Oct 20 2011 - 14:30:04 PDT