RE: amber-developers: amber performance from Yong Duan on 2007-02-28 (Amber Developers Archive Feb 2007)

From: Yong Duan <duan.ucdavis.edu>
Date: Wed, 28 Feb 2007 20:39:32 -0800

If you look at the low-level communication lib vs standard MPI (table 2 of
their paper), the difference is not that much. I also think the sp and dp do
not make too much difference on Opteron (which is a 64-bit native machine).

But they had a new approach on 3D FFT. They did not do the usual "slab" 3D
FFT. Instead, they simply did the 3D FFT in place (page 6). With the spatial
decomposition, they only need a small number of comms for each FFT. As for the
spatial decomposition, they did not "exchange" the particles every step.
Rather, only every four steps (or so). All these may be helpful to Bob and
Mike.

But I still don't see how this could make such a huge difference.

yong

-----Original Message-----
From: owner-amber-developers.scripps.edu
[mailto:owner-amber-developers.scripps.edu] On Behalf Of Robert Duke
Sent: Wednesday, February 28, 2007 7:47 PM
To: amber-developers.scripps.edu
Subject: Re: amber-developers: amber performance

Probably one place where these guys get real traction is in going low level
on the interconnect under mpi. NAMD also did this on lemieux, and it is a
hard strategy to beat. Give me a low level network interface, and I can
outperform mpi every time, period. But then I have to do it for every piece
of hardware out there. So this is one major way desmond can fudge the
problem. Thing is, they then will run on their own proprietary stuff, it
would seem to me (or maybe they will target infiniband h/w; I may have seen
something on this, I don't remember). A lot of the claims these guys have
made don't completely add up to me. I think they are using fixed-point
arithmetic for some things to get around using double precision. In my hands
though,
you don't gain much by going to single precision fp or 4 byte scaled
integers - at the level of a compiler, dp is darn near as fast as sp these
days for ia32. I think they use some sort of fancy rounding scheme in
integers or bcd, claiming a gain - maybe with assembler, but I am skeptical
that it is really worth it. The real problem, to my mind, in dropping
precision much is that it becomes really hard to spot errors in the code. I
have noticed this with gromacs in the past. On cutting the data size in
communications, that really DOES NOT MATTER. It's the latency that kills
you on a good interconnect, not the data throughput, and latency is
invariant with data size. I would guess they have highly tuned specific
h/w, some assembler, dropped precision, maybe not much else (some of their
stuff like "neutral territory" is really hard to evaluate - we use what they
would call a "half clam shell", and one very nice thing is that I can get
good cache locality and reasonable spatial locality out of this). Now
another factor. These guys at times are not doing pme. They also have
their own gaussian split ewald pme variant when they are doing pme. I have
been hoping to get some time to look at that, and see what the performance
vs. accuracy issues are. We should, in my humble opinion, not completely
flip out over these guys, but keep an eye on them. They have to have some
serious talent - they have already collected 50 folks to do systems
development, and they are scouting everywhere (if they have not tried to
recruit you yet, give 'em time; me personally, you would have to kill me to
get me to move to Manhattan). But there are advantages to being small.
Give me a small number of folks who really know what they are doing, and
they can run rings around a large dev team. Fred Brooks said this a long
time ago. The communication nightmare gets bigger as you add people. What
really worries me is the resources. These guys have billions of dollars,
some fraction of which Shaw seems willing to let them play with. Hard to
beat on grants. One thing this means is they can go nuts on things like
fpga's, and if they recruit some serious low-level programming guys, I
expect they can build a custom machine that screams. Personally, I can't
imagine how they can keep that many people busy and spend that much money
just building an md system - they must be attempting much more, surely. I
need to look at their papers again, but last time I looked, I got the
impression they were not beating the tar out of us, and I got the same
impression from Ross. What we have to go for, it would seem to me, is being
the system that produces the highest quality results in near-minimum time
and the broadest range of truly useful functionality. And we let people see
our code (which has a downside when you are competing purely on speed, but
an upside when you are asking people to believe the numbers you spew out are
real).
Regards - Bob
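
[Editorial sketch, not part of the original message.] Bob's point that
latency rather than data size dominates on a good interconnect can be
illustrated with the usual linear cost model, time = latency + bytes /
bandwidth. The latency and bandwidth figures below are assumed round numbers
chosen for illustration, not measurements of any real network, and the
function name is hypothetical.

```python
def message_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model for one message: fixed latency plus transfer time."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Assumed, illustrative interconnect parameters (not measured values).
LATENCY_S = 5e-6        # 5 microseconds per message
BANDWIDTH_BPS = 1e9     # 1 GB/s

N_VALUES = 100          # a small message, e.g. 100 coordinates

dp_time = message_time(8 * N_VALUES, LATENCY_S, BANDWIDTH_BPS)  # 8-byte doubles
sp_time = message_time(4 * N_VALUES, LATENCY_S, BANDWIDTH_BPS)  # 4-byte singles

# Fraction of the double-precision time that remains after halving the payload.
ratio = sp_time / dp_time
print(f"dp: {dp_time:.2e} s, sp: {sp_time:.2e} s, sp/dp = {ratio:.2f}")
```

Under these assumptions, halving the payload saves only a few percent per
message because the fixed latency term dominates. For much larger messages the
bandwidth term takes over and the saving approaches a factor of two, so the
conclusion depends on the message sizes actually exchanged.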

----- Original Message -----
From: "Adrian Roitberg" <roitberg.qtp.ufl.edu>
To: <amber-developers.scripps.edu>
Sent: Wednesday, February 28, 2007 10:16 PM
Subject: Re: amber-developers: amber performance

> Yong Duan wrote:
>> I'd be more interested in their energy conservation trajectories but can't
>> find information. Neither could I find a particularly compelling novel
>> technique to enable their absurdly impressive performance which is about one
>> order of magnitude better than others.
>>
>>
>> yong
> I had a long chat with Istvan K at Sanibel about this (Desmond).
> Basically, we cannot expect to see the code for a while. In about a year
> they plan to release it as executable only, free for academics, to run
> under Schrodinger's Maestro free interface.
>
> Their claim about changes is related to the use of single precision,
> which cuts messaging in half, and the fact that they do not communicate
> too far and only send stuff to nearest neighbors. Please do not start
> commenting on these issues, I am just transmitting what he told me and have
> little or no real expertise on this.
>
> They also claim that they can do better arithmetic with single precision
> than others with double by being 'very careful'. I do not know what this
> means!
>
> One thing they claim really helped is mapping the coordinates WITHIN a
> single processor (or a unit cell ?) to a number between -1 and 1 (maybe 0
> and 1). They can later on trivially correct for this. It helps them in
> being able to use parts of the register for other stuff.
>
> They also wrote their own low level communication routines from scratch.
> This will kill portability of course.
>
> A recent paper by that group "A common, avoidable source of error in
> molecular dynamics integrators" in J. Chem. Phys. 126, 046101 (2007) might
> help a bit, but I have only glanced at it.
>
> a.
>
> --
> Dr. Adrian E. Roitberg
> Associate Professor
> Quantum Theory Project and Department of Chemistry
>
> University of Florida PHONE 352 392-6972
> P.O. Box 118435 FAX 352 392-8722
> Gainesville, FL 32611-8435 Email adrian.qtp.ufl.edu
>
> ============================================================================
>
> "To announce that there must be no criticism of the president,
> or that we are to stand by the president right or wrong,
> is not only unpatriotic and servile, but is morally treasonable
> to the American public."
> -- Theodore Roosevelt
>
>
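
[Editorial sketch, not part of the original messages.] The coordinate mapping
Adrian relays (scaling positions within a cell to a number between -1 and 1)
and the "4 byte scaled integers" Bob mentions can be illustrated as a
fixed-point encoding. This is only a guess at the general idea; the bit
layout, the function names, and the box size are all assumptions, and nothing
here is taken from Desmond itself.

```python
SCALE = 2 ** 31  # assumed: signed 32-bit word with 31 fractional bits

def to_fixed(x, half_box):
    """Map x in [-half_box, half_box) to a signed 32-bit scaled integer."""
    frac = x / half_box                  # scaled coordinate in [-1, 1)
    q = int(round(frac * SCALE))
    # Clamp to the representable int32 range.
    return max(-SCALE, min(SCALE - 1, q))

def from_fixed(q, half_box):
    """Undo the mapping: recover the coordinate in original length units."""
    return (q / SCALE) * half_box

half_box = 40.0          # hypothetical half box length, in Angstroms
x = 12.345678            # hypothetical coordinate
q = to_fixed(x, half_box)
x_back = from_fixed(q, half_box)

# Spacing between adjacent representable coordinates at this box size.
resolution = half_box / SCALE
print(f"stored integer: {q}, round-trip error: {abs(x_back - x):.3e}")
```

With these assumed parameters the resolution is uniform across the box, unlike
floating point, whose spacing grows with the magnitude of the value. Whether
this matches what the Desmond authors actually do is not established by the
thread.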
Received on Sun Mar 04 2007 - 06:07:30 PST