If you look at the low-level communication lib vs. standard MPI (Table 2 of
their paper), the difference is not that big. I also think sp vs. dp does
not make much difference on Opteron (which is a native 64-bit machine).
But they did take a new approach to the 3D FFT. They did not do the usual
"slab" 3D FFT. Instead, they simply did the 3D FFT in place (page 6). With
the spatial decomposition, they only need a small number of comms for each
FFT. As for the spatial decomposition, they did not "exchange" the particles
every step, only every four steps (or so). All of this may be helpful to Bob
and Mike.
But I still don't see how this could make such a huge difference.
yong
-----Original Message-----
From: owner-amber-developers.scripps.edu
[mailto:owner-amber-developers.scripps.edu] On Behalf Of Robert Duke
Sent: Wednesday, February 28, 2007 7:47 PM
To: amber-developers.scripps.edu
Subject: Re: amber-developers: amber performance
Probably one place where these guys get real traction is in going low level 
on the interconnect under mpi.  NAMD also did this on lemieux, and it is a 
hard strategy to beat.  Give me a low level network interface, and I can 
outperform mpi every time, period.  But then I have to do it for every piece
of hardware out there.  So this is one major way desmond can fudge the 
problem.  Thing is, they then will run on their own proprietary stuff, it 
would seem to me (or maybe they will target infiniband h/w; I may have seen 
something on this, I don't remember).  A lot of the claims these guys have 
made don't completely add up to me.  I think they are using fixed arithmetic
for some things to get around using double precision.  In my hands though, 
you don't gain much by going to single precision fp or 4 byte scaled 
integers - at the level of a compiler, dp is darn near as fast as sp these 
days for ia32.  I think they use some sort of fancy rounding scheme in 
integers or bcd, claiming a gain - maybe with assembler, but I am skeptical 
that it is really worth it.  The real problem, to my mind, in dropping 
precision much is that it becomes really hard to spot errors in the code.  I have
noticed this with gromacs in the past.  On cutting the data size in 
communications, that really DOES NOT MATTER.  It's the latency that kills 
you on a good interconnect, not the data throughput, and latency is 
invariant with data size.  I would guess they have highly tuned specific 
h/w, some assembler, dropped precision, maybe not much else (some of their 
stuff like "neutral territory" is really hard to evaluate - we use what they
would call a "half clam shell", and one very nice thing is that I can get 
good cache locality and reasonable spatial locality out of this).  Now 
another factor.  These guys at times are not doing pme.  They also have 
their own gaussian split ewald pme variant when they are doing pme.  I have 
been hoping to get some time to look at that, and see what the performance 
vs. accuracy issues are.  We should, in my humble opinion, not completely 
flip out over these guys, but keep an eye on them.  They have to have some 
serious talent - they have already collected 50 folks to do systems 
development, and they are scouting everywhere (if they have not tried to 
recruit you yet, give 'em time; me personally, you would have to kill me to 
get me to move to Manhattan).  But there are advantages to being small. 
Give me a small number of folks who really know what they are doing, and 
they can run rings around a large dev team.  Fred Brooks said this a long 
time ago.  The communication nightmare gets bigger as you add people.  What 
really worries me is the resources.  These guys have billions of dollars, 
some fraction of which Shaw seems willing to let them play with.  Hard to 
beat on grants.  One thing this means is they can go nuts on things like 
FPGAs, and if they recruit some serious low-level programming guys, I 
expect they can build a custom machine that screams.  Personally, I can't 
imagine how they can keep that many people busy and spend that much money 
just building an md system - they must be attempting much more, surely.  I 
need to look at their papers again, but last time I looked, I got the 
impression they were not beating the tar out of us, and I got the same 
impression from Ross.  What we have to go for, it would seem to me, is being
the system that produces the highest quality results in near-minimum time 
and offers the broadest range of truly useful functionality.  And we let people see
our code (which has a downside when you are competing purely on speed, but 
an upside when you are asking people to believe the numbers you spew out are
real).
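
The latency argument above can be made concrete with the usual
latency-plus-bandwidth cost model for a single message. This is only a
sketch: the 5 microsecond latency and 1 GB/s bandwidth are assumed round
numbers, not measurements of any interconnect discussed here, and the
function names are made up for illustration.

```python
# Illustrative latency/bandwidth cost model for one message.
# LATENCY and BANDWIDTH are assumed values, not measured ones.

LATENCY = 5e-6     # assumed fixed per-message cost, in seconds
BANDWIDTH = 1e9    # assumed sustained throughput, in bytes per second


def message_time(n_bytes, latency_s=LATENCY, bandwidth=BANDWIDTH):
    """Estimated time for one message: fixed latency plus transfer time."""
    return latency_s + n_bytes / bandwidth


def speedup_from_halving(n_bytes):
    """How much faster one message gets if its payload is cut in half."""
    return message_time(n_bytes) / message_time(n_bytes / 2)


if __name__ == "__main__":
    for n in (1_000, 10_000, 1_000_000):
        print(f"{n:>9} bytes: halving the payload gives "
              f"{speedup_from_halving(n):.2f}x")
```

Under these assumptions, halving a small message buys very little because
the fixed latency term dominates; only for large messages does the gain
approach 2x. Whether sp vs. dp matters therefore depends on how large the
per-step messages actually are.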
Regards - Bob
----- Original Message ----- 
From: "Adrian Roitberg" <roitberg.qtp.ufl.edu>
To: <amber-developers.scripps.edu>
Sent: Wednesday, February 28, 2007 10:16 PM
Subject: Re: amber-developers: amber performance
> Yong Duan wrote:
>> I'd be more interested in their energy conservation trajectories but can't
>> find information. Neither could I find a particularly compelling novel
>> technique to enable their absurdly impressive performance, which is about
>> one order of magnitude better than others.
>>
>>
>> yong
> I had a long chat with Istvan K at Sanibel about this (Desmond). 
> Basically, we cannot expect to see the code for a while. In about a year 
> they plan to release it as executable only, free for academics, to run 
> under Schrodinger's Maestro free interface.
>
> Their claim about changes is related to the use of single precision,
> which cuts messaging in half, and the fact that they do not communicate 
> too far and only send stuff to nearest neighbors. Please do not start 
> commenting on these issues; I am just transmitting what he told me and have
> little or no real expertise on this.
>
> They also claim that they can do better arithmetic with single precision 
> than others with double by being 'very careful'. I do not know what this 
> means!
>
> One thing they claim really helped is mapping the coordinates WITHIN a 
> single processor (or a unit cell?) to a number between -1 and 1 (maybe 0 
> and 1). They can later on trivially correct for this. It helps them in 
> being able to use parts of the register for other stuff.
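
One possible reading of the coordinate mapping described above is a
cell-relative fixed-point encoding. The sketch below is a guess at the idea
and not the scheme Desmond actually uses: the 32-bit width, the [-1, 1)
range, and the names encode and decode are all assumptions, and it does not
model using spare register bits for other data.

```python
# Sketch (assumed scheme): store a coordinate as a fraction of its cell,
# in [-1, 1), packed into a signed 32-bit fixed-point integer.

SCALE = 2 ** 31  # fractions in [-1, 1) span the signed 32-bit range


def encode(x, cell_center, cell_half_width):
    """Map an absolute coordinate to a signed 32-bit fixed-point fraction."""
    frac = (x - cell_center) / cell_half_width
    if not -1.0 <= frac < 1.0:
        raise ValueError("coordinate lies outside its cell")
    q = int(round(frac * SCALE))
    # Clamp so rounding near +1 cannot overflow the 32-bit range.
    return max(-SCALE, min(SCALE - 1, q))


def decode(q, cell_center, cell_half_width):
    """Undo the mapping: recover the absolute coordinate from the integer."""
    return cell_center + (q / SCALE) * cell_half_width


if __name__ == "__main__":
    center, half_width = 10.0, 5.0  # hypothetical cell, arbitrary units
    x = 12.3456789
    q = encode(x, center, half_width)
    print(q, decode(q, center, half_width))
```

Because the range is tied to the cell rather than to the whole box, the
resolution is uniform inside the cell (half_width / 2**31 here), and the
correction back to absolute coordinates is a single multiply and add.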
>
> They also wrote their own low level communication routines from scratch. 
> This will kill portability of course.
>
> A recent paper by that group "A common, avoidable source of error in 
> molecular dynamics integrators" in J. Chem. Phys. 126, 046101 (2007) might
> help a bit, but I have only glanced at it.
>
> a.
>
> -- 
>                            Dr. Adrian E. Roitberg
>                              Associate Professor
>               Quantum Theory Project and Department of Chemistry
>
> University of Florida                         PHONE 352 392-6972
> P.O. Box 118435                               FAX   352 392-8722
> Gainesville, FL 32611-8435                    Email adrian.qtp.ufl.edu
>
Received on Sun Mar 04 2007 - 06:07:30 PST