Re: amber-developers: amber performance

From: Robert Duke <>
Date: Thu, 1 Mar 2007 09:58:13 -0500

Hi Guys,
The subtitle for this email should probably be "let's not panic just
yet...". I pulled down the desmond 2007 paper again (thanks Ross, it was
lost on my desktop, and I have so many paper copies of stuff everywhere I
can't find anything) and took a hard look at the numbers for dhfr, which is
the JAC benchmark. I did not do a detailed read of the entire paper, just
parts; I'll read it all when I get a chance, but these guys are not
necessarily onto anything that is going to eat our lunch, in my opinion.
Their biggest advantage is they have enough people to both write software
and advertisements (journal articles, that would be). Okay, so we KNOW,
especially for something as small as JAC, that I have to fix the fft slab
distribution. We also know that they are playing fast and loose with
precision issues. I NEED you guys to be working very hard on validating
forcefields and proving that our methods work better than other methods.
Taking a very quick eyeball look at temperature drift "without constraints"
(I presume this means nve effectively), they drift temperature a whopping
2.0 degrees K per nsec (namd drifts 2.8 K per nsec). I was surprised by
this (I am looking at table 5), and maybe I don't understand something.
However, in testing out my cutoff methods, I did a bunch of NVE energy drift
checks. I did checks on a dna 12mer in a 70 angstrom box. For pme, I had a
0.21 degree drift in 5 nsec, or 0.04 degrees K per nsec. For one of my
earlier smooth cutoff methods (i.e., not even my best method, based on
several metrics - but I unfortunately don't have T drift calculated on the
best method yet), I had the following temp drifts at the indicated cutoffs
in 5 nsec:

cutoff (angstroms)   T drift in 5 nsec (K), "no constraints" (means NVE to me)
16                   0.27
14                   0.28
12                   0.29
10                   0.31
pme                  0.21 (for reference)
desmond              10.0
namd                 14.0
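
A quick check of the per-nsec rates implied by the table above, as a minimal Python sketch. It only divides each 5 nsec total by the run length; the names `DRIFT_5NS` and `drift_per_nsec` are illustrative and do not come from any Amber tool.

```python
# Total temperature drift (K) over 5 nsec, copied from the table above.
DRIFT_5NS = {
    "cutoff 16": 0.27,
    "cutoff 14": 0.28,
    "cutoff 12": 0.29,
    "cutoff 10": 0.31,
    "pme": 0.21,
    "desmond": 10.0,
    "namd": 14.0,
}

RUN_NSEC = 5.0  # length of each run in nsec, as stated in the text


def drift_per_nsec(total_drift_k, run_nsec=RUN_NSEC):
    """Average drift rate in K per nsec over the whole run."""
    return total_drift_k / run_nsec


# Per-nsec rate for every row of the table.
rates = {name: drift_per_nsec(total) for name, total in DRIFT_5NS.items()}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name:10s} {rate:6.3f} K/nsec")
```

With these inputs the pme run comes out near 0.04 K per nsec, and desmond and namd at 2.0 and 2.8 K per nsec, matching the figures quoted in the text.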

My system is smaller than theirs, but this is a huge difference; they
compare to namd throughout for everything including performance of course
but also quality of numbers.

Okay, what about the performance data? I was wrong in my previous email in
fretting about special h/w. This was influenced by the 2005 paper which I
did read more thoroughly. They are instead now targeting 2.4 GHz opterons,
a really good infiniband switch, mvapich (or lower level - lower level DOES
help them at high processor count). I have comparison numbers for a 2.2 GHz
dual opteron system, infiniband, mvapich. I don't have relative switch/nic
speed data, but with billions to spend, I am sure they have good switches
and nics. So let's look at dhfr (JAC), adjust for the processor speed, and
compare the two opteron systems with pmemd 9 vs desmond, bearing in mind
that we know I need to fix the 2d fft slab distribution which impacts me on
something like JAC above 64 procs. I also show some sp5 numbers just for
fun, since I can scale higher there (better interconnect, and there is no
reason to believe that their infiniband implementation might be nearly as
hot). Here, "adj opteron" means I adjusted for the cpu speed differential
(desmond on 2.4 GHz opteron, pmemd was on 2.2 GHz opteron).

#proc   desmond      pmemd msec/step,   pmemd msec/step,
        msec/step    adj opteron        sp5

   8    42.9         43.5               44.3
  16    21.9         24.6               23.4
  32    12.1         13.0               12.2
  64     6.9          7.9                6.9
 128     4.4          6.6                4.6
 256     2.6          nd                 3.9
 512     2.3          nd                 nd
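
To put the table above on a common footing, the following Python sketch computes two derived quantities: the ratio of each pmemd timing to the desmond timing at the same processor count, and the parallel efficiency of a column relative to its 8-processor entry. All helper names are illustrative. The `clock_adjust` function assumes the "adj opteron" correction is a plain linear scaling by clock frequency (2.2/2.4); the message does not state the exact formula, so that part is an assumption.

```python
# msec/step from the table above, keyed by processor count.
# Tuple order: (desmond, pmemd adj opteron, pmemd sp5); None marks "nd".
TIMINGS = {
    8: (42.9, 43.5, 44.3),
    16: (21.9, 24.6, 23.4),
    32: (12.1, 13.0, 12.2),
    64: (6.9, 7.9, 6.9),
    128: (4.4, 6.6, 4.6),
    256: (2.6, None, 3.9),
    512: (2.3, None, None),
}
DESMOND, PMEMD_OPTERON, PMEMD_SP5 = 0, 1, 2


def clock_adjust(t_msec, from_ghz=2.2, to_ghz=2.4):
    """Estimate the time per step at to_ghz from a timing at from_ghz.

    ASSUMPTION: time per step scales inversely with clock frequency.
    """
    return t_msec * from_ghz / to_ghz


def ratio_to_desmond(column, nproc):
    """Timing in the given column divided by the desmond timing."""
    t = TIMINGS[nproc][column]
    if t is None:
        return None
    return t / TIMINGS[nproc][DESMOND]


def efficiency(column, nproc, base=8):
    """Parallel efficiency relative to the base processor count."""
    t_base = TIMINGS[base][column]
    t = TIMINGS[nproc][column]
    if t is None or t_base is None:
        return None
    return (t_base * base) / (t * nproc)


def fmt(value):
    """Format a ratio, keeping the table's 'nd' for missing data."""
    return "nd" if value is None else f"{value:.2f}"


if __name__ == "__main__":
    print("#proc  opteron/desmond  sp5/desmond  desmond eff.")
    for nproc in sorted(TIMINGS):
        print(
            f"{nproc:5d}  "
            f"{fmt(ratio_to_desmond(PMEMD_OPTERON, nproc)):>15s}  "
            f"{fmt(ratio_to_desmond(PMEMD_SP5, nproc)):>11s}  "
            f"{fmt(efficiency(DESMOND, nproc)):>12s}"
        )
```

On the numbers in the table, the sp5 timing equals the desmond timing at 64 processors, while the adjusted opteron timing is 1.5 times the desmond timing at 128 processors, which is where the fft slab distribution issue mentioned above is expected to show up.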

SO, they are NOT just tromping us from top to bottom on JAC, which remains
the best apples to apples comparison available (and we know we handle large
systems better than small systems, so JAC is a perfectly good hard scaling
case). I am also certain that my fft work will make a huge difference in
those last two columns. One thing at a time.

I am not panicked by these guys, and I have never been accused of being an
optimist. Never. My wife sometimes calls me "Mr. Little" (and yes, darnit,
the sky IS falling, and that should be "Dr. Little"). I need you guys to
get busy and show that the quality of our results matters. I'll fix the
performance. In the meantime, some acid tests of good results would be
handy for my cutoff-related work too, but by now I presume everyone realizes
I am not doing this to get rid of pme, but to better understand
electrostatics issues, have more options, perhaps apply to GB (someone
suggested that), etc.

Best Regards - Bob

----- Original Message -----
From: "Robert Duke" <>
To: <>
Sent: Wednesday, February 28, 2007 10:46 PM
Subject: Re: amber-developers: amber performance

> Probably one place where these guys get real traction is in going low
> level on the interconnect under mpi. NAMD also did this on lemieux, and
> it is a hard strategy to beat. Give me a low level network interface, and
> I can outperform mpi every time, period. But then I have to do it for
> every piece of hardware out there. So this is one major way desmond can
> fudge the problem. Thing is, they then will run on their own proprietary
> stuff, it would seem to me (or maybe they will target infiniband h/w; I
> may have seen something on this, I don't remember). A lot of the claims
> these guys have made don't completely add up to me. I think they are
> using fixed arithmetic for some things to get around using double
> precision. In my hands though, you don't gain much by going to single
> precision fp or 4 byte scaled integers - at the level of a compiler, dp is
> darn near as fast as sp these days for ia32. I think they use some sort
> of fancy rounding scheme in integers or bcd, claiming a gain - maybe with
> assembler, but I am skeptical that it is really worth it. The real
> problem, to my mind, in dropping precision much is that it becomes really hard
> to spot errors in the code. I have noticed this with gromacs in the past.
> On cutting the data size in communications, that really DOES NOT MATTER.
> It's the latency that kills you on a good interconnect, not the data
> throughput, and latency is invariant with data size. I would guess they
> have highly tuned specific h/w, some assembler, dropped precision, maybe
> not much else (some of their stuff like "neutral territory" is really hard
> to evaluate - we use what they would call a "half clam shell", and one
> very nice thing is that I can get good cache locality and reasonable
> spatial locality out of this). Now another factor. These guys at times
> are not doing pme. They also have their own gaussian split ewald pme
> variant when they are doing pme. I have been hoping to get some time to
> look at that, and see what the performance vs. accuracy issues are. We
> should, in my humble opinion, not completely flip out over these guys, but
> keep an eye on them. They have to have some serious talent - they have
> already collected 50 folks to do systems development, and they are
> scouting everywhere (if they have not tried to recruit you yet, give 'em
> time; me personally, you would have to kill me to get me to move to
> Manhattan). But there are advantages to being small. Give me a small
> number of folks who really know what they are doing, and they can run
> rings around a large dev team. Fred Brooks said this a long time ago.
> The communication nightmare gets bigger as you add people. What really
> worries me is the resources. These guys have billions of dollars, some
> fraction of which Shaw seems willing to let them play with. Hard to beat
> on grants. One thing this means is they can go nuts on things like
> fpga's, and if they recruit some serious low-level programming guys, I
> expect they can build a custom machine that screams. Personally, I can't
> imagine how they can keep that many people busy and spend that much money
> just building an md system - they must be attempting much more, surely. I
> need to look at their papers again, but last time I looked, I got the
> impression they were not beating the tar out of us, and I got the same
> impression from Ross. What we have to go for, it would seem to me, is
> being the system that produces the highest quality results in near-minimum
> time and the broadest range of truly useful functionality. And we let
> people see our code (which has a downside when you are competing purely on
> speed, but an upside when you are asking people to believe the numbers you
> spew out are real).
> Regards - Bob
> ----- Original Message -----
> From: "Adrian Roitberg" <>
> To: <>
> Sent: Wednesday, February 28, 2007 10:16 PM
> Subject: Re: amber-developers: amber performance
>> Yong Duan wrote:
>>> I'd be more interested in their energy conservation trajectories but can't
>>> find information. Neither could I find a particularly compelling novel
>>> technique to enable their absurdly impressive performance which is about one
>>> order of magnitude better than others.
>>> yong
>> I had a long chat with Istvan K at Sanibel about this (Desmond).
>> Basically, we cannot expect to see the code for a while. In about a year
>> they plan to release it as executable only, free for academics, to run
>> under Schrodinger's Maestro free interface.
>> Their claim about changes is related to the use of single precision,
>> which drops messaging in half, and the fact that they do not communicate
>> too far and only send stuff to nearest neighbors. Please do not start
>> commenting on these issues, I am just transmitting what he told me and
>> have little or no real expertise on this.
>> They also claim that they can do better arithmetic with single precision
>> than others with double by being 'very careful'. I do not know what this
>> means !
>> One thing they claim really helped is mapping the coordinates WITHIN a
>> single processor (or a unit cell ?) to a number between -1 and 1 (maybe 0
>> and 1). They can later on trivially correct for this. It helps them in
>> being able to use parts of the register for other stuff.
>> They also wrote their own low level communication routines from scratch.
>> This will kill portability of course.
>> A recent paper by that group "A common, avoidable source of error in
>> molecular dynamics integrators" in J. Chem. Phys. 126, 046101 (2007)
>> might help a bit, but I have only glanced at it.
>> a.
>> --
>> Dr. Adrian E. Roitberg
>> Associate Professor
>> Quantum Theory Project and Department of Chemistry
>> University of Florida PHONE 352 392-6972
>> P.O. Box 118435 FAX 352 392-8722
>> Gainesville, FL 32611-8435 Email
>> ============================================================================
>> "To announce that there must be no criticism of the president,
>> or that we are to stand by the president right or wrong,
>> is not only unpatriotic and servile, but is morally treasonable
>> to the American public."
>> -- Theodore Roosevelt
Received on Sun Mar 04 2007 - 06:07:38 PST