amber-developers: NSF Petascale RFP - Some amusing points.

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sun, 18 Jun 2006 21:22:37 -0700

Hi All,

I thought some of you might find part of NSF's latest RFP for a $300 million
petaflop machine to be somewhat amusing... As part of the RFP there is a
section listing 3 specific simulations for which vendors must provide
performance estimates for their proposed petaflop machines. This
section also gives specific targets for each of the 3 simulations that the NSF
expects the machines to be able to run. It says nothing about modifying the
codes specifically for the new machines. Anyway, here is one I think you
might all be interested in:

"* A molecular dynamics (MD) simulation of curvature-inducing protein
BAR domains binding to a charged phospholipid vesicle over 10 ns
simulation time under periodic boundary conditions. The vesicle, 100
nm in diameter, should consist of a mixture of
dioleoylphosphatidylcholine (DOPC) and dioleoylphosphatidylserine
(DOPS) at a ratio of 2:1. The entire system should consist of 100,000
lipids and 1000 BAR domains solvated in 30 million water molecules,
with NaCl also included at a concentration of 0.15 M, for a total
system size of 100 million atoms. All system components should be
modeled using the CHARMM27 all-atom empirical force field. The target
wall-clock time for completion of the model problem using the NAMD MD
package with the velocity Verlet time-stepping algorithm, Langevin
dynamics temperature coupling, Nose-Hoover Langevin piston pressure
control, the Particle Mesh Ewald algorithm with a tolerance of 1.0e-6
for calculation of electrostatics, a short-range (van der Waals)
cut-off of 12 Angstroms, and a time step of 0.002 ps, with 64-bit
floating point (or similar) arithmetic, is 25 hours. The positions,
velocities, and forces of all the atoms should be saved to disk every
500 timesteps."

HHHmmm, interesting... Well, I tried a back-of-the-envelope calculation for
this. I set up a test simulation of a 408,000 atom system in PMEMD (which
typically runs about 10 to 15% quicker than NAMD in my experience), using
settings as close to the specs given above as I could get. Here is the input
file I used:

 equilibration
 &cntrl
   nstlim=1000,dt=0.002,es_cutoff=8.0,
   vdw_cutoff=12.0,
   ntc=2, ntf=2, tol=0.000001,
   ntx=5, irest=1, ntpr=500,
   ntt=3, gamma_ln=2.0,
   ntb=2,ntp=1,taup=2.0,
   ntwr=0, ntwx=500, ntwv=-1, ioutfm=1
 /
 &ewald
  dsum_tol=0.000001
 /

I ran this on a single-cpu 1.7 GHz POWER4 machine, which has a peak flop
rating of 6.8 GFlops. This 1000-step (2 ps) calculation took 9602.61 seconds
to run, so 10 ns would take 5000 times as long: 48,013,050 seconds on 6.8
GFlops. So, assuming we could achieve 100% scaling on any number of cpus,
getting this calculation done in 25 hours would require:

6.8 GFlops * 48,013,050 s / (25 * 3600 s) = 3627.65 GFlops = 3.627 TFlops.
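
For reference, here is a quick Python sketch of that arithmetic (the variable
names are just mine; the 9602.61 s timing and the 6.8 GFlops peak are the
numbers above):

# Sustained flop rate needed to finish 10 ns of the 408,000 atom benchmark
# in 25 hours, scaled from the measured 1000-step (2 ps) timing and treating
# the POWER4 run as if it used its full 6.8 GFlops peak.
peak_gflops = 6.8                    # 1.7 GHz POWER4 peak
secs_per_1000_steps = 9602.61        # measured time for 1000 steps = 2 ps
secs_for_10ns = secs_per_1000_steps * (10000.0 / 2.0)  # 10 ns / 2 ps = 5000x
target_secs = 25 * 3600.0            # 25 hour wall-clock target
required_tflops = peak_gflops * secs_for_10ns / target_secs / 1000.0
print(required_tflops)               # ~3.63 TFlops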

Now PME scales as N ln N, so the cost ratio between the two systems should be
(N2 ln N2)/(N1 ln N1) = (100*10^6/408,000) * (ln 10^8 / ln 408,000) = 245.1 *
1.43 = roughly 350 times more computation. Hence we would need 3.627 * 350 =
about 1.3 Petaflops sustained, assuming 100% perfect scaling....
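
In the same back-of-the-envelope Python, assuming the per-step cost really is
proportional to N ln N:

import math
n_test, n_big = 408000.0, 100.0e6
# cost(N) ~ N*ln(N), so the cost ratio is (N2*ln N2)/(N1*ln N1)
ratio = (n_big * math.log(n_big)) / (n_test * math.log(n_test))  # ~350
required_pflops = 3.627 * ratio / 1000.0
print(ratio, required_pflops)        # ~350 times the work, ~1.3 PFlops sustained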

Whoops!!! Even with perfect scaling that is more than the machine's entire
peak. Anybody want to volunteer to write the code to do the specified
calculation in 25 hours on a machine of only 1 petaflop... ;-)

Then if you want more laughs you can look at the I/O. They want the full
coordinates, velocities and forces (why the forces I don't know) written
every 500 steps. So for 10 ns you would write a total of 10,000 frames. Each
of the three arrays (C, V and F) is 100*10^6 * 3 * 8 bytes, so a frame comes
to about 6.7 GB, and 10,000 frames to roughly 65.5 Terabytes.
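
The arithmetic, again as a small Python sketch (8-byte doubles, three
components per atom, three quantities per frame):

n_atoms = 100.0e6
bytes_per_frame = 3 * 3 * 8 * n_atoms     # coords + velocities + forces, x/y/z, 8 bytes each
n_frames = 10000                          # 10 ns / (500 steps * 0.002 ps)
total_bytes = bytes_per_frame * n_frames
print(bytes_per_frame / 2**30)            # ~6.7 GB per frame
print(total_bytes / 2**40)                # ~65.5 TB in total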

This is not an inordinate amount, but consider that with NAMD only the
master thread writes files (I guess we will have to assume that a fully
distributed I/O implementation can be written). If we allow, say, a generous
5% of the calculation time for writing to disk (which, considering we already
need better than 100% scaling just to hit the compute target, is probably
over generous ;-) ), then we would have to write 65.5 terabytes in 1.25 hours
of master cpu time. That equates to a bandwidth to disk from the master node
alone of 14.9 GB/sec. And since each write would also require gathering all
the data to the master, we would need another 14.9 GB/sec of bandwidth on the
backplane to the master...
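
Which works out as follows (using the 65.5 TB total from above and 5% of the
25 hour wall clock):

write_budget_secs = 0.05 * 25 * 3600.0     # 5% of the 25 hour target = 1.25 hours
total_bytes = 10000 * 3 * 3 * 8 * 100.0e6  # the ~65.5 TB from above
print(total_bytes / write_budget_secs / 2**30)  # ~14.9 GB/sec out of the master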

So, who wants to volunteer to write the code to tackle this problem???

Have fun...
Ross