amber-developers: NEB is busted with ncpu /= nreplica (flaws in the parallel test cases)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 30 Apr 2007 13:53:51 -0700

Hi Wei, Francesco and others,

As a follow-up to what I sent you earlier (Wei), I just ran some tests and
have uncovered a problem with NEB that I suspect affects PIMD as well.

I ran 50 steps of alanine phi-psi rotation based on the heating phase of the
Amber 9 NEB tutorial but using Amber 10. This has 30 replicas. Running with
30 MPI tasks I get what looks like a reasonable output file. The energies
are reasonable and the energies of replicas 1 and 30 remain fixed, as they
should. So after 50 steps we have:

 NSTEP =       50   TIME(PS) =     0.02500   TEMP(K) =     0.97   PRESS =     0.0
 Etot   =        0.0000  EKtot   =        0.0000  EPtot      =     -923.8469
 BOND   =       23.8927  ANGLE   =       57.5721  DIHED      =      270.9344
 1-4 NB =       88.2706  1-4 EEL =     1323.4884  VDWAALS    =      -49.3114
 EELEC  =    -2086.8697  EGB     =     -551.8240  RESTRAINT  =        0.0000
NEB replicate breakdown:
Energy for replicate 1 = -32.7646
Energy for replicate 2 = -32.7582
Energy for replicate 3 = -32.7582
Energy for replicate 4 = -32.7582
Energy for replicate 5 = -32.7582
Energy for replicate 6 = -32.7582
Energy for replicate 7 = -32.7582
Energy for replicate 8 = -32.7582
Energy for replicate 9 = -32.7582
Energy for replicate 10 = -32.7582
Energy for replicate 11 = -32.7582
Energy for replicate 12 = -32.7582
Energy for replicate 13 = -32.7580
Energy for replicate 14 = -32.7476
Energy for replicate 15 = -26.6800
Energy for replicate 16 = -26.1991
Energy for replicate 17 = -29.4524
Energy for replicate 18 = -29.4535
Energy for replicate 19 = -29.4535
Energy for replicate 20 = -29.4535
Energy for replicate 21 = -29.4535
Energy for replicate 22 = -29.4535
Energy for replicate 23 = -29.4535
Energy for replicate 24 = -29.4535
Energy for replicate 25 = -29.4535
Energy for replicate 26 = -29.4535
Energy for replicate 27 = -29.4535
Energy for replicate 28 = -29.4535
Energy for replicate 29 = -29.4535
Energy for replicate 30 = -29.4626
Total Energy of replicates = -923.8469
NEB RMS = 0.000000

Note that the NEB RMS is still not working, though: it always shows 0.000000.

Now, if we repeat the exact same calculation with 60 cpus (i.e. 2 per
replica), we get garbage after 50 steps:


 NSTEP =       50   TIME(PS) =     0.02500   TEMP(K) =  1188.21   PRESS =     0.0
 Etot   =        0.0000  EKtot   =        0.0000  EPtot      =    31975.5541
 BOND   =    25627.3226  ANGLE   =     5938.0900  DIHED      =      880.9965
 1-4 NB =      690.5816  1-4 EEL =     1400.3019  VDWAALS    =      119.5234
 EELEC  =    -2041.5006  EGB     =     -639.7612  RESTRAINT  =        0.0000
NEB replicate breakdown:
Energy for replicate 1 = -32.7626
Energy for replicate 2 = -32.7329
Energy for replicate 3 = -32.7329
Energy for replicate 4 = -32.7329
Energy for replicate 5 = -32.7329
Energy for replicate 6 = -32.7329
Energy for replicate 7 = -32.7329
Energy for replicate 8 = -32.7329
Energy for replicate 9 = -32.7329
Energy for replicate 10 = -32.7329
Energy for replicate 11 = -32.7329
Energy for replicate 12 = -32.7329
Energy for replicate 13 = -32.7323
Energy for replicate 14 = -32.7201
Energy for replicate 15 = -27.0427
Energy for replicate 16 = 1051.8587
Energy for replicate 17 = 1696.9930
Energy for replicate 18 = 1687.6905
Energy for replicate 19 = 1740.1162
Energy for replicate 20 = 1682.0969
Energy for replicate 21 = 1678.2029
Energy for replicate 22 = 1486.9083
Energy for replicate 23 = 2720.0971
Energy for replicate 24 = 2061.0010
Energy for replicate 25 = 2684.6440
Energy for replicate 26 = 4699.5947
Energy for replicate 27 = 1992.0437
Energy for replicate 28 = 1240.5178
Energy for replicate 29 = 2154.1431
Energy for replicate 30 = 3884.9658
Total Energy of replicates = 31975.5541
NEB RMS = 0.000000

The complete pimdout files are attached.

It would appear that part of the problem here is that our test cases are not
sufficiently robust. For example, the PIMD test cases (and the NEB ones)
override whatever the user sets for DO_PARALLEL. E.g.:

else
  set MY_DO_PARALLEL="$DO_PARALLEL"
  set numprocs=`echo $DO_PARALLEL | awk -f numprocs.awk`
  if ( $numprocs != 4 ) then
    echo "this test is set up for 4 nodes only, changing node number to 4..."
    set MY_DO_PARALLEL=`echo $DO_PARALLEL | awk -f chgprocs.awk`
  endif

Thus they ALWAYS use 4 cpus - and, incidentally, they all have 4 beads or
replicas, so the situation where we have multiple cpus per bead or replica
is NEVER TESTED :-(. Hence how such a problem arises...
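One way the wrapper could be rewritten is to accept the user's processor
count whenever it is a positive multiple of the replica count, and only
fall back to the default otherwise - that way the multi-cpu-per-replica
code path actually gets exercised. The sketch below is illustrative only:
it inlines the "-np" parsing rather than using the real numprocs.awk, and
NREPLICA and the mpirun command line are assumed values, not the actual
Amber test harness.

```shell
#!/bin/sh
# Sketch (not the real Amber test script): accept any processor count
# that is a positive multiple of the replica count.
NREPLICA=4                      # assumed replica/bead count for this test
DO_PARALLEL="mpirun -np 8"      # stands in for the user's setting

# Pull the count after "-np" (inline substitute for numprocs.awk).
numprocs=`echo "$DO_PARALLEL" | awk '{for(i=1;i<=NF;i++) if($i=="-np") print $(i+1)}'`

if [ -n "$numprocs" ] && [ "$numprocs" -gt 0 ] && [ $((numprocs % NREPLICA)) -eq 0 ]; then
    # A multiple of NREPLICA: keep it, so >1 cpu per replica is tested too.
    MY_DO_PARALLEL="$DO_PARALLEL"
else
    echo "processor count must be a multiple of $NREPLICA; using $NREPLICA"
    MY_DO_PARALLEL="mpirun -np $NREPLICA"
fi
echo "$MY_DO_PARALLEL"
```

With 8 processors and 4 replicas this keeps the user's setting (2 cpus per
replica), which is exactly the configuration the current tests never reach.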

Wei, since you know the NEB code better than I do, can you see if you can
track down this problem?

In addition, I suggest we rethink the way the PIMD test cases work so that
the above problem does not recur.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.




Received on Wed May 02 2007 - 06:07:21 PDT