amber-developers: NEB is busted with ncpu /= nreplica (flaws in the parallel test cases)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 30 Apr 2007 13:53:51 -0700

Hi Wei, Francesco and others,

As a follow-up to what I sent you earlier (Wei), I just ran some tests and
have uncovered a problem with NEB that I suspect affects PIMD as well.

I ran 50 steps of alanine phi-psi rotation based on the heating phase of the
Amber 9 NEB tutorial but using Amber 10. This has 30 replicas. Running with
30 MPI tasks I get what looks like a reasonable output file. The energies
are reasonable and the energies of replicas 1 and 30 remain fixed, as they
should. So after 50 steps we have:

 NSTEP =       50   TIME(PS) =     0.02500   TEMP(K) =     0.97   PRESS =     0.0
 Etot   =        0.0000  EKtot   =        0.0000  EPtot      =     -923.8469
 BOND   =       23.8927  ANGLE   =       57.5721  DIHED      =      270.9344
 1-4 NB =       88.2706  1-4 EEL =     1323.4884  VDWAALS    =      -49.3114
 EELEC  =    -2086.8697  EGB     =     -551.8240  RESTRAINT  =        0.0000
NEB replicate breakdown:
Energy for replicate 1 = -32.7646
Energy for replicate 2 = -32.7582
Energy for replicate 3 = -32.7582
Energy for replicate 4 = -32.7582
Energy for replicate 5 = -32.7582
Energy for replicate 6 = -32.7582
Energy for replicate 7 = -32.7582
Energy for replicate 8 = -32.7582
Energy for replicate 9 = -32.7582
Energy for replicate 10 = -32.7582
Energy for replicate 11 = -32.7582
Energy for replicate 12 = -32.7582
Energy for replicate 13 = -32.7580
Energy for replicate 14 = -32.7476
Energy for replicate 15 = -26.6800
Energy for replicate 16 = -26.1991
Energy for replicate 17 = -29.4524
Energy for replicate 18 = -29.4535
Energy for replicate 19 = -29.4535
Energy for replicate 20 = -29.4535
Energy for replicate 21 = -29.4535
Energy for replicate 22 = -29.4535
Energy for replicate 23 = -29.4535
Energy for replicate 24 = -29.4535
Energy for replicate 25 = -29.4535
Energy for replicate 26 = -29.4535
Energy for replicate 27 = -29.4535
Energy for replicate 28 = -29.4535
Energy for replicate 29 = -29.4535
Energy for replicate 30 = -29.4626
Total Energy of replicates = -923.8469
NEB RMS = 0.000000

Note that the NEB RMS is still not working, though: it always shows 0.000000.

Now, if we repeat the exact same calculation with 60 cpus (i.e. 2 per
replica), we get garbage after 50 steps:


 NSTEP =       50   TIME(PS) =     0.02500   TEMP(K) =  1188.21   PRESS =     0.0
 Etot   =        0.0000  EKtot   =        0.0000  EPtot      =    31975.5541
 BOND   =    25627.3226  ANGLE   =     5938.0900  DIHED      =      880.9965
 1-4 NB =      690.5816  1-4 EEL =     1400.3019  VDWAALS    =      119.5234
 EELEC  =    -2041.5006  EGB     =     -639.7612  RESTRAINT  =        0.0000
NEB replicate breakdown:
Energy for replicate 1 = -32.7626
Energy for replicate 2 = -32.7329
Energy for replicate 3 = -32.7329
Energy for replicate 4 = -32.7329
Energy for replicate 5 = -32.7329
Energy for replicate 6 = -32.7329
Energy for replicate 7 = -32.7329
Energy for replicate 8 = -32.7329
Energy for replicate 9 = -32.7329
Energy for replicate 10 = -32.7329
Energy for replicate 11 = -32.7329
Energy for replicate 12 = -32.7329
Energy for replicate 13 = -32.7323
Energy for replicate 14 = -32.7201
Energy for replicate 15 = -27.0427
Energy for replicate 16 = 1051.8587
Energy for replicate 17 = 1696.9930
Energy for replicate 18 = 1687.6905
Energy for replicate 19 = 1740.1162
Energy for replicate 20 = 1682.0969
Energy for replicate 21 = 1678.2029
Energy for replicate 22 = 1486.9083
Energy for replicate 23 = 2720.0971
Energy for replicate 24 = 2061.0010
Energy for replicate 25 = 2684.6440
Energy for replicate 26 = 4699.5947
Energy for replicate 27 = 1992.0437
Energy for replicate 28 = 1240.5178
Energy for replicate 29 = 2154.1431
Energy for replicate 30 = 3884.9658
Total Energy of replicates = 31975.5541
NEB RMS = 0.000000

The complete pimdout files are attached.

It would appear that part of the problem here is that our test cases are not
sufficiently robust. For example, the PIMD test cases (and the NEB ones)
override whatever the user sets for DO_PARALLEL. E.g.:

else
  set MY_DO_PARALLEL="$DO_PARALLEL"
  set numprocs=`echo $DO_PARALLEL | awk -f numprocs.awk`
  if ( $numprocs != 4 ) then
    echo "this test is set up for 4 nodes only, changing node number to 4..."
    set MY_DO_PARALLEL=`echo $DO_PARALLEL | awk -f chgprocs.awk`
  endif

Thus they ALWAYS use 4 cpus - and, incidentally, they all have 4 beads or
replicas, so the situation where we have multiple cpus per bead or replica
is NEVER TESTED :-(. Hence how such a problem arises...
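One way the wrapper could be rewritten is to accept the user's processor
count whenever it is a positive multiple of the replica count, and only
fall back to the default otherwise - that way the multi-cpu-per-replica
code path actually gets exercised. The sketch below is illustrative only:
it inlines the "-np" parsing rather than using the real numprocs.awk, and
NREPLICA and the mpirun command line are assumed values, not the actual
Amber test harness.

```shell
#!/bin/sh
# Sketch (not the real Amber test script): accept any processor count
# that is a positive multiple of the replica count.
NREPLICA=4                      # assumed replica/bead count for this test
DO_PARALLEL="mpirun -np 8"      # stands in for the user's setting

# Pull the count after "-np" (inline substitute for numprocs.awk).
numprocs=`echo "$DO_PARALLEL" | awk '{for(i=1;i<=NF;i++) if($i=="-np") print $(i+1)}'`

if [ -n "$numprocs" ] && [ "$numprocs" -gt 0 ] && [ $((numprocs % NREPLICA)) -eq 0 ]; then
    # A multiple of NREPLICA: keep it, so >1 cpu per replica is tested too.
    MY_DO_PARALLEL="$DO_PARALLEL"
else
    echo "processor count must be a multiple of $NREPLICA; using $NREPLICA"
    MY_DO_PARALLEL="mpirun -np $NREPLICA"
fi
echo "$MY_DO_PARALLEL"
```

With 8 processors and 4 replicas this keeps the user's setting (2 cpus per
replica), which is exactly the configuration the current tests never reach.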

Wei, since you know the NEB code better than I do, can you see if you can
track down this problem?

In addition, I suggest we rethink the way the PIMD test cases work so that
the above problem does not recur.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.




Received on Wed May 02 2007 - 06:07:21 PDT