Re: amber-developers: NEB is busted with ncpu /= nreplica (flaws in the parallel test cases)

From: Wei Zhang <zweig.scripps.edu>
Date: Mon, 30 Apr 2007 17:09:10 -0500

Hi Ross,

I think I may have a clue about what is wrong in the code. Would you please
send me all your input files (prmtop, inpcrd and mdin) so I can test it?

Sincerely,

Wei


Ross Walker wrote:

>Hi Wei, Francesco and others,
>
>As a follow-up to what I sent you earlier (Wei), I just ran some tests and
>have uncovered a problem with NEB and, I suspect, with PIMD as well.
>
>I ran 50 steps of alanine phi-psi rotation based on the heating phase of the
>Amber 9 NEB tutorial but using Amber 10. This has 30 replicas. Running with
>30 MPI tasks I get what looks like a reasonable output file. The energies
>are reasonable, and those of replicas 1 and 30 remain fixed, as they
>should. So after 50 steps we have:
>
> NSTEP = 50 TIME(PS) = 0.02500 TEMP(K) = 0.97 PRESS = 0.0
> Etot = 0.0000 EKtot = 0.0000 EPtot = -923.8469
> BOND = 23.8927 ANGLE = 57.5721 DIHED = 270.9344
> 1-4 NB = 88.2706 1-4 EEL = 1323.4884 VDWAALS = -49.3114
> EELEC = -2086.8697 EGB = -551.8240 RESTRAINT = 0.0000
>NEB replicate breakdown:
>Energy for replicate 1 = -32.7646
>Energy for replicate 2 = -32.7582
>Energy for replicate 3 = -32.7582
>Energy for replicate 4 = -32.7582
>Energy for replicate 5 = -32.7582
>Energy for replicate 6 = -32.7582
>Energy for replicate 7 = -32.7582
>Energy for replicate 8 = -32.7582
>Energy for replicate 9 = -32.7582
>Energy for replicate 10 = -32.7582
>Energy for replicate 11 = -32.7582
>Energy for replicate 12 = -32.7582
>Energy for replicate 13 = -32.7580
>Energy for replicate 14 = -32.7476
>Energy for replicate 15 = -26.6800
>Energy for replicate 16 = -26.1991
>Energy for replicate 17 = -29.4524
>Energy for replicate 18 = -29.4535
>Energy for replicate 19 = -29.4535
>Energy for replicate 20 = -29.4535
>Energy for replicate 21 = -29.4535
>Energy for replicate 22 = -29.4535
>Energy for replicate 23 = -29.4535
>Energy for replicate 24 = -29.4535
>Energy for replicate 25 = -29.4535
>Energy for replicate 26 = -29.4535
>Energy for replicate 27 = -29.4535
>Energy for replicate 28 = -29.4535
>Energy for replicate 29 = -29.4535
>Energy for replicate 30 = -29.4626
>Total Energy of replicates = -923.8469
>NEB RMS = 0.000000
>
>Note the NEB RMS is still not working though and always shows 0.000000.
>
>Now if we repeat the exact same calculation with 60 cpus (i.e. 2 per
>replica) then we get garbage after 50 steps:
>
>
> NSTEP = 50 TIME(PS) = 0.02500 TEMP(K) = 1188.21 PRESS = 0.0
> Etot = 0.0000 EKtot = 0.0000 EPtot = 31975.5541
> BOND = 25627.3226 ANGLE = 5938.0900 DIHED = 880.9965
> 1-4 NB = 690.5816 1-4 EEL = 1400.3019 VDWAALS = 119.5234
> EELEC = -2041.5006 EGB = -639.7612 RESTRAINT = 0.0000
>NEB replicate breakdown:
>Energy for replicate 1 = -32.7626
>Energy for replicate 2 = -32.7329
>Energy for replicate 3 = -32.7329
>Energy for replicate 4 = -32.7329
>Energy for replicate 5 = -32.7329
>Energy for replicate 6 = -32.7329
>Energy for replicate 7 = -32.7329
>Energy for replicate 8 = -32.7329
>Energy for replicate 9 = -32.7329
>Energy for replicate 10 = -32.7329
>Energy for replicate 11 = -32.7329
>Energy for replicate 12 = -32.7329
>Energy for replicate 13 = -32.7323
>Energy for replicate 14 = -32.7201
>Energy for replicate 15 = -27.0427
>Energy for replicate 16 = 1051.8587
>Energy for replicate 17 = 1696.9930
>Energy for replicate 18 = 1687.6905
>Energy for replicate 19 = 1740.1162
>Energy for replicate 20 = 1682.0969
>Energy for replicate 21 = 1678.2029
>Energy for replicate 22 = 1486.9083
>Energy for replicate 23 = 2720.0971
>Energy for replicate 24 = 2061.0010
>Energy for replicate 25 = 2684.6440
>Energy for replicate 26 = 4699.5947
>Energy for replicate 27 = 1992.0437
>Energy for replicate 28 = 1240.5178
>Energy for replicate 29 = 2154.1431
>Energy for replicate 30 = 3884.9658
>Total Energy of replicates = 31975.5541
>NEB RMS = 0.000000
>
>The complete pimdout files are attached.
>
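
The comparison between the two runs can also be made mechanically. The following is a minimal Python sketch, assuming the breakdown lines keep the "Energy for replicate N = X" form shown above; the function names and the 1.0 tolerance are hypothetical choices for illustration and are not part of Amber.

```python
import re

# Matches lines such as ">Energy for replicate 16 = 1051.8587"; a leading ">" is ignored.
REPLICATE_RE = re.compile(r"Energy for replicate\s+(\d+)\s*=\s*(-?\d+(?:\.\d+)?)")


def parse_replicate_energies(text):
    """Return {replicate number: energy} for every breakdown line found in text."""
    return {int(num): float(energy) for num, energy in REPLICATE_RE.findall(text)}


def diverged_replicates(reference, candidate, tol=1.0):
    """Return the replicates present in both runs whose energies differ by more than tol.

    tol is an arbitrary illustrative threshold, not a value taken from Amber.
    """
    shared = sorted(set(reference) & set(candidate))
    return [n for n in shared if abs(reference[n] - candidate[n]) > tol]


# Four lines excerpted from each of the two runs quoted above.
run_30_tasks = """
>Energy for replicate 1 = -32.7646
>Energy for replicate 15 = -26.6800
>Energy for replicate 16 = -26.1991
>Energy for replicate 30 = -29.4626
"""
run_60_tasks = """
>Energy for replicate 1 = -32.7626
>Energy for replicate 15 = -27.0427
>Energy for replicate 16 = 1051.8587
>Energy for replicate 30 = 3884.9658
"""

bad = diverged_replicates(
    parse_replicate_energies(run_30_tasks),
    parse_replicate_energies(run_60_tasks),
)
print(bad)  # prints [16, 30]
```

Applied to the full breakdowns quoted above, the same tolerance would flag replicates 16 through 30, while replicates 1 through 15 agree between the two runs to within about 0.4.
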
>And it would appear that the problem here is a function of us not having
>sufficiently robust test cases. For example the PIMD test cases (and the NEB
>ones) override whatever the user sets for DO_PARALLEL. E.g.:
>
>else
>  set MY_DO_PARALLEL="$DO_PARALLEL"
>  set numprocs=`echo $DO_PARALLEL | awk -f numprocs.awk`
>  if ( $numprocs != 4 ) then
>    echo "this test is set up for 4 nodes only, changing node number to 4..."
>    set MY_DO_PARALLEL=`echo $DO_PARALLEL | awk -f chgprocs.awk`
>  endif
>
>Thus they ALWAYS use 4 cpus - and incidentally they all have 4 beads or
>replicas, so the situation where we have multiple cpus per bead or
>replica is NEVER TESTED :-(. Hence how such a problem arises...
>
>Wei, since you know the NEB code better than me, can you see if you can track
>down this problem?
>
>In addition I suggest we have a rethink of the way the PIMD test cases work
>so that the above problem does not occur.
>
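
One possible direction for such a rethink, sketched in Python rather than in the csh of the test scripts: keep the task count the user requested whenever it is a whole multiple of the number of replicas, and fall back to one task per replica only otherwise. The function name `plan_task_count` and the whole-multiple rule are assumptions made for illustration; neither comes from the Amber test suite.

```python
def plan_task_count(requested, nreplica):
    """Decide how many MPI tasks a replica-based test should run with.

    Returns (tasks_to_use, changed). The user's requested count is kept when
    it is a whole multiple of nreplica (an assumed validity rule), so runs
    with several tasks per replica are exercised instead of being overridden.
    """
    if requested <= 0 or nreplica <= 0:
        raise ValueError("task and replica counts must be positive")
    if requested % nreplica == 0:
        return requested, False
    # Hypothetical fallback: one task per replica, mirroring the current override.
    return nreplica, True


# With 4 replicas, a request for 8 tasks (2 per replica) is left alone,
# while a request for 6 tasks is changed to 4.
print(plan_task_count(8, 4))
print(plan_task_count(6, 4))
```

With a rule of this kind, setting DO_PARALLEL to 8 tasks on a 4-replica test would run 2 tasks per replica, which is the case the current override never reaches.
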
>All the best
>Ross
>
>/\
>\/
>|\oss Walker
>
>| HPC Consultant and Staff Scientist |
>| San Diego Supercomputer Center |
>| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>| http://www.rosswalker.co.uk | PGP Key available on request |
>
>Note: Electronic Mail is not secure, has no guarantee of delivery, may not
>be read every day, and should not be used for urgent or sensitive issues.
>
>
Received on Wed May 02 2007 - 06:07:22 PDT