Hi Ross,
I think I may have a clue about what is wrong in the code. Would you please
send me all your input files (prmtop, inpcrd and mdin) so I can test it?
Sincerely,
Wei
Ross Walker wrote:
>Hi Wei, Francesco and others,
>
>As a follow-up to what I sent you earlier (Wei), I just ran some tests and
>have uncovered a problem with NEB and, I suspect, with PIMD as well.
>
>I ran 50 steps of alanine phi-psi rotation based on the heating phase of the
>Amber 9 NEB tutorial but using Amber 10. This has 30 replicas. Running with
>30 MPI tasks I get what looks like a reasonable output file. The energies
>are reasonable and the energy of replica 1 and 30 remain fixed as they
>should do. So after 50 steps we have:
>
> NSTEP =       50   TIME(PS) =     0.02500  TEMP(K) =     0.97  PRESS =     0.0
> Etot   =         0.0000  EKtot   =         0.0000  EPtot      =      -923.8469
> BOND   =        23.8927  ANGLE   =        57.5721  DIHED      =       270.9344
> 1-4 NB =        88.2706  1-4 EEL =      1323.4884  VDWAALS    =       -49.3114
> EELEC  =     -2086.8697  EGB     =      -551.8240  RESTRAINT  =         0.0000
>NEB replicate breakdown:
>Energy for replicate   1 =      -32.7646
>Energy for replicate   2 =      -32.7582
>Energy for replicate   3 =      -32.7582
>Energy for replicate   4 =      -32.7582
>Energy for replicate   5 =      -32.7582
>Energy for replicate   6 =      -32.7582
>Energy for replicate   7 =      -32.7582
>Energy for replicate   8 =      -32.7582
>Energy for replicate   9 =      -32.7582
>Energy for replicate  10 =      -32.7582
>Energy for replicate  11 =      -32.7582
>Energy for replicate  12 =      -32.7582
>Energy for replicate  13 =      -32.7580
>Energy for replicate  14 =      -32.7476
>Energy for replicate  15 =      -26.6800
>Energy for replicate  16 =      -26.1991
>Energy for replicate  17 =      -29.4524
>Energy for replicate  18 =      -29.4535
>Energy for replicate  19 =      -29.4535
>Energy for replicate  20 =      -29.4535
>Energy for replicate  21 =      -29.4535
>Energy for replicate  22 =      -29.4535
>Energy for replicate  23 =      -29.4535
>Energy for replicate  24 =      -29.4535
>Energy for replicate  25 =      -29.4535
>Energy for replicate  26 =      -29.4535
>Energy for replicate  27 =      -29.4535
>Energy for replicate  28 =      -29.4535
>Energy for replicate  29 =      -29.4535
>Energy for replicate  30 =      -29.4626
>Total Energy of replicates =     -923.8469
>NEB RMS =      0.000000
>
>Note, though, that the NEB RMS is still not working: it always shows 0.000000.
>
>Now if we repeat the exact same calculation with 60 cpus (i.e. 2 per
>replica) then we get garbage after 50 steps:
>
>
> NSTEP =       50   TIME(PS) =     0.02500  TEMP(K) =  1188.21  PRESS =     0.0
> Etot   =         0.0000  EKtot   =         0.0000  EPtot      =     31975.5541
> BOND   =     25627.3226  ANGLE   =      5938.0900  DIHED      =       880.9965
> 1-4 NB =       690.5816  1-4 EEL =      1400.3019  VDWAALS    =       119.5234
> EELEC  =     -2041.5006  EGB     =      -639.7612  RESTRAINT  =         0.0000
>NEB replicate breakdown:
>Energy for replicate   1 =      -32.7626
>Energy for replicate   2 =      -32.7329
>Energy for replicate   3 =      -32.7329
>Energy for replicate   4 =      -32.7329
>Energy for replicate   5 =      -32.7329
>Energy for replicate   6 =      -32.7329
>Energy for replicate   7 =      -32.7329
>Energy for replicate   8 =      -32.7329
>Energy for replicate   9 =      -32.7329
>Energy for replicate  10 =      -32.7329
>Energy for replicate  11 =      -32.7329
>Energy for replicate  12 =      -32.7329
>Energy for replicate  13 =      -32.7323
>Energy for replicate  14 =      -32.7201
>Energy for replicate  15 =      -27.0427
>Energy for replicate  16 =     1051.8587
>Energy for replicate  17 =     1696.9930
>Energy for replicate  18 =     1687.6905
>Energy for replicate  19 =     1740.1162
>Energy for replicate  20 =     1682.0969
>Energy for replicate  21 =     1678.2029
>Energy for replicate  22 =     1486.9083
>Energy for replicate  23 =     2720.0971
>Energy for replicate  24 =     2061.0010
>Energy for replicate  25 =     2684.6440
>Energy for replicate  26 =     4699.5947
>Energy for replicate  27 =     1992.0437
>Energy for replicate  28 =     1240.5178
>Energy for replicate  29 =     2154.1431
>Energy for replicate  30 =     3884.9658
>Total Energy of replicates =    31975.5541
>NEB RMS =      0.000000
>
>The complete pimdout files are attached.
>
>The underlying problem appears to be that our test cases are not
>sufficiently robust. For example, the PIMD test cases (and the NEB
>ones) override whatever the user sets for DO_PARALLEL. E.g.:
>
>else 
>  set MY_DO_PARALLEL="$DO_PARALLEL"
>  set numprocs=`echo $DO_PARALLEL | awk -f numprocs.awk `
>  if ( $numprocs != 4 ) then
>    echo "this test is set up for 4 nodes only, changing node number to 4..."
>    set MY_DO_PARALLEL=`echo $DO_PARALLEL | awk -f chgprocs.awk`
>  endif
>
>Thus they ALWAYS use 4 cpus. Incidentally, they all have 4 beads or
>replicas, so the situation where we have multiple cpus per bead or
>replica is NEVER TESTED :-(. Hence how such a problem arises...
>
>Wei, since you know the NEB code better than I do, can you see if you can
>track down this problem?
>
>In addition I suggest we have a rethink of the way the PIMD test cases work
>so that the above problem does not occur.
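
One possible direction for such a rethink, sketched in Python rather than the
csh used by the test scripts: instead of forcing the process count to 4, a
test could accept any count that is a whole multiple of the number of beads or
replicas, so that runs with 2 or more cpus per replica are exercised too. The
function names and the rule for reading the count out of DO_PARALLEL (the
number following -np or -n) are assumptions made for this illustration; the
real numprocs.awk may parse the string differently.

```python
import re


def procs_from_do_parallel(do_parallel):
    """Return the process count from a launcher string like 'mpirun -np 60'.

    Assumption for this sketch: the count follows a -np or -n flag.
    """
    match = re.search(r"(?:^|\s)-(?:np|n)\s+(\d+)(?:\s|$)", do_parallel)
    if match is None:
        raise ValueError("no process count found in: %r" % do_parallel)
    return int(match.group(1))


def cpus_per_replica(do_parallel, nreplicas):
    """Return cpus per replica, or raise if the layout cannot be split evenly.

    Unlike the quoted csh fragment, the user's setting is never overridden:
    any positive multiple of nreplicas is accepted as-is.
    """
    nprocs = procs_from_do_parallel(do_parallel)
    if nprocs < nreplicas or nprocs % nreplicas != 0:
        raise ValueError(
            "%d processes cannot be divided evenly among %d replicas"
            % (nprocs, nreplicas)
        )
    return nprocs // nreplicas
```

With this check, the 30-replica run above would be accepted with both 30 and
60 processes (1 and 2 cpus per replica), while a 4-replica test launched on 6
processes would be rejected rather than silently reset to 4.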
>
>All the best
>Ross
>
>/\
>\/
>|\oss Walker
>
>| HPC Consultant and Staff Scientist |
>| San Diego Supercomputer Center |
>| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>| http://www.rosswalker.co.uk | PGP Key available on request |
>
>Note: Electronic Mail is not secure, has no guarantee of delivery, may not
>be read every day, and should not be used for urgent or sensitive issues. 
>  
>
Received on Wed May 02 2007 - 06:07:22 PDT