amber-developers: Testing of PIMD / NEB? (Broken in AMBER 10)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 5 Sep 2008 14:46:38 -0700

Hi All,

Is anybody testing / using PIMD? It seems to be hopelessly broken in
parallel; this applies to both amber10 and amber11.

I found this out by trying to test NEB:

export DO_PARALLEL='mpirun -np 8'
cd $AMBERHOME/test/neb/neb_gb/
./Run.neb_classical
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[cli_4]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
[cli_5]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
[cli_6]: program error.

The following is found in all of the output files:

Parameters:
number of beads = 8
number of classical atoms = 22
temperature (Kelvin) = 300.00
ASSERTion 'ierr.eq.0' failed in pimd_init.f at line 320.

(same error for both amber10 and amber11)

Looking at pimd_init.f, line 320 is:

   allocate( nrg_all(nbead), stat=ierr )
   REQUIRE(ierr.eq.0)

So it is only trying to allocate nrg_all to size 8, and nbead is set
correctly on all nodes, so I can't understand why this is broken - maybe
there is memory corruption elsewhere.

BTW, pimd_init.f is also pretty dangerous - e.g. just below the allocation
above we have:

   allocate( springforce(3*natomCL) )
   allocate( tangents(3*natomCL) )
   allocate( fitgroup(natomCL) )
   allocate( rmsgroup(natomCL) )

so that's four allocate statements where NO return status is checked.
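If nothing else it would be worth adding stat checks here. A quick sketch of
what I mean (untested, just following the nrg_all pattern above and reusing
the ierr that is already declared in pimd_init.f):

   ! Sketch only: check the stat of each allocation, as is already done
   ! for nrg_all, so a failure aborts with a file/line message instead
   ! of silently carrying on with unallocated arrays.
   allocate( springforce(3*natomCL), stat=ierr )
   REQUIRE(ierr.eq.0)
   allocate( tangents(3*natomCL), stat=ierr )
   REQUIRE(ierr.eq.0)
   allocate( fitgroup(natomCL), stat=ierr )
   REQUIRE(ierr.eq.0)
   allocate( rmsgroup(natomCL), stat=ierr )
   REQUIRE(ierr.eq.0)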

It is not just NEB that is broken, though. If I try to run PIMD itself:

export DO_PARALLEL='mpirun -np 8'
cd $AMBERHOME/test/
make test.sander.PIMD.MPI.partial

These all pass.

make test.sander.PIMD.MPI.full

Everything fails with invalid-communicator errors, etc. E.g.:
cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
Testing Centroid MD
[cli_3]: aborting job:
Fatal error in MPI_Reduce: Invalid communicator, error stack:
MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
MPI_Reduce(714): Null communicator
[cli_1]: aborting job:
Fatal error in MPI_Reduce: Invalid communicator, error stack:
MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
MPI_Reduce(714): Null communicator
[cli_7]: aborting job:
Fatal error in MPI_Reduce: Invalid communicator, error stack:
MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
MPI_Reduce(714): Null communicator
[cli_5]: aborting job:
Fatal error in MPI_Reduce: Invalid communicator, error stack:
MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
MPI_Reduce(714): Null communicator
[cli_0]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(696)........................: MPI_Allreduce(sbuf=0x1717c68,
rbuf=0x1717c88, count=4, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD)
failed
MPIR_Allreduce(285).......................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(321)............................:
MPIDI_CH3_Progress_wait(198)..............: an error occurred while handling
an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(422):
MPIDU_Socki_handle_read(649)..............: connection failure
(set=0,sock=3,errno=104:(strerror() not found))
program error.

So it looks like it is completely broken here.

Interestingly, with 4 cpus these test cases pass, so something is very wrong
when you use anything other than 4 cpus. Additionally, of course, the NEB
code, which uses PIMD, does not work.
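For what it's worth, here is a hypothetical standalone illustration (not the
sander/PIMD source - the group count, names and layout are made up) of how a
rank count that doesn't match the expected group layout can leave some ranks
holding MPI_COMM_NULL, which then blows up in exactly this way the moment
they hit a collective:

   program null_comm_demo
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, nranks, color, group_comm
      integer, parameter :: ngroups = 4   ! hypothetical bead-group count
      double precision :: sbuf, rbuf

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      ! Only ranks that fit the assumed group layout get a real color;
      ! any leftover rank is handed MPI_UNDEFINED and so receives
      ! MPI_COMM_NULL back from MPI_Comm_split.
      if (rank < ngroups * (nranks / ngroups)) then
         color = mod(rank, ngroups)
      else
         color = MPI_UNDEFINED
      end if
      call MPI_Comm_split(MPI_COMM_WORLD, color, rank, group_comm, ierr)

      sbuf = dble(rank)
      if (group_comm /= MPI_COMM_NULL) then
         ! Fine on ranks that got a real group communicator...
         call MPI_Reduce(sbuf, rbuf, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                         0, group_comm, ierr)
         call MPI_Comm_free(group_comm, ierr)
      else
         ! ...but an unguarded MPI_Reduce here would abort with the same
         ! 'Invalid communicator / Null communicator' error as in the
         ! output above.
         write(6,*) 'rank ', rank, ' was left with MPI_COMM_NULL'
      end if

      call MPI_Finalize(ierr)
   end program null_comm_demo

I don't know whether that is what the PIMD group setup is actually doing with
8 cpus, but it would match the symptoms.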

Any suggestions? Who is maintaining the PIMD code these days and wants to
fix it and release some bugfixes?

All the best
Ross

/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
<a href="http://archive.ambermd.org.">The Amber Mailing List Archive recently moved to archive.ambermd.org</a>
<a href="http://ambermd.org.">The Amber Molecular Dynamics website recently moved to ambermd.org</a>
<a href="http://ross.ch.ic.ac.uk/adsense_top10/">The top 10 Google adsense and adwords alternatives</a>
<a href="http://ross.ch.ic.ac.uk/adwords_top10/">The top 10 Google adwords alternatives</a>
<a href="http://ross.ch.ic.ac.uk/tivo_upgrade/">A Guide to upgrading Tivo and Tivo HD DVRs and PVRs</a>
<a href="http://ross.ch.ic.ac.uk/adsense_alternatives/">The best alternatives to Google adsense and adwords</a>
Received on Sun Sep 07 2008 - 06:07:55 PDT