Re: amber-developers: Testing of PIMD / NEB? (Broken in AMBER 10)

From: Carlos Simmerling <carlos.simmerling.gmail.com>
Date: Mon, 8 Sep 2008 11:21:37 -0400

Hi Ross,
I don't know the PIMD code, but I have updated the NEB code and fixed
a lot of bugs. We found what is hopefully the last bug over the summer,
and I plan to check the fix in soon; I've just been busy with the start
of the semester. I think I'm the only one keeping track of the NEB code
these days (when I had questions before, nobody else replied), so I have
no idea whether anyone is maintaining the PIMD code. I split some of the
NEB initializations out of PIMD because I didn't want to change the PIMD
code.
Carlos

On Fri, Sep 5, 2008 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi All,
>
> Is anybody testing / using PIMD? It seems to be badly broken in
> parallel. This applies to both amber10 and amber11.
>
> I found this out by trying to test NEB:
>
> export DO_PARALLEL='mpirun -np 8'
> cd $AMBERHOME/test/neb/neb_gb/
> ./Run.neb_classical
> [cli_0]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> [cli_4]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
> [cli_5]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
> [cli_6]: program error.
>
> The following error is found in all of the output files:
>
> Parameters:
> number of beads = 8
> number of classical atoms = 22
> temperature (Kelvin) = 300.00
> ASSERTion 'ierr.eq.0' failed in pimd_init.f at line 320.
>
> (same error for both amber10 and amber11)
>
> Looking at pimd_init.f line 320 it is
>
> allocate( nrg_all(nbead), stat=ierr )
> REQUIRE(ierr.eq.0)
>
> So it is only trying to allocate nrg_all to size 8, and nbead is set correctly
> on all nodes, so I can't see why this fails; maybe there is memory corruption
> elsewhere.
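> 
> As a first diagnostic, something along these lines (just a sketch reusing the
> existing ierr and nbead variables plus sander's mexit, not a proposed fix)
> would at least report which stat value comes back instead of dying inside
> REQUIRE:
> 
>   ! Print the allocation status before aborting, so the failing
>   ! condition is visible in the output of every task.
>   allocate( nrg_all(nbead), stat=ierr )
>   if ( ierr .ne. 0 ) then
>      write(6,*) 'nrg_all allocate failed: nbead = ', nbead, ', stat = ', ierr
>      call mexit(6, 1)
>   end if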
>
> BTW, pimd_init.f is also pretty dangerous - e.g. just below the allocation
> above we have:
>
> allocate( springforce(3*natomCL) )
> allocate( tangents(3*natomCL) )
> allocate( fitgroup(natomCL) )
> allocate( rmsgroup(natomCL) )
>
> so there are 4 allocate statements where NO return value is checked.
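> 
> A sketch of the same allocations with the status checked, following the
> nrg_all pattern just above (this assumes the REQUIRE macro is usable at this
> point, as it evidently is a few lines earlier):
> 
> ! check every allocation, not just nrg_all
> allocate( springforce(3*natomCL), stat=ierr )
> REQUIRE(ierr.eq.0)
> allocate( tangents(3*natomCL), stat=ierr )
> REQUIRE(ierr.eq.0)
> allocate( fitgroup(natomCL), stat=ierr )
> REQUIRE(ierr.eq.0)
> allocate( rmsgroup(natomCL), stat=ierr )
> REQUIRE(ierr.eq.0)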
>
> It is not just NEB that is broken, though. If I try to run PIMD itself:
>
> export DO_PARALLEL='mpirun -np 8'
> cd $AMBERHOME/test/
> make test.sander.PIMD.MPI.partial
>
> These all pass.
>
> make test.sander.PIMD.MPI.full
>
> Everything fails with invalid communicator errors etc. For example:
> cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
> Testing Centroid MD
> [cli_3]: aborting job:
> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> MPI_Reduce(714): Null communicator
> [cli_1]: aborting job:
> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> MPI_Reduce(714): Null communicator
> [cli_7]: aborting job:
> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> MPI_Reduce(714): Null communicator
> [cli_5]: aborting job:
> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28,
> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> MPI_Reduce(714): Null communicator
> [cli_0]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(696)........................: MPI_Allreduce(sbuf=0x1717c68,
> rbuf=0x1717c88, count=4, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD)
> failed
> MPIR_Allreduce(285).......................:
> MPIC_Sendrecv(161)........................:
> MPIC_Wait(321)............................:
> MPIDI_CH3_Progress_wait(198)..............: an error occurred while handling
> an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(422):
> MPIDU_Socki_handle_read(649)..............: connection failure
> (set=0,sock=3,errno=104:(strerror() not found))
> program error.
>
> So it looks like it is completely broken here.
>
> Interestingly, with 4 CPUs these test cases pass, so something is very wrong
> when you use a CPU count other than 4. Additionally, of course, the NEB code,
> which uses the PIMD code, does not work.
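> 
> For what it's worth, one common way to end up with a reduce on MPI_COMM_NULL
> is a task whose color in mpi_comm_split came out as MPI_UNDEFINED, e.g. if the
> group setup silently assumes a particular task count. The toy program below
> (standalone, made-up names, not the sander code) shows that failure mode and
> the guard that avoids it:
> 
> program comm_null_demo
>    ! Illustration only: ranks whose color is MPI_UNDEFINED in mpi_comm_split
>    ! receive MPI_COMM_NULL, and any later collective on that communicator
>    ! dies with "Null communicator", as in the traces above.
>    implicit none
>    include 'mpif.h'
>    integer :: ierr, myrank, numtasks, color, group_comm
>    integer, parameter :: ngroups = 4   ! pretend the code is wired for 4 tasks
>    double precision :: x, xsum
> 
>    call mpi_init(ierr)
>    call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)
>    call mpi_comm_size(MPI_COMM_WORLD, numtasks, ierr)
> 
>    ! Only the first ngroups tasks get a group; every extra task falls
>    ! through to MPI_UNDEFINED and is handed MPI_COMM_NULL by the split.
>    if ( myrank < ngroups ) then
>       color = myrank
>    else
>       color = MPI_UNDEFINED
>    end if
>    call mpi_comm_split(MPI_COMM_WORLD, color, 0, group_comm, ierr)
> 
>    x = 1.d0
>    if ( group_comm .ne. MPI_COMM_NULL ) then
>       call mpi_reduce(x, xsum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, group_comm, ierr)
>    else
>       ! Without this guard the reduce fails exactly like the runs above.
>       write(6,*) 'task', myrank, 'has no group communicator (numtasks =', numtasks, ')'
>    end if
> 
>    call mpi_finalize(ierr)
> end program comm_null_demo
> 
> Run with 4 tasks, every rank has a valid communicator; run with 8, the extra
> ranks hit the MPI_COMM_NULL branch, which at least matches the 4-vs-8 CPU
> behaviour above.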
>
> Any suggestions? Who is maintaining the PIMD code these days and wants to
> fix it and release some bugfixes?
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.

-- 
===================================================================
Carlos L. Simmerling, Ph.D.
Associate Professor
Center for Structural Biology, CMM Bldg, Room G80
Stony Brook University, Stony Brook, NY 11794-5115
Phone: (631) 632-1336   Fax: (631) 632-1555
E-mail: carlos.simmerling.gmail.com
Web: http://comp.chem.sunysb.edu
===================================================================
Received on Wed Sep 10 2008 - 06:07:25 PDT