RE: amber-developers: Testing of PIMD / NEB? (Broken in AMBER 10)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 10 Sep 2008 05:24:01 -0700

Hi Wei,

Send me any fixes that you can't check in and I'll try to check them in
myself, or see if I can fix it.

All the best
Ross

> -----Original Message-----
> From: owner-amber-developers.scripps.edu [mailto:owner-amber-
> developers.scripps.edu] On Behalf Of Wei Zhang
> Sent: Tuesday, September 09, 2008 8:45 AM
> To: amber-developers.scripps.edu
> Subject: Re: amber-developers: Testing of PIMD / NEB? (Broken in AMBER 10)
>
> Hi All,
>
> I think I found the bug that causes the NEB test case to fail. In
> sander.f, there are lines like the following:
>
> if ( ipimd > 0 .or. ineb > 0 ) then
>    call pimd_init(natom,x(lmass),x(lwinv),x(lvel),ipimd)
> end if
>
> which should be changed to:
>
> if ( ipimd > 0 ) then
>    call pimd_init(natom,x(lmass),x(lwinv),x(lvel),ipimd)
> end if
>
> This should fix the problem: pimd_init should only be called for genuine
> PIMD runs (ipimd > 0), not for NEB-only runs.
>
> As I mentioned before, right now I cannot access the CVS tree. I will
> check in
> the change and prepare a bugfix as soon as I can.
>
>
> Sincerely,
>
> Wei
>
> On Sep 8, 2008, at 7:04 PM, Francesco Paesani wrote:
>
> > Hello,
> >
> > I guess that Wei and I are in charge of PIMD. Unfortunately, I am in
> > Europe and I am not able to check the code right now. It seems to me
> > that the problem arises when one tries to run full PIMD with a
> > larger number of cpus than beads. So, the test cases now fail for
> > #cpu > 4 because the tests use just 4 beads. I am not very
> > familiar with the full parallelization, so Wei may be more helpful in
> > this case. However, I will look at the code as soon as I come back.
> >
> > Thanks,
> > Francesco
> >
> >
> > On Sep 8, 2008, at 9:21 AM, Carlos Simmerling wrote:
> >
> >> Hi Ross,
> >> I don't know the PIMD code, but I have updated the NEB code and fixed
> >> a lot of bugs. Over the summer we found what is hopefully the last bug,
> >> and I plan to check the fix in soon; I've just been busy with the start of
> >> the
> >> semester. I think I'm the only one keeping track of the NEB code
> >> these
> >> days; when I had questions before, nobody else replied... no idea if
> >> anyone is maintaining the PIMD code. I split some of the NEB
> >> initializations out of PIMD because I didn't want to change PIMD
> >> code.
> >> Carlos
> >>
> >> On Fri, Sep 5, 2008 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk>
> >> wrote:
> >>> Hi All,
> >>>
> >>> Is anybody testing / using PIMD? It seems to be irreparably broken
> >>> in
> >>> parallel. This is for both amber10 and amber11.
> >>>
> >>> I found this out by trying to test NEB:
> >>>
> >>> export DO_PARALLEL='mpirun -np 8'
> >>> cd $AMBERHOME/test/neb/neb_gb/
> >>> ./Run.neb_classical
> >>> [cli_0]: aborting job:
> >>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> >>> [cli_4]: aborting job:
> >>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
> >>> [cli_5]: aborting job:
> >>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
> >>> [cli_6]: program error.
> >>>
> >>> found in all output files:
> >>>
> >>> Parameters:
> >>> number of beads = 8
> >>> number of classical atoms = 22
> >>> temperature (Kelvin) = 300.00
> >>> ASSERTion 'ierr.eq.0' failed in pimd_init.f at line 320.
> >>>
> >>> (same error for both amber10 and amber11)
> >>>
> >>> Looking at pimd_init.f, line 320 is:
> >>>
> >>> allocate( nrg_all(nbead), stat=ierr )
> >>> REQUIRE(ierr.eq.0)
> >>>
> >>> So it is only trying to allocate nrg_all to size 8, and nbead is set
> >>> correctly on all nodes, so I can't understand why this is broken -
> >>> maybe memory corruption elsewhere.
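> >>>
> >>> To help track this down, a quick debug print just before the REQUIRE
> >>> would at least show which rank fails and what it thinks nbead and ierr
> >>> are. A minimal sketch (assuming mytaskid holds the MPI rank, as it does
> >>> elsewhere in sander):
> >>>
> >>> allocate( nrg_all(nbead), stat=ierr )
> >>> if ( ierr /= 0 ) then
> >>>    write(6,*) 'pimd_init: nrg_all allocation failed on rank ', mytaskid
> >>>    write(6,*) '           nbead = ', nbead, '  ierr = ', ierr
> >>> end if
> >>> REQUIRE(ierr.eq.0)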
> >>>
> >>> BTW, pimd_init.f is also pretty dangerous - e.g. just below the
> >>> allocation
> >>> above we have:
> >>>
> >>> allocate( springforce(3*natomCL) )
> >>> allocate( tangents(3*natomCL) )
> >>> allocate( fitgroup(natomCL) )
> >>> allocate( rmsgroup(natomCL) )
> >>>
> >>> so 4 allocate statements where NO return value is checked.
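> >>>
> >>> Presumably these should follow the same stat= / REQUIRE idiom used for
> >>> nrg_all just above, i.e. something like:
> >>>
> >>> allocate( springforce(3*natomCL), stat=ierr )
> >>> REQUIRE(ierr.eq.0)
> >>> allocate( tangents(3*natomCL), stat=ierr )
> >>> REQUIRE(ierr.eq.0)
> >>> allocate( fitgroup(natomCL), stat=ierr )
> >>> REQUIRE(ierr.eq.0)
> >>> allocate( rmsgroup(natomCL), stat=ierr )
> >>> REQUIRE(ierr.eq.0)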
> >>>
> >>> It is not just NEB that is broken, though. If I try to run PIMD
> >>> itself:
> >>>
> >>> export DO_PARALLEL='mpirun -np 8'
> >>> cd $AMBERHOME/test/
> >>> make test.sander.PIMD.MPI.partial
> >>>
> >>> These all pass.
> >>>
> >>> make test.sander.PIMD.MPI.full
> >>>
> >>> Everything fails with invalid communicators etc. E.g.
> >>> cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
> >>> Testing Centroid MD
> >>> [cli_3]: aborting job:
> >>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> >>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
> >>> count=28,
> >>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> >>> MPI_Reduce(714): Null communicator
> >>> [cli_1]: aborting job:
> >>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> >>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
> >>> count=28,
> >>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> >>> MPI_Reduce(714): Null communicator
> >>> [cli_7]: aborting job:
> >>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> >>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
> >>> count=28,
> >>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> >>> MPI_Reduce(714): Null communicator
> >>> [cli_5]: aborting job:
> >>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
> >>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
> >>> count=28,
> >>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
> >>> MPI_Reduce(714): Null communicator
> >>> [cli_0]: aborting job:
> >>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> >>> MPI_Allreduce(696)........................:
> >>> MPI_Allreduce(sbuf=0x1717c68,
> >>> rbuf=0x1717c88, count=4, MPI_DOUBLE_PRECISION, MPI_SUM,
> >>> MPI_COMM_WORLD)
> >>> failed
> >>> MPIR_Allreduce(285).......................:
> >>> MPIC_Sendrecv(161)........................:
> >>> MPIC_Wait(321)............................:
> >>> MPIDI_CH3_Progress_wait(198)..............: an error occurred
> >>> while handling
> >>> an event returned by MPIDU_Sock_Wait()
> >>> MPIDI_CH3I_Progress_handle_sock_event(422):
> >>> MPIDU_Socki_handle_read(649)..............: connection failure
> >>> (set=0,sock=3,errno=104:(strerror() not found))
> >>> program error.
> >>>
> >>> So it looks like it is completely broken here.
> >>>
> >>> Interestingly, with 4 cpus these test cases pass, so something is very
> >>> wrong when you use anything other than 4 cpus. Additionally, of course,
> >>> the NEB code, which uses PIMD, does not work.
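> >>>
> >>> If the full-PIMD / NEB communicator setup really does assume a
> >>> particular relationship between the number of cpus and the number of
> >>> beads, it would be much better to fail cleanly at startup than to die
> >>> inside MPI_Reduce on a null communicator. A minimal sketch of the kind
> >>> of check I mean (the divisibility requirement and the exact placement
> >>> are guesses on my part; numtasks, nbead and mexit as used elsewhere in
> >>> sander):
> >>>
> >>> if ( mod(numtasks, nbead) /= 0 ) then
> >>>    write(6,*) 'PIMD/NEB: number of MPI tasks (', numtasks, ')'
> >>>    write(6,*) '          must be a multiple of the number of beads (', nbead, ')'
> >>>    call mexit(6, 1)
> >>> end if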
> >>>
> >>> Any suggestions? Who is maintaining the PIMD code these days and
> >>> wants to
> >>> fix it and release some bugfixes?
> >>>
> >>> All the best
> >>> Ross
> >>>
> >>> /\
> >>> \/
> >>> |\oss Walker
> >>>
> >>> | Assistant Research Professor |
> >>> | San Diego Supercomputer Center |
> >>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> >>> | http://www.rosswalker.co.uk | PGP Key available on request |
> >>>
> >>> Note: Electronic Mail is not secure, has no guarantee of delivery,
> >>> may not
> >>> be read every day, and should not be used for urgent or sensitive
> >>> issues.
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> ===================================================================
> >> Carlos L. Simmerling, Ph.D.
> >> Associate Professor Phone: (631) 632-1336
> >> Center for Structural Biology Fax: (631) 632-1555
> >> CMM Bldg, Room G80
> >> Stony Brook University E-mail: carlos.simmerling.gmail.com
> >> Stony Brook, NY 11794-5115 Web: http://comp.chem.sunysb.edu
> >> ===================================================================
> >
Received on Thu Sep 11 2008 - 08:41:12 PDT