Hi Ross,
I just checked my changes into the amber11 CVS tree.
Attached is a bugfix for amber10.
Please let me know if they work for you. Note that after my patch
the NEB test case still shows as a "possible failure". The difference,
in my view, is trivial: it is just some lines of "---------".
Sincerely,
Wei
On Sep 10, 2008, at 7:24 AM, Ross Walker wrote:
> Hi Wei,
>
> Send any fixes to me that you can't check in and I'll try and do
> it / see if
> I can fix it.
>
> All the best
> Ross
>
>> -----Original Message-----
>> From: owner-amber-developers.scripps.edu [mailto:owner-amber-
>> developers.scripps.edu] On Behalf Of Wei Zhang
>> Sent: Tuesday, September 09, 2008 8:45 AM
>> To: amber-developers.scripps.edu
>> Subject: Re: amber-developers: Testing of PIMD / NEB? (Broken in
>> AMBER 10)
>>
>> Hi All,
>>
>> I think I found the bug that causes the NEB test case to fail. In
>> sander.f, there are lines like the following:
>>
>> if ( ipimd > 0 .or. ineb > 0 ) then
>>    call pimd_init(natom,x(lmass),x(lwinv),x(lvel),ipimd)
>> end if
>>
>> which should be changed to:
>>
>> if ( ipimd > 0 ) then
>>    call pimd_init(natom,x(lmass),x(lwinv),x(lvel),ipimd)
>> end if
>>
>> This should fix the problem.
>>
>> As I mentioned before, right now I cannot access the CVS tree. I will
>> check in
>> the change and prepare a bugfix as soon as I can.
>>
>> Sincerely,
>>
>> Wei
>>
>> On Sep 8, 2008, at 7:04 PM, Francesco Paesani wrote:
>>
>>> Hello,
>>>
>>> I guess that Wei and I are in charge of PIMD. Unfortunately, I am in
>>> Europe and I am not able to check the code right now. It seems to me
>>> that the problem arises when one tries to run full PIMD with a
>>> larger number of CPUs than beads. So, the test cases now fail for
>>> #cpus > 4 because the tests are set up with just 4 beads. I am not very
>>> familiar with the full parallelization so Wei may be more helpful in
>>> this case. However, I will look at the code as soon as I come back.
>>>
>>> Thanks,
>>> Francesco
>>>
>>>
>>> On Sep 8, 2008, at 9:21 AM, Carlos Simmerling wrote:
>>>
>>>> Hi Ross,
>>>> I don't know the PIMD code, but I have updated the NEB code and
>>>> fixed
>>>> a lot of bugs. Over the summer we found what is hopefully the last
>>>> bug, and I plan to check it in soon; I'm just busy with the start of
>>>> the semester. I think I'm the only one keeping track of the NEB code
>>>> these days; when I had questions before, nobody else replied... no
>>>> idea if
>>>> anyone is maintaining the PIMD code. I split some of the NEB
>>>> initializations out of PIMD because I didn't want to change PIMD
>>>> code.
>>>> Carlos
>>>>
>>>> On Fri, Sep 5, 2008 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk>
>>>> wrote:
>>>>> Hi All,
>>>>>
>>>>> Is anybody testing / using PIMD? It seems to be irreparably broken
>>>>> in
>>>>> parallel. This is for both amber10 and amber11.
>>>>>
>>>>> I found this out by trying to test NEB:
>>>>>
>>>>> export DO_PARALLEL='mpirun -np 8'
>>>>> cd $AMBERHOME/test/neb/neb_gb/
>>>>> ./Run.neb_classical
>>>>> [cli_0]: aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>>>> [cli_4]: aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>>>>> [cli_5]: aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>>>>> [cli_6]: program error.
>>>>>
>>>>> Found in all output files:
>>>>>
>>>>> Parameters:
>>>>> number of beads = 8
>>>>> number of classical atoms = 22
>>>>> temperature (Kelvin) = 300.00
>>>>> ASSERTion 'ierr.eq.0' failed in pimd_init.f at line 320.
>>>>>
>>>>> (same error for both amber10 and amber11)
>>>>>
>>>>> Looking at pimd_init.f, line 320 is:
>>>>>
>>>>> allocate( nrg_all(nbead), stat=ierr )
>>>>> REQUIRE(ierr.eq.0)
>>>>>
>>>>> So it is only trying to allocate nrg_all with size 8, and nbead is
>>>>> set correctly on all nodes, so I can't understand why this is
>>>>> broken - maybe memory corruption elsewhere.
>>>>>
>>>>> BTW, pimd_init.f is also pretty dangerous - e.g. just below the
>>>>> allocation
>>>>> above we have:
>>>>>
>>>>> allocate( springforce(3*natomCL) )
>>>>> allocate( tangents(3*natomCL) )
>>>>> allocate( fitgroup(natomCL) )
>>>>> allocate( rmsgroup(natomCL) )
>>>>>
>>>>> so 4 allocate statements where NO return value is checked.
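>>>>>
>>>>> For what it's worth, a safer version of those four calls could reuse
>>>>> the stat/REQUIRE pattern from line 320 above (a rough sketch only,
>>>>> assuming ierr and the REQUIRE macro are still in scope at that
>>>>> point):
>>>>>
>>>>> ! check the allocation status of each array and abort on failure,
>>>>> ! matching the pattern used for nrg_all above
>>>>> allocate( springforce(3*natomCL), stat=ierr )
>>>>> REQUIRE(ierr.eq.0)
>>>>> allocate( tangents(3*natomCL), stat=ierr )
>>>>> REQUIRE(ierr.eq.0)
>>>>> allocate( fitgroup(natomCL), stat=ierr )
>>>>> REQUIRE(ierr.eq.0)
>>>>> allocate( rmsgroup(natomCL), stat=ierr )
>>>>> REQUIRE(ierr.eq.0)
>>>>>
>>>>> That way an allocation failure at these lines would show up as the
>>>>> same kind of clear ASSERTion message as the nrg_all case.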
>>>>>
>>>>> It is not just NEB that is broken, though. If I try to run PIMD
>>>>> itself:
>>>>>
>>>>> export DO_PARALLEL='mpirun -np 8'
>>>>> cd $AMBERHOME/test/
>>>>> make test.sander.PIMD.MPI.partial
>>>>>
>>>>> These all pass.
>>>>>
>>>>> make test.sander.PIMD.MPI.full
>>>>>
>>>>> Everything fails with invalid communicators etc. E.g.
>>>>> cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
>>>>> Testing Centroid MD
>>>>> [cli_3]: aborting job:
>>>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>>>> count=28,
>>>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>>>> MPI_Reduce(714): Null communicator
>>>>> [cli_1]: aborting job:
>>>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>>>> count=28,
>>>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>>>> MPI_Reduce(714): Null communicator
>>>>> [cli_7]: aborting job:
>>>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>>>> count=28,
>>>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>>>> MPI_Reduce(714): Null communicator
>>>>> [cli_5]: aborting job:
>>>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>>>> count=28,
>>>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>>>> MPI_Reduce(714): Null communicator
>>>>> [cli_0]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>>>> MPI_Allreduce(696)........................:
>>>>> MPI_Allreduce(sbuf=0x1717c68,
>>>>> rbuf=0x1717c88, count=4, MPI_DOUBLE_PRECISION, MPI_SUM,
>>>>> MPI_COMM_WORLD)
>>>>> failed
>>>>> MPIR_Allreduce(285).......................:
>>>>> MPIC_Sendrecv(161)........................:
>>>>> MPIC_Wait(321)............................:
>>>>> MPIDI_CH3_Progress_wait(198)..............: an error occurred
>>>>> while handling
>>>>> an event returned by MPIDU_Sock_Wait()
>>>>> MPIDI_CH3I_Progress_handle_sock_event(422):
>>>>> MPIDU_Socki_handle_read(649)..............: connection failure
>>>>> (set=0,sock=3,errno=104:(strerror() not found))
>>>>> program error.
>>>>>
>>>>> So it looks like it is completely broken here.
>>>>>
>>>>> Interestingly, with 4 CPUs these test cases pass. So something is
>>>>> very wrong when you use something other than 4 CPUs. Additionally,
>>>>> of course, the NEB code, which uses PIMD, does not work.
>>>>>
>>>>> Any suggestions? Who is maintaining the PIMD code these days and
>>>>> wants to
>>>>> fix it and release some bugfixes?
>>>>>
>>>>> All the best
>>>>> Ross
>>>>>
>>>>> /\
>>>>> \/
>>>>> |\oss Walker
>>>>>
>>>>> | Assistant Research Professor |
>>>>> | San Diego Supercomputer Center |
>>>>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>>>>> | http://www.rosswalker.co.uk | PGP Key available on request |
>>>>>
>>>>> Note: Electronic Mail is not secure, has no guarantee of delivery,
>>>>> may not
>>>>> be read every day, and should not be used for urgent or sensitive
>>>>> issues.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ===================================================================
>>>> Carlos L. Simmerling, Ph.D.
>>>> Associate Professor Phone: (631) 632-1336
>>>> Center for Structural Biology Fax: (631) 632-1555
>>>> CMM Bldg, Room G80
>>>> Stony Brook University E-mail: carlos.simmerling.gmail.com
>>>> Stony Brook, NY 11794-5115 Web: http://comp.chem.sunysb.edu
>>>> ===================================================================
>>>
>
Received on Thu Sep 11 2008 - 08:41:54 PDT