Hi All,
I think I found the bug that causes full PIMD to fail on 8 CPUs. In
force.f and runmd.f there are calls like the following:
mpi_reduce( ..., commmaster, ... )
which should be replaced by:
if(master) mpi_reduce( ..., commmaster, ... )
On the ranks that are not masters, commmaster is evidently a null
communicator (that is the MPI_COMM_NULL / "Invalid communicator" abort
in Ross's logs), so only the masters may post that reduce. Once we fix
these calls, the problem should be gone.
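To make the guard concrete, here is a minimal stand-alone sketch. It is
not the actual sander source: the communicator split, the master flag
and the energy variable are invented for illustration, but it shows the
same if(master) pattern around a collective on a masters-only
communicator.

   program guard_commmaster
      ! Illustration only: pretend every second world rank is a "bead
      ! master", so commmaster is valid only on those ranks and is
      ! MPI_COMM_NULL everywhere else.
      use mpi
      implicit none
      integer :: ierr, worldrank, color, commmaster
      logical :: master
      double precision :: local_nrg, total_nrg

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, worldrank, ierr)

      master = (mod(worldrank, 2) == 0)          ! hypothetical grouping
      color  = merge(0, MPI_UNDEFINED, master)
      call mpi_comm_split(mpi_comm_world, color, worldrank, commmaster, ierr)
      ! Non-master ranks now hold commmaster == MPI_COMM_NULL.

      local_nrg = dble(worldrank)
      if (master) then
         ! Safe: only ranks that actually belong to commmaster enter the
         ! collective, which is the same if(master) guard as in the fix.
         call mpi_reduce(local_nrg, total_nrg, 1, mpi_double_precision, &
                         mpi_sum, 0, commmaster, ierr)
      end if

      call mpi_finalize(ierr)
   end program guard_commmaster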
Right now I am having problems accessing the CVS tree; I will check in
the change as soon as I can. I am looking into the NEB problem too.
Best Regards!
Wei
On Sep 8, 2008, at 7:04 PM, Francesco Paesani wrote:
> Hello,
>
> I guess that Wei and I are in charge of PIMD. Unfortunately, I am in
> Europe and I am not able to check the code right now. It seems to me
> that the problem arises when one tries to run full PIMD with a
> larger number of cpus than beads. So, the test cases now fail for
> #cpu > 4 because the tests use only 4 beads. I am not very
> familiar with the full parallelization so Wei may be more helpful in
> this case. However, I will look at the code as soon as I come back.
>
> Thanks,
> Francesco
>
>
> On Sep 8, 2008, at 9:21 AM, Carlos Simmerling wrote:
>
>> Hi Ross,
>> I don't know the PIMD code, but I have updated the NEB code and fixed
>> a lot of bugs. Over the summer we found what is hopefully the last
>> bug, and I plan to check the fix in soon; I'm just busy with the start
>> of the semester. I think I'm the only one keeping track of the NEB
>> code these days; when I had questions before, nobody else replied...
>> no idea if anyone is maintaining the PIMD code. I split some of the
>> NEB initializations out of PIMD because I didn't want to change the
>> PIMD code.
>> Carlos
>>
>> On Fri, Sep 5, 2008 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>> Hi All,
>>>
>>> Is anybody testing / using PIMD? It seems to be irreparably broken in
>>> parallel. This is for both amber10 and amber11.
>>>
>>> I found this out by trying to test NEB:
>>>
>>> export DO_PARALLEL='mpirun -np 8'
>>> cd $AMBERHOME/test/neb/neb_gb/
>>> ./Run.neb_classical
>>> [cli_0]: aborting job:
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> [cli_4]: aborting job:
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>>> [cli_5]: aborting job:
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>>> [cli_6]: program error.
>>>
>>> This was found in all output files:
>>>
>>> Parameters:
>>> number of beads = 8
>>> number of classical atoms = 22
>>> temperature (Kelvin) = 300.00
>>> ASSERTion 'ierr.eq.0' failed in pimd_init.f at line 320.
>>>
>>> (same error for both amber10 and amber11)
>>>
>>> Looking at pimd_init.f, line 320 is:
>>>
>>> allocate( nrg_all(nbead), stat=ierr )
>>> REQUIRE(ierr.eq.0)
>>>
>>> So it is only trying to allocate nrg_all to size 8, and nbead is set
>>> correctly on all nodes, so I can't understand why this is broken -
>>> maybe memory corruption elsewhere.
>>>
>>> BTW, pimd_init.f is also pretty dangerous - e.g. just below the
>>> allocation above we have:
>>>
>>> allocate( springforce(3*natomCL) )
>>> allocate( tangents(3*natomCL) )
>>> allocate( fitgroup(natomCL) )
>>> allocate( rmsgroup(natomCL) )
>>>
>>> so there are four allocate statements where NO return value is checked.
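>>>
>>> Just to illustrate (this is sketched from the stat=ierr / REQUIRE
>>> pattern used for nrg_all above, not taken verbatim from the file),
>>> checking those four calls would at least catch a failed allocation:
>>>
>>> allocate( springforce(3*natomCL), stat=ierr )
>>> REQUIRE(ierr.eq.0)
>>> allocate( tangents(3*natomCL), stat=ierr )
>>> REQUIRE(ierr.eq.0)
>>> allocate( fitgroup(natomCL), stat=ierr )
>>> REQUIRE(ierr.eq.0)
>>> allocate( rmsgroup(natomCL), stat=ierr )
>>> REQUIRE(ierr.eq.0)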
>>>
>>> It is not just NEB that is broken, though. If I try to run PIMD
>>> itself:
>>>
>>> export DO_PARALLEL='mpirun -np 8'
>>> cd $AMBERHOME/test/
>>> make test.sander.PIMD.MPI.partial
>>>
>>> These all pass.
>>>
>>> make test.sander.PIMD.MPI.full
>>>
>>> Everything fails with invalid communicators etc. E.g.
>>> cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
>>> Testing Centroid MD
>>> [cli_3]: aborting job:
>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>> count=28,
>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>> MPI_Reduce(714): Null communicator
>>> [cli_1]: aborting job:
>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>> count=28,
>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>> MPI_Reduce(714): Null communicator
>>> [cli_7]: aborting job:
>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>> count=28,
>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>> MPI_Reduce(714): Null communicator
>>> [cli_5]: aborting job:
>>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330,
>>> count=28,
>>> MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>>> MPI_Reduce(714): Null communicator
>>> [cli_0]: aborting job:
>>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>> MPI_Allreduce(696)........................:
>>> MPI_Allreduce(sbuf=0x1717c68,
>>> rbuf=0x1717c88, count=4, MPI_DOUBLE_PRECISION, MPI_SUM,
>>> MPI_COMM_WORLD)
>>> failed
>>> MPIR_Allreduce(285).......................:
>>> MPIC_Sendrecv(161)........................:
>>> MPIC_Wait(321)............................:
>>> MPIDI_CH3_Progress_wait(198)..............: an error occurred
>>> while handling
>>> an event returned by MPIDU_Sock_Wait()
>>> MPIDI_CH3I_Progress_handle_sock_event(422):
>>> MPIDU_Socki_handle_read(649)..............: connection failure
>>> (set=0,sock=3,errno=104:(strerror() not found))
>>> program error.
>>>
>>> So it looks like it is completely broken here.
>>>
>>> Interestingly, with 4 CPUs these test cases pass, so something is
>>> very wrong when you use anything other than 4 CPUs. Additionally, of
>>> course, the NEB code, which uses PIMD, does not work.
>>>
>>> Any suggestions? Who is maintaining the PIMD code these days and
>>> wants to fix it and release some bugfixes?
>>>
>>> All the best
>>> Ross
>>>
>>> /\
>>> \/
>>> |\oss Walker
>>>
>>> | Assistant Research Professor |
>>> | San Diego Supercomputer Center |
>>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>>> | http://www.rosswalker.co.uk | PGP Key available on request |
>>>
>>> Note: Electronic Mail is not secure, has no guarantee of delivery,
>>> may not
>>> be read every day, and should not be used for urgent or sensitive
>>> issues.
>>>
>>
>> --
>> ===================================================================
>> Carlos L. Simmerling, Ph.D.
>> Associate Professor Phone: (631) 632-1336
>> Center for Structural Biology Fax: (631) 632-1555
>> CMM Bldg, Room G80
>> Stony Brook University E-mail: carlos.simmerling.gmail.com
>> Stony Brook, NY 11794-5115 Web: http://comp.chem.sunysb.edu
>> ===================================================================
>
Received on Wed Sep 10 2008 - 06:07:45 PDT