Re: amber-developers: Testing of PIMD / NEB? (Broken in AMBER 10)

From: Francesco Paesani <fpaesani.gmail.com>
Date: Mon, 8 Sep 2008 18:04:51 -0600

Hello,

I guess that Wei and I are in charge of PIMD. Unfortunately, I am in
Europe at the moment and cannot check the code right now. It seems to
me that the problem arises when one tries to run full PIMD with more
cpus than beads, so the test cases now fail for #cpu > 4 because the
tests use only 4 beads. I am not very familiar with the full
parallelization, so Wei may be more helpful here. However, I will look
at the code as soon as I get back.
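
As a concrete illustration of what I mean, here is only a sketch of the
kind of guard I have in mind, not code that is currently in pimd_init.f;
worldsize is a placeholder name, and I am assuming (based on which tests
fail) that the real constraint is simply #cpus <= #beads:

   ! Hypothetical sanity check at PIMD start-up (sketch only).
   ! Assumes full PIMD cannot usefully run with more cpus than beads.
   use mpi                       ! or: include 'mpif.h'
   integer :: worldsize, ierr
   call mpi_comm_size( MPI_COMM_WORLD, worldsize, ierr )
   if ( worldsize > nbead ) then
      write(6,'(a,i6,a,i6)') 'full PIMD: #cpus = ', worldsize, &
                             ' exceeds #beads = ', nbead
      call mexit(6,1)            ! sander's normal error exit
   end if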

Thanks,
Francesco


On Sep 8, 2008, at 9:21 AM, Carlos Simmerling wrote:

> Hi Ross,
> I don't know the PIMD code, but I have updated the NEB code and fixed
> a lot of bugs. We found what is hopefully the last bug over the
> summer, and I plan to check the fix in soon; I've just been busy with
> the start of the semester. I think I'm the only one keeping track of
> the NEB code these days (when I had questions before, nobody else
> replied), so I have no idea whether anyone is maintaining the PIMD
> code. I split some of the NEB initialization out of PIMD because I
> didn't want to change the PIMD code.
> Carlos
>
> On Fri, Sep 5, 2008 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>> Hi All,
>>
>> Is anybody testing / using PIMD? It seems to be irreparably broken in
>> parallel. This is for both amber10 and amber11.
>>
>> I found this out by trying to test NEB:
>>
>> export DO_PARALLEL='mpirun -np 8'
>> cd $AMBERHOME/test/neb/neb_gb/
>> ./Run.neb_classical
>> [cli_0]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>> [cli_4]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>> [cli_5]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>> [cli_6]: program error.
>>
>> The following appears in all of the output files:
>>
>> Parameters:
>> number of beads = 8
>> number of classical atoms = 22
>> temperature (Kelvin) = 300.00
>> ASSERTion 'ierr.eq.0' failed in pimd_init.f at line 320.
>>
>> (same error for both amber10 and amber11)
>>
>> Looking at pimd_init.f, line 320 is:
>>
>> allocate( nrg_all(nbead), stat=ierr )
>> REQUIRE(ierr.eq.0)
>>
>> So it is only trying to allocate nrg_all with size 8, and nbead is set
>> correctly on all nodes, so I cannot see why this is failing; maybe
>> there is memory corruption elsewhere.
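>>
>> (If someone who knows this code wants a quick data point, a throwaway
>> diagnostic between that allocate and the REQUIRE would show whether
>> nbead or ierr is actually garbage on the failing rank. This is only a
>> debugging sketch, not something that is in pimd_init.f, and mytaskid
>> is just my guess at the rank variable that is visible there:)
>>
>> write(6,'(a,3i10)') 'pimd_init debug: rank, nbead, ierr = ', &
>>                      mytaskid, nbead, ierr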
>>
>> BTW, pimd_init.f is also pretty dangerous. For example, just below the
>> allocation above we have:
>>
>> allocate( springforce(3*natomCL) )
>> allocate( tangents(3*natomCL) )
>> allocate( fitgroup(natomCL) )
>> allocate( rmsgroup(natomCL) )
>>
>> so there are 4 allocate statements where NO return value is checked.
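>>
>> A safer pattern, just as a sketch and reusing the stat=/REQUIRE
>> convention from the nrg_all allocation above, would be:
>>
>> allocate( springforce(3*natomCL), stat=ierr )
>> REQUIRE(ierr.eq.0)
>> allocate( tangents(3*natomCL), stat=ierr )
>> REQUIRE(ierr.eq.0)
>> allocate( fitgroup(natomCL), stat=ierr )
>> REQUIRE(ierr.eq.0)
>> allocate( rmsgroup(natomCL), stat=ierr )
>> REQUIRE(ierr.eq.0)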
>>
>> It is not just NEB that is broken, though. If I try to run PIMD
>> itself:
>>
>> export DO_PARALLEL='mpirun -np 8'
>> cd $AMBERHOME/test/
>> make test.sander.PIMD.MPI.partial
>>
>> These all pass.
>>
>> make test.sander.PIMD.MPI.full
>>
>> Everything fails with invalid communicator errors, etc. For example:
>> cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
>> Testing Centroid MD
>> [cli_3]: aborting job:
>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>> MPI_Reduce(714): Null communicator
>> [cli_1]: aborting job:
>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>> MPI_Reduce(714): Null communicator
>> [cli_7]: aborting job:
>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>> MPI_Reduce(714): Null communicator
>> [cli_5]: aborting job:
>> Fatal error in MPI_Reduce: Invalid communicator, error stack:
>> MPI_Reduce(843): MPI_Reduce(sbuf=0x1614130, rbuf=0x60b2330, count=28, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, MPI_COMM_NULL) failed
>> MPI_Reduce(714): Null communicator
>> [cli_0]: aborting job:
>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
>> MPI_Allreduce(696)........................: MPI_Allreduce(sbuf=0x1717c68, rbuf=0x1717c88, count=4, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD) failed
>> MPIR_Allreduce(285).......................:
>> MPIC_Sendrecv(161)........................:
>> MPIC_Wait(321)............................:
>> MPIDI_CH3_Progress_wait(198)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(422):
>> MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=3,errno=104:(strerror() not found))
>> program error.
>>
>> So it looks like it is completely broken here.
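>>
>> The MPI_COMM_NULL in those tracebacks suggests the bead/group
>> communicator never gets created for this cpu count. I have not tracked
>> down which communicator that MPI_Reduce is actually given, but a
>> defensive check of roughly this form (commsander is only my guess at
>> the variable name) would at least turn the crash into a readable
>> error message:
>>
>> if ( commsander == MPI_COMM_NULL ) then
>>    write(6,*) 'PIMD: group communicator was never set up for this ', &
>>               'cpu count'
>>    call mexit(6,1)
>> end if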
>>
>> Interestingly, with 4 cpus these test cases pass, so something goes
>> very wrong when you use anything other than 4 cpus. Additionally, of
>> course, the NEB code, which uses PIMD, does not work.
>>
>> Any suggestions? Who is maintaining the PIMD code these days and wants
>> to fix it and release some bugfixes?
>>
>> All the best
>> Ross
>>
>> /\
>> \/
>> |\oss Walker
>>
>> | Assistant Research Professor |
>> | San Diego Supercomputer Center |
>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>> | http://www.rosswalker.co.uk | PGP Key available on request |
>>
>> Note: Electronic Mail is not secure, has no guarantee of delivery,
>> may not
>> be read every day, and should not be used for urgent or sensitive
>> issues.
>>
>>
>>
>>
>
>
>
> --
> ===================================================================
> Carlos L. Simmerling, Ph.D.
> Associate Professor Phone: (631) 632-1336
> Center for Structural Biology Fax: (631) 632-1555
> CMM Bldg, Room G80
> Stony Brook University E-mail: carlos.simmerling.gmail.com
> Stony Brook, NY 11794-5115 Web: http://comp.chem.sunysb.edu
> ===================================================================
Received on Wed Sep 10 2008 - 06:07:34 PDT