Re: amber-developers: 'MPI_BCAST : Message truncated' error

From: Scott Brozell <sbrozell.scripps.edu>
Date: Tue, 6 Mar 2007 14:43:40 -0800

Hi,

This smells like a multinode MPI issue.
Here are the usual questions:
What is your MPI implementation?
Have you tested your MPI implementation?
Have you tested your MPI implementation for multinode usage?
In particular, internode connectivity via rsh/ssh?
And proper ring setup for MPICH2?
Try 2 total processors, each on a different node ...
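
For instance, a rough multinode sanity check, assuming an MPICH-style mpirun
(the hostnames and the cpi path below are only placeholders for your cluster):

  # passwordless rsh/ssh between nodes should work without prompting
  ssh arde12 hostname

  # for MPICH2 with the mpd process manager: bring up the ring and verify it
  mpdboot -n 2 -f ~/mpd.hosts
  mpdtrace

  # 2 processes, each on a different node, running a known-good MPI test
  # (cpi ships with MPICH; its install path here is only an example)
  printf 'arde11\narde12\n' > /tmp/hosts.test
  mpirun -np 2 -nolocal -machinefile /tmp/hosts.test \
      /usr/local/mpich/examples/cpi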

Scott

On Tue, 6 Mar 2007, Ilyas Yildirim wrote:

> This is a system where I am following the Thermodynamic Integration
> approach, so I have to use multisander. With 2 CPUs, everything is fine,
> though.
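>
> (For context: a two-group TI run like this is driven by a groupfile such as
> the groups_min1 referenced in the mpirun lines quoted below, with one sander
> command line per lambda endpoint. The input and topology names in this
> sketch are only placeholders:
>
>   -O -i min1.in -o min1_v0.out -p prmtop_v0 -c inpcrd_v0 -r restrt_v0
>   -O -i min1.in -o min1_v1.out -p prmtop_v1 -c inpcrd_v1 -r restrt_v1
>
> and sander.MPI is then started with -ng 2 -groupfile pointing at that file.)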
>
> On Tue, 6 Mar 2007, Carlos Simmerling wrote:
>
> > does it work when you are not using multisander?
> >
> > On 3/6/07, Ilyas Yildirim <yildirim.pas.rochester.edu> wrote:
> > > Dear All,
> > >
> > > Using sander.MPI in a minimization with 2 CPUs works fine, but if I try to
> > > use 4/8/... CPUs, it gives me the following error:
> > >
> > > ---------------------------------------------------------------------
> > > arde00:/home/yildirim/test/l_0.2>runmin &
> > > [1] 2428
> > > arde00:/home/yildirim/test/l_0.2>/bin/rm: No match.
> > > mpirun -stdin /dev/null -np 4 -nolocal -machinefile /tmp/tmp.mpi.2434
> > > /home/yildirim/amber9/exe/sander.MPI -ng 2 -groupfile
> > > /home/yildirim/test/l_0.2/groups_min1; rm -f /tmp/tmp.mpi.2434
> > > running on arde11:1 arde12:1 arde13:2
> > >
> > > Running multisander version of sander amber9
> > > Total processors = 4
> > > Number of groups = 2
> > >
> > > Looping over processors:
> > > WorldRank is the global PE rank
> > > NodeID is the local PE rank in current group
> > >
> > > Group = 0
> > > WorldRank = 0
> > > NodeID = 0
> > >
> > > WorldRank = 1
> > > NodeID = 1
> > >
> > > Group = 1
> > > WorldRank = 2
> > > NodeID = 0
> > >
> > > WorldRank = 3
> > > NodeID = 1
> > >
> > > p3_19669: p4_error: : 14
> > > 3 - MPI_BCAST : Message truncated
> > > [3] Aborting program !
> > > [3] Aborting program!
> > > p1_7187: p4_error: : 14
> > > 1 - MPI_BCAST : Message truncated
> > > [1] Aborting program !
> > > [1] Aborting program!
> > > rm_l_3_19670: (2.024163) net_send: could not write to fd=5, errno = 32
> > > rm_l_1_7188: (2.869610) net_send: could not write to fd=5, errno = 32
> > > p2_12533: p4_error: net_recv read: probable EOF on socket: 1
> > > rm_l_2_12534: (2.259215) net_send: could not write to fd=5, errno = 32
> > > p1_7187: (2.871182) net_send: could not write to fd=5, errno = 32
> > > p2_12533: (6.264409) net_send: could not write to fd=5, errno = 32
> > > mpirun -stdin /dev/null -np 4 -nolocal -machinefile /tmp/tmp.mpi.2595
> > > /home/yildirim/amber9/exe/sander.MPI -ng 2 -groupfile
> > > /home/yildirim/test/l_0.2/groups_min2; rm -f /tmp/tmp.mpi.2595
> > > ---------------------------------------------------------------------
> > >
> > > For the MD runs, I don't see any problems (they run with 4/8/... CPUs). The
> > > system is an 8-mer solvated with water. I was wondering if this is normal
> > > for AMBER 9, or if I am missing something. Thanks.
Received on Wed Mar 07 2007 - 06:07:46 PST