So, you encounter an error in a slave. What do you do? MPI_Abort(), at
least in all the docs I have seen, has no restrictions on who can call it,
and calling it brings down all tasks in the communicator (and possibly all
tasks in the job, period, depending on the implementation). Of course, any
given implementation may not handle cleanup correctly, resulting in hung
tasks, but generally hung tasks are due to other causes. A common one is a
hardware failure that isolates an MPI task: the other tasks then hang
waiting on it because the hardware error detection is not up to snuff. Some
implementations can also deadlock while trying to get MPI buffers (I have
seen this with MPICH on Linux clusters, especially for sander, which is kind
of a hog for MPI buffers). One reason I retain a "SLOW_NONBLOCKING_MPI"
implementation of pmemd is for use on such clusters: it uses synchronous
communication, which takes less buffer space (but would be deadly slow at
very high processor counts).
Regards - Bob
----- Original Message -----
From: "Daniel R. Roe" <daniel.r.roe.gmail.com>
To: <amber-developers.scripps.edu>
Sent: Friday, November 16, 2007 4:50 PM
Subject: amber-developers: Question regarding use of mexit()
> Hello all, Dan Roe from the Simmerling lab here. I had a question as to
> the correct use of mexit() calls.
>
> According to the source code I have (current as of 11-06-2007), mexit(6,0)
> will call MPI_FINALIZE, while mexit(6,1) will call MPI_ABORT.
>
> From what I have read of these procedures in the MPI documentation,
> MPI_FINALIZE should be called by all processes for normal termination.
> However, it seems that since MPI_ABORT tries to kill all processes in the
> given communicator (comm_world in this case), it should only be called by
> the overall master (worldrank==0). If multiple processes call MPI_ABORT
> for the same communicator, will this result in some processes hanging
> because they end up waiting for other processes to try the abort?
>
> If this is the case then mexit(6,1) should only be called by worldrank==0,
> and it should be ensured that mexit(6,0) is executed by all processes.
> Does this make sense?
>
> Thanks!
>
> -Dan Roe
>
> --
> -------------------------
> Daniel R. Roe, Ph.D.
> Department of Chemistry
> Stony Brook University
> Stony Brook, NY, 11790
> 631-632-1560
>
>
Received on Sun Nov 18 2007 - 06:07:53 PST