Re: amber-developers: Question regarding use of mexit() from Robert Duke on 2007-11-16 (Amber Developers Archive Nov 2007)

From: Robert Duke <rduke.email.unc.edu>
Date: Fri, 16 Nov 2007 23:54:46 -0500

Okay, but be aware that if you hit an error in one task, slave or master,
and call MPI_Finalize() instead of MPI_Abort(), it is highly likely that a
correct implementation will hang, at least for a while, as the other tasks
block on attempted communication with the task that exited. So you do want
to be careful to use these two calls in the correct contexts. You want your
code to first of all work correctly with a correct MPI implementation; then
pick on the vendors that have incorrect MPI implementations (Mike's list
of who doesn't work...) :-)
Best Regards - Bob
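
A minimal sketch of the distinction above, in C. It is an illustration
under stated assumptions, not Amber's actual mexit(): the names
choose_exit_action() and exit_sketch() and the USE_MPI build flag are
hypothetical. The assumption is that a nonzero status means an error
seen by a single task and a zero status means a normal shutdown that
every task reaches.

```c
#include <stdlib.h>
#ifdef USE_MPI              /* hypothetical build flag, not taken from Amber */
#include <mpi.h>
#endif

/* How a task should leave the parallel run. */
typedef enum { EXIT_VIA_FINALIZE, EXIT_VIA_ABORT } exit_action;

/* Pure decision: zero status is treated as a normal, collective shutdown;
   any nonzero status is treated as an error hit by (possibly) one task. */
exit_action choose_exit_action(int status)
{
    return status == 0 ? EXIT_VIA_FINALIZE : EXIT_VIA_ABORT;
}

/* Hypothetical stand-in for an mexit()-style routine. */
void exit_sketch(int status)
{
#ifdef USE_MPI
    if (choose_exit_action(status) == EXIT_VIA_ABORT) {
        /* Error path: ask the implementation to terminate all tasks in the
           communicator, so peers are not left blocking on a task that is gone. */
        MPI_Abort(MPI_COMM_WORLD, status);
    } else {
        /* Normal path: every task is expected to reach this call. */
        MPI_Finalize();
    }
#endif
    exit(status);
}
```

Built without the flag, this reduces to a plain exit(status); built with
an MPI compiler wrapper and -DUSE_MPI, the error path goes through
MPI_Abort() and only the clean path goes through MPI_Finalize(). Whether
a given implementation then actually tears the job down is a separate
question.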

----- Original Message -----
From: "Thomas Cheatham III" <tec3.utah.edu>
To: <amber-developers.scripps.edu>
Sent: Friday, November 16, 2007 7:10 PM
Subject: Re: amber-developers: Question regarding use of mexit()

>
>> Sounds like it should be the reverse(?): any process can call mexit(6,1),
>> but everyone should call mexit(6,0). Someone would have to do a code audit
>> to see how consistent we are in Amber on doing this.
>
> ...however, as Bob pointed out, it is probably moot and highly
> implementation dependent as to what is actually done by the parent MPI
> process. I think there is no guarantee that either call will actually
> exit appropriately (in all implementations) and certainly
> implementations are not good enough to detect hardware timeouts, etc...
>
> Since MPI has no ability to create new threads or processes, the
> language/implementation does not have to be clear about graceful exit and
> likely the implementations are all over the place. In fact, this is a
> huge current problem on "bigred" at IU as you cannot chain MPI jobs in a
> run since they do not end properly (when your job exits the queue
> a special job goes around killing processes) although I haven't checked as
> to whether the culprit is the mpi_finalize vs. mpi_abort; maybe I'll try
> this and see if I can get sander chained on that machine...
>
> --tom
>
Received on Sun Nov 18 2007 - 06:07:56 PST