[AMBER-Developers] How is this a race error?

From: <dcerutti.rci.rutgers.edu>
Date: Tue, 15 Nov 2011 00:07:30 -0500 (EST)


I've been hunting through the mdgx code for whatever is giving me problems
when I try to run parallel TI. I think it's actually something more
general that's coming to light as I perform more parallel simulations.

What seems to be happening is some sort of race error involving the
way I'm using MPI_Waitall. It crops up in the function
ExtraPointLocations() of the VirtualSites.c library, and in the
UpdateCells() function of the CellManip.c library. The way in which I'm
doing the communication is really no different than in other functions
where no problems occur; however, I suspect that there may be something
generally wrong with the way I'm doing this that I haven't foreseen. It's
almost like I am having problems because I allow a process to go forward
once it has finished sending all of its messages to other processors and
receiving messages from all processors that it plans to receive from. If
I add, for instance, system("sleep 0.1") commands in between all the sends
and receives, these apparent race errors disappear. I am wondering if this
happens in the ExtraPointLocations() and UpdateCells() functions because
those two functions involve back-to-back pulses of communication involving
different messaging plans.

I wonder if I'm getting situations like this:

- Processes A, B, C, and D
- Plan ONE on process A:
   Send to B
   Send to C
   Recv from D
- Plan TWO on process A:
   Send to B
   Recv from C

- Execute plan ONE on all processes (plan ONE for A is detailed above)
- MPI_Waitall
- Execute plan TWO on all processes (plan TWO for A is detailed above)
- MPI_Waitall

I have taken care to set the tags for each send / recv message pair
differently in plans ONE and TWO; in plan ONE, the tags are set by the formula

(# sender)*(# threads total) + (# receiver)

and in plan TWO they are set by the formula

(# threads total)^2 + (# sender)*(# threads total) + (# receiver)

However, this doesn't seem to have fixed any problems. I am also being
careful to use different pointers for communication plans ONE and TWO. Is
there somehow a case where process A finishes with plan ONE and the
subsequent wait, then proceeds to implement plan TWO and somehow
interferes with another process that may not have finished plan ONE? If
so, then I need to rethink how I implement the MPI_Waitall calls,
because there are other instances of this sort of code that merely
separate the implementation of different plans by some calculations, and
those would be just as vulnerable to the problem.

Can anyone suggest other ideas on what may be happening? If anyone is
willing to "sit down" with me virtually, I can provide my latest code, with
some debugging apparatus in it, and a test case that produces the
error.


AMBER-Developers mailing list
Received on Mon Nov 14 2011 - 21:30:03 PST