Thinking more about this... perhaps the problem is in fact confined to
TI runs. What may be happening is that I have two data structures in that
case, each divided over the various processes in a similar manner and each
with identical communication schedules. The calculations (and
communications) for each data structure occur consecutively throughout the
program. It may be that one process A can finish up with the first data
structure and start on the second while some other process B is still
working on the first data structure. B may not have been on A's
communication schedule as it was finishing data structure 1, but may be
on A's communication schedule as A begins to work on data structure 2, so
B would get a message from A that might foul up the messages it was
expecting. Is that possible?
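
To make that concrete, here is a minimal sketch of the kind of interleaving
I have in mind (the function, buffer, and partner names are invented for
illustration, this is not the actual mdgx code, and I'm assuming wildcard
receives for the sake of the example):

#include <mpi.h>

/* Hypothetical sketch, not the mdgx code: two back-to-back communication
 * pulses over two data structures.  MPI_Waitall only guarantees that THIS
 * rank's requests are locally complete; a slower rank B may still be
 * sitting in pulse 1 when a faster rank A's pulse-2 send arrives, and if
 * B's pending pulse-1 receive uses wildcards (or a tag that pulse 2 can
 * also produce), it will happily swallow the pulse-2 message.            */
void TwoPulses(double *send1, double *recv1, double *send2, double *recv2,
               int n, int partner1, int partner2)
{
  MPI_Request req[2];

  /* Pulse 1: data structure 1 */
  MPI_Irecv(recv1, n, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
            MPI_COMM_WORLD, &req[0]);
  MPI_Isend(send1, n, MPI_DOUBLE, partner1, 1, MPI_COMM_WORLD, &req[1]);
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

  /* Pulse 2: data structure 2.  A fast process reaches this point while a
   * slow process is still in pulse 1; nothing but the tag, source, and
   * communicator decides which of the slow process's pending receives
   * this send will match.                                                */
  MPI_Irecv(recv2, n, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
            MPI_COMM_WORLD, &req[0]);
  MPI_Isend(send2, n, MPI_DOUBLE, partner2, 2, MPI_COMM_WORLD, &req[1]);
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}

If that is the situation, the system("sleep 0.1") calls would only be hiding
the interleaving, not fixing anything.
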
Dave
> Hello,
>
> I've been hunting through the mdgx code for whatever is giving me problems
> when I try to run parallel TI. I think it's actually something more
> general that's coming to light as I perform more parallel simulations.
>
> What seems to be happening is some sort of race condition involving the
> way I'm using MPI_Waitall. It crops up in the function
> ExtraPointLocations() of the VirtualSites.c library, and in the
> UpdateCells() function of the CellManip.c library. The way in which I'm
> doing the communication is really no different than in other functions
> where no problems occur; however, I suspect that there may be something
> generally wrong with the way I'm doing this that I haven't foreseen. It's
> almost as if I am having problems because I allow a process to go forward
> once it has finished sending all of its messages to other processes and
> receiving messages from all of the processes it plans to receive from. If
> I add, for instance, system("sleep 0.1") commands in between all the sends
> and receives, these apparent race conditions disappear. I am wondering if
> this happens in the ExtraPointLocations() and UpdateCells() functions
> because those two functions involve back-to-back pulses of communication
> with different messaging plans.
>
> I wonder if I'm getting situations like this:
>
> Given:
> - Processes A, B, C, and D
> - Plan ONE on process A:
> Send to B
> Send to C
> Recv from D
> - Plan TWO on process A:
> Send to B
> Recv from C
>
> Code:
> - Execute plan ONE on all processes (plan ONE for A is detailed above)
> - MPI_Waitall
> - Execute plan TWO on all processes (plan TWO for A is detailed above)
> - MPI_Waitall
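>
> In terms of the actual MPI calls, I picture plans ONE and TWO on process
> A looking roughly like the sketch below (a simplified illustration with
> invented buffer and tag names, not a copy of the mdgx routines; the tags
> would be set by the formulas described next):
>
> #include <mpi.h>
>
> /* Simplified sketch of plans ONE and TWO on process A.  All buffer and
>  * tag names are invented for illustration; the tag values would come
>  * from the formulas described below.                                   */
> void PlanOneAndTwoOnA(double *sendB1, double *sendC1, double *recvD1,
>                       double *sendB2, double *recvC2, int n,
>                       int rankB, int rankC, int rankD,
>                       int tagAtoB1, int tagAtoC1, int tagDtoA1,
>                       int tagAtoB2, int tagCtoA2)
> {
>   MPI_Request req[3];
>
>   /* Plan ONE: send to B, send to C, receive from D */
>   MPI_Isend(sendB1, n, MPI_DOUBLE, rankB, tagAtoB1, MPI_COMM_WORLD, &req[0]);
>   MPI_Isend(sendC1, n, MPI_DOUBLE, rankC, tagAtoC1, MPI_COMM_WORLD, &req[1]);
>   MPI_Irecv(recvD1, n, MPI_DOUBLE, rankD, tagDtoA1, MPI_COMM_WORLD, &req[2]);
>   MPI_Waitall(3, req, MPI_STATUSES_IGNORE);
>
>   /* Plan TWO: send to B, receive from C.  Once the MPI_Waitall above
>    * returns, A proceeds here regardless of what B, C, and D are doing. */
>   MPI_Isend(sendB2, n, MPI_DOUBLE, rankB, tagAtoB2, MPI_COMM_WORLD, &req[0]);
>   MPI_Irecv(recvC2, n, MPI_DOUBLE, rankC, tagCtoA2, MPI_COMM_WORLD, &req[1]);
>   MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
> }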
>
> I have taken care to set the tags for each send / recv message pair
> differently in plans ONE and TWO; in plan ONE, the tags are set by the
> formula
>
> (# sender)*(# threads total) + (# receiver)
>
> and in plan TWO they are set by the formula
>
> (# threads total)^2 + (# sender)*(# threads total) + (# receiver)
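>
> In code, with nthreads standing for the total number of threads (a
> hypothetical helper, not an actual mdgx function), that amounts to:
>
> /* Hypothetical helper showing how the tags are built; "plan" is 0 for
>  * plan ONE and 1 for plan TWO.                                        */
> int MessageTag(int plan, int sender, int receiver, int nthreads)
> {
>   return plan*nthreads*nthreads + sender*nthreads + receiver;
> }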
>
> However, this doesn't seem to have fixed any problems. I am also being
> careful to use different pointers for communication plans ONE and TWO. Is
> there somehow a case where process A finishes with plan ONE and the
> subsequent wait, then proceeds to implement plan TWO and somehow
> interferes with another process that may not have finished plan ONE? If
> so, then I need to rethink how I implement the MPI_Waitall commands,
> because there are other instances of this sort of code that merely
> separate the implementation of different plans by some calculations and
> would otherwise be just as vulnerable to the problem.
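>
> One way I could test this hypothesis (purely as a diagnostic, and my own
> assumption rather than anything already in the code) would be to force
> every process to finish plan ONE before any process starts plan TWO:
>
> #include <mpi.h>
>
> /* Diagnostic only (hypothetical wrapper, not an mdgx routine): a barrier
>  * after the MPI_Waitall keeps any rank from posting plan TWO messages
>  * while another rank still has plan ONE receives outstanding.  If the
>  * failures go away with this in place, cross-plan interference is the
>  * likely culprit.                                                       */
> void WaitAndFence(int nreq, MPI_Request *req)
> {
>   MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);  /* local completion */
>   MPI_Barrier(MPI_COMM_WORLD);                  /* all ranks finish */
> }
>
> That's not a fix I would want to keep, but it would at least confirm where
> to look.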
>
> Can anyone suggest other ideas on what may be happening? If anyone is
> willing to "sit down" with me virtually I can provide my latest code, with
> some debugging apparatuses in it, and a test case that produces the
> behavior.
>
> Thanks,
>
> Dave
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Mon Nov 14 2011 - 23:30:03 PST