Re: [AMBER-Developers] How is this a race error?

From: <>
Date: Wed, 16 Nov 2011 12:39:00 -0500 (EST)

I'm pushing up a new revision to mdgx... Cruise Control is gonna fly off
the rails!

But, at Dave Case's suggestion I stuck an MPI_Barrier behind each of my
MPI_Waitall calls, and this has expunged, at least for thousands of steps,
all of the strange uninitialized values I was getting at various
MPI_Waitall calls in the past. I had been ignoring them, since MPI
libraries are known not to be entirely valgrind-clean, but it looks like
they were race errors all along.
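A minimal sketch of that workaround (the function name and request array here are hypothetical, not mdgx's actual code): a barrier after each Waitall fences the whole round, so no rank races ahead into the next round of communication while messages are still in flight.

```c
#include <mpi.h>

/* Sketch: complete this rank's sends/recvs, then fence the round. */
void exchange_round(MPI_Request *reqs, int nreq)
{
  MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

  /* MPI_Waitall only guarantees THIS rank's requests are complete;
     other ranks may still be mid-round.  The barrier keeps any rank
     from posting next-round sends/recvs that could match leftover
     same-tag traffic from this round. */
  MPI_Barrier(MPI_COMM_WORLD);
}
```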

So, my plan for fixing this more generally is to go back through the
code and make sure that EVERY message passed in ANY function has a
totally unique tag. Right now I have it set up so that, in each round of
communication, a message gets the tag (# sender)*(# threads) + (#
receiver). However, messages from separate rounds of communication can
share the same tag. I had thought that MPI_Waitall() would prevent these
rounds of communication from overlapping, but because I have several
different communication plans in use at various times, I think race
errors can occur and messages can get crossed.

But I'm not sure about some things here. Even if all the messages have
unique tags, if I remove all the MPI_Barrier() calls I could still have
messages coming in or going out before the corresponding MPI_Irecv() has
been posted. What happens if process B gets a message from process A when
it hasn't even posted a recv for a message from process A? What happens
if process B DOES expect a message from A and has a recv up, but gets a
message from A with the wrong tag?

Any help is very appreciated!


AMBER-Developers mailing list
Received on Wed Nov 16 2011 - 10:00:03 PST