amber-developers: Parallel Problems...

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 12 Oct 2006 14:29:21 -0700

Hi All,

Okay I have tracked down where the parallel problems are coming from. The
problem was introduced by somebody who, with perhaps the exception of Dave
Case, should know better than all of us but they shall remain nameless. ;-)

However, the problem itself highlights a wider issue with the way we
currently implement things in parallel: we rely on the assumption that data
is contiguous in memory.

The problem comes about because we continually broadcast things that used to
be in common blocks. E.g.:

   ! parms.h:

   call mpi_bcast(rk,num_BC_PARMR,MPI_DOUBLE_PRECISION,0,commsander,ierr)

Which comes from parms.f

        #define BC_PARMR 23320
        #define BC_PARMI 1200

        integer, parameter :: num_bc_parmr = BC_PARMR, num_bc_parmi = BC_PARMI
        integer, parameter :: MAX_BOND_TYPE = 5000 !NUMBND
        integer, parameter :: MAX_ATOM_TYPE = 100 !NATYP

        _REAL_ :: &
              rk(MAX_BOND_TYPE),req(MAX_BOND_TYPE),tk(900),teq(900),pk(1200), &
              pn(1200),phase(1200),cn1(1830),cn2(1830),solty(MAX_ATOM_TYPE), &
              gamc(1200),gams(1200),fmn(1200), &
              asol(200),bsol(200),hbcut(200)
        !common/rparms/rk,req,tk,teq,pk, &
        ! pn,phase,cn1,cn2,solty, &
        ! gamc,gams,fmn, &
        ! asol,bsol,hbcut

But here we see the problem. Somebody commented out, and then later removed
(in the current tree), the common block that contained rk. This is fine in
serial because we use things out of the module. However, the mpi_bcast call
sends num_BC_PARMR values starting at rk, which makes the implicit assumption
that everything from rk onwards is contiguous in memory. With the common
block gone there is no requirement for this to be true. It is pure luck that
this is the case with some compilers, but with many others it is not, hence
parallel runs are completely screwed.

For the moment I am going to try to locate all of the 'missing' common
blocks and put them back in. However, I think that this is still a hack and
that if we are ultimately planning on moving everything over to modules we
need to be extremely careful when modifying things: test in parallel with
multiple compilers. I also think we should do away with all this
broadcasting of blocks of memory and do it the correct way: either pack a
broadcast array with the information to be sent, or send each of the arrays
separately. Alternatively, we could build proper structures and define our
own MPI datatypes for these structures.
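
As a rough illustration of the "pack a broadcast array" option, something
along the following lines might do. This is only a sketch: the module name
bcast_pack_sketch, the routine bcast_packed and its arguments are made up
for the example and are not in the tree, and double precision stands in for
_REAL_. It assumes every rank passes arrays of the same sizes.

```fortran
! Sketch only: copy several arrays into one contiguous buffer, broadcast
! the buffer once, then copy back out on the receiving ranks.
! All names here are hypothetical and not part of sander.
module bcast_pack_sketch
   implicit none
contains
   subroutine bcast_packed(a, b, c, root, comm)
      include 'mpif.h'
      double precision, intent(inout) :: a(:), b(:), c(:)
      integer, intent(in) :: root, comm
      double precision, allocatable :: buf(:)
      integer :: na, nb, nc, rank, ierr

      na = size(a)
      nb = size(b)
      nc = size(c)
      allocate(buf(na + nb + nc))

      call mpi_comm_rank(comm, rank, ierr)

      ! Only the root holds valid data, so only the root packs.
      if (rank == root) then
         buf(1:na)             = a
         buf(na+1:na+nb)       = b
         buf(na+nb+1:na+nb+nc) = c
      end if

      ! buf is a single array, so it is contiguous whatever the
      ! compiler does with a, b and c themselves.
      call mpi_bcast(buf, size(buf), MPI_DOUBLE_PRECISION, root, comm, ierr)

      ! Every other rank unpacks into its own copies.
      if (rank /= root) then
         a = buf(1:na)
         b = buf(na+1:na+nb)
         c = buf(na+nb+1:na+nb+nc)
      end if

      deallocate(buf)
   end subroutine bcast_packed
end module bcast_pack_sketch
```

The point of the design is that the only contiguity assumption left is on a
buffer we allocate ourselves, at the cost of one extra copy on each side.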

Since most of these broadcasts are only done during initial setup it
probably won't hurt us to just broadcast each array individually.
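
For the parms.f arrays shown above, broadcasting each array individually
might look roughly like this. Again only a sketch: the module name
parms_sketch and the routine name bcast_parms_sketch are invented for the
example, double precision stands in for _REAL_, and the communicator is
passed in rather than taken from commsander.

```fortran
! Sketch only: one mpi_bcast per array, with the count taken from the
! array itself, so nothing depends on how the arrays sit in memory.
! Array names and sizes follow the parms.f excerpt; the module and
! routine names are hypothetical.
module parms_sketch
   implicit none
   integer, parameter :: MAX_BOND_TYPE = 5000
   integer, parameter :: MAX_ATOM_TYPE = 100
   double precision :: rk(MAX_BOND_TYPE), req(MAX_BOND_TYPE)
   double precision :: tk(900), teq(900)
   double precision :: pk(1200), pn(1200), phase(1200)
   double precision :: cn1(1830), cn2(1830), solty(MAX_ATOM_TYPE)
   double precision :: gamc(1200), gams(1200), fmn(1200)
   double precision :: asol(200), bsol(200), hbcut(200)
end module parms_sketch

subroutine bcast_parms_sketch(comm)
   use parms_sketch
   implicit none
   include 'mpif.h'
   integer, intent(in) :: comm
   integer :: ierr

   ! Rank 0 is the root, as in the existing broadcast.
   call mpi_bcast(rk,    size(rk),    MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(req,   size(req),   MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(tk,    size(tk),    MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(teq,   size(teq),   MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(pk,    size(pk),    MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(pn,    size(pn),    MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(phase, size(phase), MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(cn1,   size(cn1),   MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(cn2,   size(cn2),   MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(solty, size(solty), MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(gamc,  size(gamc),  MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(gams,  size(gams),  MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(fmn,   size(fmn),   MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(asol,  size(asol),  MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(bsol,  size(bsol),  MPI_DOUBLE_PRECISION, 0, comm, ierr)
   call mpi_bcast(hbcut, size(hbcut), MPI_DOUBLE_PRECISION, 0, comm, ierr)
end subroutine bcast_parms_sketch
```

Using size() for the counts also removes the need to keep a hand-maintained
total such as BC_PARMR in step with the array declarations.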

Anyway, for the time being please do not check anything into CVS while I
attempt to unpick things.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
Received on Sun Oct 15 2006 - 06:07:03 PDT