Re: [AMBER-Developers] pmemd.MPI build broken

From: Jason Swails <jason.swails.gmail.com>
Date: Sat, 5 Mar 2016 13:33:10 -0500

On Sat, Mar 5, 2016 at 11:25 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

>
> > On Mar 5, 2016, at 06:29, David A Case <david.case.rutgers.edu> wrote:
> >
> > On Sat, Mar 05, 2016, Jason Swails wrote:
> >>
> >> Also, when I switch to using OpenMPI *without* dragonegg, the linker line
> >> still needs -lgomp to complete successfully, so the build doesn't really
> >> work in general yet.
> >
> > Sounds like it's been tested mostly (only) with mpich and variants.(?) It's
> > surprising that the flavor of MPI library has an impact on the openmp stuff.
> > Maybe I'm misreading something.
> >
> > I've posted my gnu5 + mpich test results to the wiki page: I'm at commit
> > 2d5d9afbc305bfbca01. Build is fine, but I see significant (non-roundoff)
> > regressions.
>
> Can you try
>
> export OMP_NUM_THREADS=2, mpirun -np 2
>
> and see if you get the same errors please.
>
> It might be resource related - e.g. if you have 8 cores and do mpirun -np 4
> without setting OMP_NUM_THREADS you get 32 threads total for the GB
> cases. (This will be addressed in documentation shortly.)
>

This is dangerous and undesirable behavior, in my opinion. Adding it to
the documentation is not a fix. For the longest time, ./configure -openmp
was required to get OpenMP parallelism, and MPI-parallelized programs
spawned one MPI process for every CPU you wanted to use. This behavior has
changed for pmemd, so now if somebody runs a script they used for Amber 14
and earlier with Amber 16, they will get the same answers (once the
regressions are fixed), and the output will report the same number of MPI
processes, but performance will tank while they thrash their resources.
The same thing happens if they replace "sander" with "pmemd" (which has
always been the recommendation for improved performance, except where a
feature is only supported in one or the other). UI compatibility with
sander has always been a cornerstone of pmemd.
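
To make the failure mode concrete, take a hypothetical Amber 14-era launch
script on an 8-core workstation (the input file names are placeholders);
following the arithmetic in Ross's example above:

# Old behavior: 8 MPI processes, one per core.
# New behavior (OpenMP compiled in, OMP_NUM_THREADS unset): each of the 8
# processes spawns 8 OpenMP threads for the GB code path, i.e. 64 threads
# competing for 8 cores.
mpirun -np 8 pmemd.MPI -O -i mdin -o mdout -p prmtop -c inpcrd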

Mixed OpenMP-MPI has its place for sure -- MICs and dedicated
supercomputers with many cores per node. But for commodity clusters and
single workstations, I see this as more of an obstacle than a benefit. For
instance -- how do we parallelize on a single workstation now? I would
naively think you would need to do

mpirun -np 1 pmemd.MPI -O -i ...

and let OpenMP parallelize. But no, that doesn't work, because of this
check in pmemd.F90 (around line 176):

#ifdef MPI
#ifndef CUDA
  if (numtasks .lt. 2 .and. master) then
    write(mdout, *) &
      'MPI version of PMEMD must be used with 2 or more processors!'
    call mexit(6, 1)
  end if
#endif
#endif /*MPI*/

So how do you do it? Well, you can do this:

export OMP_NUM_THREADS=1
mpirun -np 16 pmemd.MPI -O -i ...

Or you would need to do something like

export OMP_NUM_THREADS=8
mpirun -np 2 pmemd.MPI -O -i ...

Which is better? Why? What safeguards do we have in there to avoid people
thrashing their systems? What's the difference on a commodity cluster
(say, parallelizing across ~4-8 nodes with a total of ~60 CPUs) between
pmemd.MPI with and without OpenMP? I profiled pmemd.MPI's GB scaling
several years ago, and I was rather impressed -- despite the allgatherv at
every step, I could never hit the scaling ceiling for a large system.
Of course sander.MPI's GB scaling is quite good as well (not surprisingly,
since it's really the same code). So now that we have all this added
complexity of how to run these simulations "correctly", what's the win in
performance?
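
As a straw man for the kind of safeguard I have in mind -- whether it lives
in pmemd itself at startup or in a launch wrapper -- a sanity check could be
as simple as the sketch below (this is not Amber code; the rank count and
file names are placeholders for a single-workstation run):

# Hypothetical launch wrapper: refuse to run when MPI ranks x OpenMP threads
# would oversubscribe this machine's cores.
nranks=4
export OMP_NUM_THREADS=${OMP_NUM_THREADS:-2}
ncores=$(nproc)
if [ $(( nranks * OMP_NUM_THREADS )) -gt "$ncores" ]; then
    echo "Error: $nranks ranks x $OMP_NUM_THREADS threads > $ncores cores" >&2
    exit 1
fi
mpirun -np "$nranks" pmemd.MPI -O -i mdin -o mdout -p prmtop -c inpcrd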

IMO, MPI/OpenMP is a specialty mix. You use it when you are trying to
really squeeze out the maximum performance on expensive hardware -- when
you try to tune the right mix of SMP and distributed parallelism on
multi-core supercomputers or harness the capabilities of an Intel MIC. And
it requires a bit of tuning and experimentation/benchmarking to get the
right settings for your desired performance on a specific machine for a
specific system. And for that it's all well and good. But to take
settings that are optimized for these kinds of highly specialized
architectures and make that the default (and *only supported*) behavior on
*all* systems seems like a rather obvious mistake from the typical user's
perspective.
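
For a sense of what that tuning actually looks like, here is a sketch of the
sort of per-machine sweep I mean (the 16-core node, the thread counts, and
the file names are assumptions; OMP_PLACES and OMP_PROC_BIND are the
standard OpenMP controls for pinning threads, and rank placement is left to
whatever the local MPI launcher does by default):

# Hypothetical benchmark sweep on one 16-core node: try several rank/thread
# splits and compare the ns/day each one reports.
export OMP_PLACES=cores
export OMP_PROC_BIND=close
for nranks in 2 4 8 16; do
    export OMP_NUM_THREADS=$(( 16 / nranks ))
    mpirun -np "$nranks" pmemd.MPI -O -i mdin -o mdout.np${nranks} \
        -p prmtop -c inpcrd
done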

This is speculation, but it is based on real-world experience. A huge
problem here is that we have never seen this code before (it simply
*existing* on an obscure branch somewhere doesn't count -- without the code
being in master, *or* an explicit call for testers, nobody will touch a
volatile branch they know nothing about). So nobody has any idea how this
is going to play out in the wild, and there is so little time between now
and release that I don't think we could possibly get that answer. (And in
my experience, the actual developer of the code is unqualified to
accurately anticipate the challenges typical users will face.) This feels
very http://bit.ly/1p7gB68 to me.

The two things I think we should do are:

1) Make OpenMP an optional add-in that you get when you configure with
-openmp -mpi (or with -mic), and make it a separate executable so people
will only run that code when they know that's precisely what they want to
run.

2) Wait to release it until a wider audience of developers has actually
gotten a chance to use it.

This is a large part of why we institute a code freeze well before release.
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers