Re: [AMBER-Developers] more pmemd.cuda.MPI issues

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sun, 5 Dec 2010 21:20:30 -0800

Hi Jason,

Okay, I modified the code and pushed it to git. Please try building again
with NO_NTT3_SYNC still enabled and see if the problem on GPUs goes away.
The results should match a regular build (without NO_NTT3_SYNC), so if
you can check that, it would be very helpful.
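
For reference, here is a minimal sketch of what such a guard could look like
in the Fortran source. This is illustrative only: the CUDA preprocessor
symbol and the init_rng_stream routine name are assumptions, not the actual
pmemd identifiers. The idea is that the per-rank seed offset introduced by
-DNO_NTT3_SYNC is skipped entirely when the code is built for the GPU, which
keeps its own single random number stream.

! Sketch: make the NO_NTT3_SYNC per-rank seeding a no-op for CUDA builds.
! Identifier names are assumptions chosen for illustration.
subroutine setup_ntt3_seed(ig, my_rank)
  implicit none
  integer, intent(in) :: ig       ! random seed from the mdin file
  integer, intent(in) :: my_rank  ! MPI rank of this task

#if defined(NO_NTT3_SYNC) && !defined(CUDA)
  ! CPU MPI build: give each rank its own stream so the Langevin random
  ! numbers do not have to be synchronized across tasks.
  call init_rng_stream(ig + my_rank)
#else
  ! Serial, default, and GPU builds: one common seed; pmemd.cuda manages
  ! its own random number stream on the device.
  call init_rng_stream(ig)
#endif
end subroutine setup_ntt3_seed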

All the best
Ross

> -----Original Message-----
> From: Ross Walker [mailto:ross.rosswalker.co.uk]
> Sent: Sunday, December 05, 2010 8:25 PM
> To: 'AMBER Developers Mailing List'
> Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
>
> Hi Jason,
>
> Okay, that is VERY weird since the code related to NO_NTT3_SYNC is not
> used when running on GPUs and so should have no effect whatsoever. On
> the GPU it uses its own random number stream. My guess is thus that the
> setting of different random seeds on each MPI thread when using
> NO_NTT3_SYNC is what is causing the problem. I'll modify this now so it
> has no effect when running on the GPU.
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: Jason Swails [mailto:jason.swails.gmail.com]
> > Sent: Sunday, December 05, 2010 1:38 PM
> > To: AMBER Developers Mailing List
> > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> >
> > Looks like it's related to the -DNO_NTT3_SYNC flag. I got the same
> > garbage results using OpenMPI with -DNO_NTT3_SYNC, but turning off that
> > flag brought me back down to what Ross was getting. I'll see if this
> > was the cause of the problems that Bill was seeing before.
> >
> > One thing worth considering is to make that flag fatal for CUDA builds,
> > but it's not really documented anyway. In any case, I thought I would
> > follow up. I'll report back on the performance of mvapich2 (the default
> > on Lincoln) without -DNO_NTT3_SYNC.
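
If the "make it fatal" route were taken, one possible approach (a sketch
only, assuming both CUDA and NO_NTT3_SYNC are visible to the C preprocessor
in the same source file) would be a compile-time check along these lines:

#if defined(CUDA) && defined(NO_NTT3_SYNC)
#error "NO_NTT3_SYNC is not supported for pmemd.cuda; remove -DNO_NTT3_SYNC from config.h"
#endif

A check in the configure script that rejects the flag for -cuda builds would
achieve the same thing at an earlier stage.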
> >
> > All the best,
> > Jason
> >
> > On Sun, Dec 5, 2010 at 2:25 PM, Scott Le Grand <SLeGrand.nvidia.com>
> > wrote:
> >
> > > Can you try installing the latest OpenMPI and using that instead? I
> > > am seeing all sorts of sensitivity to MPI libraries and even to
> > > specific builds of them.
> > >
> > >
> > > -----Original Message-----
> > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > Sent: Sunday, December 05, 2010 11:13
> > > To: AMBER Developers Mailing List
> > > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> > >
> > > Hi Ross,
> > >
> > > A couple of differences between our config.h files: it doesn't appear
> > > that you set MPI_HOME. Where you have -I/include, I have
> > > -I/usr/local/mvapich2-1.2-intel-ofed-1.2.5.5/include . Also, I set
> > > -DNO_NTT3_SYNC; would this break things? Using my config.h file, I'm
> > > getting 20 ns/day in serial (compared to your 23), and in parallel I
> > > was getting junk at a rate of ~35 ns/day, which is considerably
> > > different from your 23.
> > >
> > > I'm trying again without -DNO_NTT3_SYNC, but I'm curious as to what
> > > effect not setting MPI_HOME has on your build, although the Fortran
> > > compiler should be picking up the mpif.h includes... Is MPI_HOME
> > > completely unnecessary for pmemd?
> > >
> > > Thanks!
> > > Jason
> > >
> > > On Sat, Dec 4, 2010 at 11:33 PM, Ross Walker <ross.rosswalker.co.uk>
> > > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > Works fine for me. The files I used to build, along with my
> > > > environment config files, are attached.
> > > >
> > > > Here is what I did:
> > > >
> > > > tar xvjf AmberTools-1.4.tar.bz2
> > > > tar xvjf Amber11.tar.bz2
> > > > cd $AMBERHOME
> > > > wget http://ambermd.org/bugfixes/AmberTools/1.4/bugfix.all
> > > > patch -p0 < bugfix.all
> > > > rm -f bugfix.all
> > > > wget http://ambermd.org/bugfixes/11.0/bugfix.all
> > > > wget http://ambermd.org/bugfixes/apply_bugfix.x
> > > > chmod 755 apply_bugfix.x
> > > > ./apply_bugfix.x bugfix.all
> > > > cd AmberTools/src/
> > > > ./configure -cuda -mpi intel
> > > > cd ../../src
> > > > make cuda_parallel
> > > >
> > > > cd ~/
> > > > mkdir parallel_fail
> > > > cd parallel_fail
> > > > tar xvzf ../parallel_fail.tgz
> > > >
> > > > qsub -I -l walltime=0:30:00 -q Lincoln_debug
> > > >
> > > > cd parallel_fail
> > > >
> > > > mpirun -np 2 ~/amber11/bin/pmemd.cuda.MPI -O \
> > > >   -p hairpin_0.mbondi2.parm7 -ref hairpin_0.mbondi2.heat.rst7 \
> > > >   -c hairpin_0.mbondi2.heat.rst7 < /dev/null
> > > >
> > > > Output file is attached.
> > > >
> > > > All the best
> > > > Ross
> > > >
> > > > > -----Original Message-----
> > > > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > > > Sent: Saturday, December 04, 2010 3:21 PM
> > > > > To: AMBER Developers Mailing List
> > > > > Subject: [AMBER-Developers] more pmemd.cuda.MPI issues
> > > > >
> > > > > Hello,
> > > > >
> > > > > I ran a GB simulation on NCSA Lincoln using 2 GPUs with a standard
> > > > > nucleic acid system, and every energy term was ***********. Running
> > > > > in serial, all results were reasonable. I've attached the mdin,
> > > > > restart, and prmtop files for this error.
> > > > >
> > > > > All the best,
> > > > > Jason
> > > > >
> > > > > --
> > > > > Jason M. Swails
> > > > > Quantum Theory Project,
> > > > > University of Florida
> > > > > Ph.D. Graduate Student
> > > > > 352-392-4032
> > > >
> > >
> > >
> > > --
> > > Jason M. Swails
> > > Quantum Theory Project,
> > > University of Florida
> > > Ph.D. Graduate Student
> > > 352-392-4032
> > >
> >
> >
> >
> > --
> > Jason M. Swails
> > Quantum Theory Project,
> > University of Florida
> > Ph.D. Graduate Student
> > 352-392-4032


_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sun Dec 05 2010 - 21:30:02 PST