Re: [AMBER-Developers] more pmemd.cuda.MPI issues

From: Jason Swails <jason.swails.gmail.com>
Date: Wed, 8 Dec 2010 11:34:46 -0500

On Wed, Dec 8, 2010 at 11:25 AM, Scott Le Grand <SLeGrand.nvidia.com> wrote:

> Ross's analysis is likely correct. In that case, the positions across
> multiple GPUs would become decoupled, and from there, madness would ensue...
>

Asterisk madness, that is.
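
For anyone who hasn't seen it before, the asterisks are just Fortran flagging a
formatted-output field overflow: when a value no longer fits its edit
descriptor, the whole field is printed as '*' characters rather than a
truncated number. A minimal standalone illustration of the effect (not pmemd
code):

      program asterisk_demo
        ! A value too large for its format width prints as a field of
        ! asterisks, which is what a blown-up energy looks like in mdout.
        implicit none
        double precision :: eptot
        eptot = 1.0d20                 ! e.g. an energy after the replicas diverge
        write(*, '(a,f14.4)') ' EPtot = ', eptot   ! prints ' EPtot = **************'
      end program asterisk_demo

So the asterisks themselves are harmless; they just mean the energies blew up.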


>
>
> -----Original Message-----
> From: Jason Swails [mailto:jason.swails.gmail.com]
> Sent: Wednesday, December 08, 2010 07:57
> To: AMBER Developers Mailing List
> Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
>
> Works fine now.
>
> On Mon, Dec 6, 2010 at 12:20 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
> > Hi Jason,
> >
> > Okay, I modified the code and pushed it to git. Please try building again
> > with NO_NTT3_SYNC still enabled and see if the problem on GPUs goes away.
> > The results should match a regular build (without NO_NTT3_SYNC), so if you
> > can check that, it would be very helpful.
> >
> > All the best
> > Ross
> >
> > > -----Original Message-----
> > > From: Ross Walker [mailto:ross.rosswalker.co.uk]
> > > Sent: Sunday, December 05, 2010 8:25 PM
> > > To: 'AMBER Developers Mailing List'
> > > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> > >
> > > Hi Jason,
> > >
> > > Okay, that is VERY weird, since the code related to NO_NTT3_SYNC is not
> > > used when running on GPUs and so should have no effect whatsoever; on the
> > > GPU it uses its own random number stream. My guess is thus that setting
> > > different random seeds on each MPI thread when NO_NTT3_SYNC is defined is
> > > what is causing the problem. I'll modify this now so it has no effect when
> > > running on the GPU.
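> > >
> > > Conceptually the change is just to skip the per-thread reseeding whenever
> > > the code is compiled for the GPU; it amounts to something like the
> > > following (a sketch only, assuming the CUDA symbol the GPU build defines,
> > > and using illustrative routine and variable names rather than the actual
> > > pmemd source):
> > >
> > > #if defined(NO_NTT3_SYNC) && !defined(CUDA)
> > >       ! CPU-only path: give each MPI thread its own seed so the ntt=3
> > >       ! random streams are deliberately desynchronized
> > >       call amrset(ig + mytaskid)
> > > #else
> > >       ! default path and all GPU builds: one common CPU-side seed; the
> > >       ! GPU seeds its own random number stream separately
> > >       call amrset(ig)
> > > #endif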
> > >
> > > All the best
> > > Ross
> > >
> > > > -----Original Message-----
> > > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > > Sent: Sunday, December 05, 2010 1:38 PM
> > > > To: AMBER Developers Mailing List
> > > > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> > > >
> > > > Looks like it's related to the -DNO_NTT3_SYNC flag. I got the same garbage
> > > > results using OpenMPI with -DNO_NTT3_SYNC, but turning off that flag brought
> > > > me back down to what Ross was getting. I'll see if this was the cause of the
> > > > problems that Bill was seeing before.
> > > >
> > > > One thing worth considering is to make that flag fatal for CUDA builds,
> > > > though it's not really documented anyway. In any case, I thought I would
> > > > follow up. I'll report back on the performance of mvapich2 (the default on
> > > > Lincoln) without -DNO_NTT3_SYNC.
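> > > >
> > > > If we did want to make the combination fatal, a compile-time guard would
> > > > probably be enough; roughly (a sketch, again assuming the CUDA symbol that
> > > > the GPU build defines):
> > > >
> > > > #if defined(CUDA) && defined(NO_NTT3_SYNC)
> > > > #error "NO_NTT3_SYNC is not supported by the CUDA build; remove the flag and reconfigure"
> > > > #endif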
> > > >
> > > > All the best,
> > > > Jason
> > > >
> > > > On Sun, Dec 5, 2010 at 2:25 PM, Scott Le Grand <SLeGrand.nvidia.com> wrote:
> > > >
> > > > > Can you try installing the latest OpenMPI and use that instead? I am seeing
> > > > > all sorts of sensitivity to MPI libraries and even specific builds of them.
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > > > Sent: Sunday, December 05, 2010 11:13
> > > > > To: AMBER Developers Mailing List
> > > > > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> > > > >
> > > > > Hi Ross,
> > > > >
> > > > > A couple of differences between our config.h files: it doesn't appear that
> > > > > you set MPI_HOME, and where you have -I/include, I have
> > > > > -I/usr/local/mvapich2-1.2-intel-ofed-1.2.5.5/include . Also, I set
> > > > > -DNO_NTT3_SYNC; would this break things? Using my config.h file, I'm
> > > > > getting 20 ns/day in serial (compared to your 23), and in parallel I was
> > > > > getting junk at a rate of ~35 ns/day, which is considerably different from
> > > > > your 23.
> > > > >
> > > > > I'm trying again without -DNO_NTT3_SYNC, but I'm curious what effect not
> > > > > setting MPI_HOME has on your build, although the Fortran compiler should
> > > > > be picking up the mpif.h includes... Is MPI_HOME completely unnecessary
> > > > > for pmemd?
> > > > >
> > > > > Thanks!
> > > > > Jason
> > > > >
> > > > > On Sat, Dec 4, 2010 at 11:33 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > Works fine for me. The files I used to build, along with my environment
> > > > > > config files, are attached.
> > > > > >
> > > > > > I did:
> > > > > >
> > > > > > tar xvjf AmberTools-1.4.tar.bz2
> > > > > > tar xvjf Amber11.tar.bz2
> > > > > > cd $AMBERHOME
> > > > > > wget http://ambermd.org/bugfixes/AmberTools/1.4/bugfix.all
> > > > > > patch -p0 < bugfix.all
> > > > > > rm -f bugfix.all
> > > > > > wget http://ambermd.org/bugfixes/11.0/bugfix.all
> > > > > > wget http://ambermd.org/bugfixes/apply_bugfix.x
> > > > > > chmod 755 apply_bugfix.x
> > > > > > ./apply_bugfix.x bugfix.all
> > > > > > cd AmberTools/src/
> > > > > > ./configure -cuda -mpi intel
> > > > > > cd ../../src
> > > > > > make cuda_parallel
> > > > > >
> > > > > > cd ~/
> > > > > > mkdir parallel_fail
> > > > > > cd parallel_fail
> > > > > > tar xvzf ../parallel_fail.tgz
> > > > > >
> > > > > > qsub -I -l walltime=0:30:00 -q Lincoln_debug
> > > > > >
> > > > > > cd parallel_fail
> > > > > >
> > > > > > mpirun -np 2 ~/amber11/bin/pmemd.cuda.MPI -O -p hairpin_0.mbondi2.parm7 -ref hairpin_0.mbondi2.heat.rst7 -c hairpin_0.mbondi2.heat.rst7 </dev/null
> > > > > >
> > > > > > Output file is attached.
> > > > > >
> > > > > > All the best
> > > > > > Ross
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > > > > > Sent: Saturday, December 04, 2010 3:21 PM
> > > > > > > To: AMBER Developers Mailing List
> > > > > > > Subject: [AMBER-Developers] more pmemd.cuda.MPI issues
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I ran a GB simulation on NCSA Lincoln using 2 GPUs with a standard
> > > > > > > nucleic acid system, and every energy term was ***********. Running in
> > > > > > > serial, all results were reasonable. I've attached the mdin, restart, and
> > > > > > > prmtop files for this error.
> > > > > > >
> > > > > > > All the best,
> > > > > > > Jason
> > > > > > >
> > > > > > > --
> > > > > > > Jason M. Swails
> > > > > > > Quantum Theory Project,
> > > > > > > University of Florida
> > > > > > > Ph.D. Graduate Student
> > > > > > > 352-392-4032
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jason M. Swails
> > > > > Quantum Theory Project,
> > > > > University of Florida
> > > > > Ph.D. Graduate Student
> > > > > 352-392-4032
> > > >
> > > >
> > > >
> > > > --
> > > > Jason M. Swails
> > > > Quantum Theory Project,
> > > > University of Florida
> > > > Ph.D. Graduate Student
> > > > 352-392-4032
> > >
> > >
> >
> >
> >
>
>
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Graduate Student
> 352-392-4032
>
>



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Dec 08 2010 - 09:00:02 PST