Re: [AMBER-Developers] more pmemd.cuda.MPI issues

From: Jason Swails <jason.swails.gmail.com>
Date: Wed, 8 Dec 2010 10:57:19 -0500

Works fine now.

On Mon, Dec 6, 2010 at 12:20 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Jason,
>
> Okay, I modified the code and pushed it in git. Please try building again
> with NO_NTT3_SYNC still enabled and see if the problem on GPUs goes away.
> The results should match a regular build (without the NO_NTT3_SYNC) so if
> you can check that it would be very helpful.
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: Ross Walker [mailto:ross.rosswalker.co.uk]
> > Sent: Sunday, December 05, 2010 8:25 PM
> > To: 'AMBER Developers Mailing List'
> > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> >
> > Hi Jason,
> >
> > Okay, that is VERY weird, since the code related to NO_NTT3_SYNC is
> > not used when running on GPUs and so should have no effect whatsoever.
> > On the GPU it uses its own random number stream. My guess is thus that
> > setting different random seeds on each MPI thread when using
> > NO_NTT3_SYNC is what is causing the problem. I'll modify this now so
> > it has no effect when running on the GPU.
> >
> > All the best
> > Ross
> >
> > > -----Original Message-----
> > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > Sent: Sunday, December 05, 2010 1:38 PM
> > > To: AMBER Developers Mailing List
> > > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> > >
> > > Looks like it's related to the -DNO_NTT3_SYNC flag. I got the same
> > > garbage results using OpenMPI with -DNO_NTT3_SYNC, but turning off
> > > that flag brought me down to the numbers Ross was getting. I'll see
> > > if this was the cause of the problems that Bill was seeing before.
> > >
> > > One thing worth considering is making that flag fatal for CUDA
> > > builds, but it's not really documented anyway. In any case, I
> > > thought I would follow up. I'll report back on the performance of
> > > mvapich2 (the default on Lincoln) without -DNO_NTT3_SYNC.
> > >
> > > All the best,
> > > Jason
> > >
> > > On Sun, Dec 5, 2010 at 2:25 PM, Scott Le Grand <SLeGrand.nvidia.com>
> > > wrote:
> > >
> > > > Can you try installing the latest OpenMPI and use that instead? I
> > > > am seeing all sorts of sensitivity to MPI libraries and even
> > > > specific builds of them.
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > > Sent: Sunday, December 05, 2010 11:13
> > > > To: AMBER Developers Mailing List
> > > > Subject: Re: [AMBER-Developers] more pmemd.cuda.MPI issues
> > > >
> > > > Hi Ross,
> > > >
> > > > A couple of differences between our config.h files: it doesn't
> > > > appear that you set MPI_HOME. Where you have -I/include, I have
> > > > -I/usr/local/mvapich2-1.2-intel-ofed-1.2.5.5/include . Also, I set
> > > > -DNO_NTT3_SYNC; would this break things? Using my config.h file,
> > > > I'm getting 20 ns/day in serial (compared to your 23), and in
> > > > parallel I was getting junk at a rate of ~35 ns/day, which is
> > > > considerably different from your 23.
> > > >
> > > > I'm trying again without -DNO_NTT3_SYNC, but I'm curious as to
> > > > what effect not setting MPI_HOME has on your build, although the
> > > > Fortran compiler should be picking up the mpif.h includes... Is
> > > > MPI_HOME completely unnecessary for pmemd?
> > > >
> > > > Thanks!
> > > > Jason
> > > >
> > > > On Sat, Dec 4, 2010 at 11:33 PM, Ross Walker <ross.rosswalker.co.uk>
> > > > wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > Works fine for me. The files I used to build, along with my
> > > > > environment config files, are attached.
> > > > >
> > > > > I did:
> > > > >
> > > > > tar xvjf AmberTools-1.4.tar.bz2
> > > > > tar xvjf Amber11.tar.bz2
> > > > > cd $AMBERHOME
> > > > > wget http://ambermd.org/bugfixes/AmberTools/1.4/bugfix.all
> > > > > patch -p0 < bugfix.all
> > > > > rm -f bugfix.all
> > > > > wget http://ambermd.org/bugfixes/11.0/bugfix.all
> > > > > wget http://ambermd.org/bugfixes/apply_bugfix.x
> > > > > chmod 755 apply_bugfix.x
> > > > > ./apply_bugfix.x bugfix.all
> > > > > cd AmberTools/src/
> > > > > ./configure -cuda -mpi intel
> > > > > cd ../../src
> > > > > make cuda_parallel
> > > > >
> > > > > cd ~/
> > > > > mkdir parallel_fail
> > > > > cd parallel_fail
> > > > > tar xvzf ../parallel_fail.tgz
> > > > >
> > > > > qsub -I -l walltime=0:30:00 -q Lincoln_debug
> > > > >
> > > > > cd parallel_fail
> > > > >
> > > > > mpirun -np 2 ~/amber11/bin/pmemd.cuda.MPI -O \
> > > > >   -p hairpin_0.mbondi2.parm7 -ref hairpin_0.mbondi2.heat.rst7 \
> > > > >   -c hairpin_0.mbondi2.heat.rst7 </dev/null
> > > > >
> > > > > Output file is attached.
> > > > >
> > > > > All the best
> > > > > Ross
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jason Swails [mailto:jason.swails.gmail.com]
> > > > > > Sent: Saturday, December 04, 2010 3:21 PM
> > > > > > To: AMBER Developers Mailing List
> > > > > > Subject: [AMBER-Developers] more pmemd.cuda.MPI issues
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I ran a GB simulation on NCSA Lincoln using 2 GPUs with a
> > > > > > standard nucleic acid system, and every energy term was
> > > > > > ***********. Running in serial, all results were reasonable.
> > > > > > I've attached the mdin, restart, and prmtop files for this
> > > > > > error.
> > > > > >
> > > > > > All the best,
> > > > > > Jason
> > > > > >
> > > > > > --
> > > > > > Jason M. Swails
> > > > > > Quantum Theory Project,
> > > > > > University of Florida
> > > > > > Ph.D. Graduate Student
> > > > > > 352-392-4032
> > > > >
> > > > > _______________________________________________
> > > > > AMBER-Developers mailing list
> > > > > AMBER-Developers.ambermd.org
> > > > > http://lists.ambermd.org/mailman/listinfo/amber-developers
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Jason M. Swails
> > > > Quantum Theory Project,
> > > > University of Florida
> > > > Ph.D. Graduate Student
> > > > 352-392-4032
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jason M. Swails
> > > Quantum Theory Project,
> > > University of Florida
> > > Ph.D. Graduate Student
> > > 352-392-4032
> >
> >
>
>
>



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Dec 08 2010 - 08:00:06 PST