Re: [AMBER-Developers] A40 pmemd CUDA MPI

From: Scott Le Grand <varelse2005.gmail.com>
Date: Sat, 10 Jul 2021 16:38:49 -0700

This looks like a bug of some sort to me:

/projects2/joao.ribeiro/amber_lbsr/amber_lbsr/amber20/test/cuda/chamber/dhfr
66c66
< Etot = -3050.6670 EKtot = 2230.2661 EPtot = -5280.9331
> Etot = -10504.5648 EKtot = 2230.2661 EPtot = -12734.8309
70c70
< EELEC = -10036.4148 EGB = -2483.6659 RESTRAINT = 0.
> EELEC = -10036.4148 EGB = -9937.5636 RESTRAINT = 0.

That is not roundoff error...
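
A quick arithmetic check on the numbers above, as a minimal sketch. The values are copied by hand from the diff, and the names `lt` and `gt` (for the `<` and `>` sides) are just labels chosen here, not anything from the test suite. It subtracts the two sides term by term to see where the gap sits:

```python
import math

# Energies copied from the diff above; "lt" is the "<" side, "gt" the ">" side.
lt = {"Etot": -3050.6670, "EKtot": 2230.2661, "EPtot": -5280.9331,
      "EELEC": -10036.4148, "EGB": -2483.6659}
gt = {"Etot": -10504.5648, "EKtot": 2230.2661, "EPtot": -12734.8309,
      "EELEC": -10036.4148, "EGB": -9937.5636}

# Each side should be internally consistent: Etot = EKtot + EPtot.
for side in (lt, gt):
    assert math.isclose(side["Etot"], side["EKtot"] + side["EPtot"], abs_tol=1e-3)

# Term-by-term difference between the two outputs.
delta = {name: gt[name] - lt[name] for name in lt}
for name, d in delta.items():
    print(f"{name:6s} delta = {d:12.4f}")

# True if the whole Etot gap is accounted for by the EGB term alone
# (to within the four decimals printed in the output files).
egb_explains_gap = math.isclose(delta["Etot"], delta["EGB"], abs_tol=1e-3)
print("EGB accounts for the Etot gap:", egb_explains_gap)
```

With these values the Etot gap (about -7453.9) matches the EGB gap to within print precision, while EKtot and EELEC agree exactly, so the discrepancy sits entirely in the GB term and is far larger than anything rounding could produce.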


On Fri, Jul 9, 2021 at 11:31 AM Charles Lin <charles.lin.roivant.com> wrote:

> Hi Dave,
>
> I attached a diff and log file. From what I can tell almost everything
> non-remd fails. We’ve tried multiple different MPI builds (including
> building one against the system), and I’m testing these in DPFP.
>
> -Charlie
>
> From: David Cerutti <dscerutti.gmail.com>
> Reply-To: AMBER Developers Mailing List <amber-developers.ambermd.org>
> Date: Friday, July 9, 2021 at 11:30 AM
> To: AMBER Developers Mailing List <amber-developers.ambermd.org>
> Subject: Re: [AMBER-Developers] A40 pmemd CUDA MPI
>
> This smells like a random numbers thing. I may have some time in the
> coming week to look into it, but I sure don't have an A40 in my hands yet.
> Are the issues spread throughout NVE, NPT, NTT tests, GB as well as PME
> setups? From your mail it looks like some (but not all) of the non-REMD
> PME tests are failing, and the non-REMD GB tests are failing in the kinetic
> energies from step 1 onward. Do any PME non-REMD tests pass? Are you
> running the tests in DPFP or SPFP mode?
>
> Dave
>
>
> On Fri, Jul 9, 2021 at 11:17 AM Charles Lin <charles.lin.roivant.com> wrote:
>
> > Hi all,
> >
> > I was wondering if anyone has tried running CUDA MPI on the NVIDIA A40
> > cards. I’m currently using CUDA 11.0 with AMD CPUs. I’ve gotten the
> > following to pass:
> > pmemd
> > pmemd.MPI
> > pmemd.cuda
> >
> > It seems all REMD passes for pmemd.cuda.MPI, but for non-REMD jobs the
> > tests fail. The issue seems to stem from the kinetic energies for some
> > tests and the EGB+Kinetic Energies for GB tests (all other energy terms
> > including potential energy look fine in step 1). The velocities are coming
> > out different, so I’m wondering if it’s an MPI issue in the CUDA code (?),
> > but I’m not well-versed in that part of the code, so was wondering if
> > someone could investigate that.
> >
> > Thanks!
> > Charlie
> > _______________________________________________
> > AMBER-Developers mailing list
> > AMBER-Developers.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber-developers
> >
Received on Sat Jul 10 2021 - 17:00:02 PDT