Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: David Cerutti <dscerutti.gmail.com>
Date: Fri, 15 Jun 2018 17:46:40 -0400

If I turn up the energy reporting rate, the problem doesn't happen
immediately. The first symptom is that kinetic energy goes to NaN, then in
subsequent steps (which can be quite a few, in fact) other aspects of the
energy diagnostics turn to huge numbers or NaN. I will keep on
investigating.
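
For anyone who wants to pin down the first bad step in their own output, here
is a minimal sketch of a scanner. It assumes the energy lines look like the
"Etot = ... EKtot = ... EPtot = ..." lines quoted further down in this thread.
The function name and the sample lines are hypothetical, made up for
illustration, and this is not part of pmemd or any AMBER tool.

```python
import math
import re

# Matches "NAME = value" pairs in the style of the energy lines quoted in this thread.
FIELD = re.compile(r"(Etot|EKtot|EPtot)\s*=\s*(\S+)")

def first_bad_energy(lines):
    """Return (line_index, [field names]) for the first line that holds a
    non-finite or unparseable energy value, or None if every value is finite."""
    for i, line in enumerate(lines):
        bad = []
        for name, text in FIELD.findall(line):
            try:
                value = float(text)  # float() accepts "NaN" and "inf"
            except ValueError:
                # Text that is not a number at all, e.g. an overflowed print field
                bad.append(name)
                continue
            if not math.isfinite(value):
                bad.append(name)
        if bad:
            return i, bad
    return None

# Hypothetical sample: the second line is invented to mimic the symptom described.
sample = [
    "Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447",
    "Etot = NaN EKtot = NaN EPtot = -3371149.3447",
]
print(first_bad_energy(sample))
```

Reporting every bad field on the first bad line, rather than only the first
one, makes it easier to see whether the kinetic term goes bad while the
potential energy is still finite.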

Dave


On Fri, Jun 15, 2018 at 1:18 PM, Wesley Michael Botello-Smith <
wmsmith.uci.edu> wrote:

> We received this same error when running a large ~million atom system on
> our GTX-980 cards. The only way we seemed able to get ours to run
> continuously was to significantly reduce the timestep (for us that was
> about dt = 1.5 fs instead of 2.0 fs).
>
> On Fri, Jun 15, 2018 at 10:08 AM, David Cerutti <dscerutti.gmail.com>
> wrote:
>
> > I am also able to reproduce this with a GTX-1080Ti. Haven't yet seen it
> > with a GP100 but I am still looking. I will run the memory checker to see
> > what might be the problem.
> >
> > Dave
> >
> >
> > On Fri, Jun 15, 2018 at 3:07 AM, Gerald Monard <
> > Gerald.Monard.univ-lorraine.fr> wrote:
> >
> > > Hello,
> > >
> > > On P100, amber18 with gcc-5.4.0 and cuda-8.0, same behavior:
> > > 3.0: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.1: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.2: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.3: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.4: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.5: 3.6: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.7: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.8: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > > 3.9: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > >
> > > cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
> > >
> > > Gerald.
> > >
> > >
> > > On 06/15/2018 04:01 AM, Ke Li wrote:
> > > > To confirm: the same issue is observed with and without P2P on Tesla V100,
> > > > CUDA9.2.88+R396.26.
> > > >
> > > > The different Etot = -2707218.6220 and Etot = -2709883.4871 are expected
> > > > because P2P and non-P2P could generate different reductions.
> > > >
> > > > -----Original Message-----
> > > > From: David A Case <david.case.rutgers.edu>
> > > > Sent: Thursday, June 14, 2018 5:43 PM
> > > > To: AMBER Developers Mailing List <amber-developers.ambermd.org>
> > > > Subject: Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs
> > > >
> > > > On Thu, Jun 14, 2018, Ross Walker wrote:
> > > >>
> > > >> I keep seeing failures with AMBER 18 when running GPU validation
> > > >> tests.
> > > >
> > > > Ross: I'm not used to looking at these sorts of logs. Can you summarize
> > > > a bit:
> > > >
> > > > 1. Does the problem ever happen in serial runs, or only in parallel?
> > > >
> > > > 2. Are you getting "just" crashes (illegal memory access/failed sync,
> > > > etc.), or do you get jobs that appear to finish OK but give the wrong
> > > > result? That is, are jobs that report Etot = -2707218.6220 really supposed
> > > > to be the same as the ones that report Etot = -2709883.4871?
> > > >
> > > > ...thx...dac
> > > >
> > > >
> > > > _______________________________________________
> > > > AMBER-Developers mailing list
> > > > AMBER-Developers.ambermd.org
> > > > http://lists.ambermd.org/mailman/listinfo/amber-developers
> > >
> > > --
> > > ____________________________________________________________
> > > ________________
> > >
> > > Prof. Gerald MONARD
> > > Directeur du mésocentre EXPLOR
> > > Université de Lorraine
> > > Boulevard des Aiguillettes B.P. 70239
> > > F-54506 Vandoeuvre-les-Nancy, FRANCE
> > >
> > > e-mail : Gerald.Monard.univ-lorraine.fr
> > > phone : +33 (0)372.745.279
> > > mobile : +33 (0)678.006.443
> > > web : http://www.monard.info
> > >
> > > ____________________________________________________________
> > > ________________
> > >
> > >
Received on Fri Jun 15 2018 - 15:00:02 PDT