Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: Wesley Michael Botello-Smith <wmsmith.uci.edu>
Date: Fri, 15 Jun 2018 10:18:31 -0700

We received this same error when running a large (~1 million atom) system on
our GTX-980 cards. The only way we could get ours to run continuously was to
significantly reduce the timestep (for us, about dt = 1.5 fs instead of
2.0 fs).

On Fri, Jun 15, 2018 at 10:08 AM, David Cerutti <dscerutti.gmail.com> wrote:

> I am also able to reproduce this with a GTX-1080Ti. Haven't yet seen it
> with a GP100 but I am still looking. I will run the memory checker to see
> what might be the problem.
>
> Dave
>
>
> On Fri, Jun 15, 2018 at 3:07 AM, Gerald Monard <
> Gerald.Monard.univ-lorraine.fr> wrote:
>
> > Hello,
> >
> > On P100, amber18 with gcc-5.4.0 and cuda-8.0, same behavior:
> > 3.0: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.1: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.2: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.3: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.4: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.5: 3.6: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.7: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.8: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> > 3.9: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> >
> > cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
> >
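
A listing like the one above can be checked mechanically. The following is a minimal sketch, assuming each run prints its label and energies on a single line (the wrapping in the mail is from the mail client) and assuming that repeated runs of the same input are expected to report identical Etot values. The names `RUN` and `check_runs` are hypothetical helpers, not part of AMBER or of the validation test.

```python
import re

# Three lines taken from the listing above, unwrapped to one run per line.
LOG = """\
3.4: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
3.5: 3.6: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
3.7: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
"""

# A run label such as "3.5:" optionally followed by "Etot = <value>".
RUN = re.compile(r"(\d+\.\d+):\s*(?:Etot\s*=\s*(-?\d+\.\d+))?")


def check_runs(log_text):
    """Return (Etot string per run, labels of runs that printed no Etot)."""
    etot_by_run = {}
    missing = []
    for label, etot in RUN.findall(log_text):
        if etot:
            etot_by_run[label] = etot  # keep the printed string for exact comparison
        else:
            missing.append(label)      # label present, energies absent
    return etot_by_run, missing


etot_by_run, missing = check_runs(LOG)
distinct = set(etot_by_run.values())

print("runs without energies:", missing)
print("distinct Etot values :", sorted(distinct))
```

A run that shows up in `missing` printed its label but no energies (as 3.5 does here), and more than one entry in `distinct` would mean that runs which completed disagree with each other.
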
> > Gerald.
> >
> >
> > On 06/15/2018 04:01 AM, Ke Li wrote:
> > > To confirm, the same issue is observed both with and without P2P on
> > > Tesla V100 with CUDA 9.2.88 + R396.26.
> > >
> > > The different values Etot = -2707218.6220 and Etot = -2709883.4871 are
> > > expected because P2P and non-P2P runs can produce different reductions.
> > >
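
On the point about reductions: floating-point addition is not associative, so summing the same terms in a different grouping can change the last bits of the result. The sketch below is a generic illustration of that effect only. It does not reproduce how pmemd.cuda accumulates energies, and the function names and the three sample values are made up for the example.

```python
def sequential_sum(values):
    """Add the values left to right, as a single accumulator would."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_sum(values, split):
    """Reduce two shares separately, then combine the two partial sums."""
    return sequential_sum(values[:split]) + sequential_sum(values[split:])


values = [0.1, 0.2, 0.3]        # illustrative terms, not simulation data

one_pass = sequential_sum(values)   # (0.1 + 0.2) + 0.3
two_part = split_sum(values, 1)     # 0.1 + (0.2 + 0.3)

print(repr(one_pass))
print(repr(two_part))
print("difference:", one_pass - two_part)
```

In a molecular dynamics run a difference of this size in the forces is amplified step by step, so two runs that differ only in reduction order can end at visibly different energies without either being wrong. Whether that fully explains the two Etot values quoted above is not something this sketch can show.
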
> > > -----Original Message-----
> > > From: David A Case <david.case.rutgers.edu>
> > > Sent: Thursday, June 14, 2018 5:43 PM
> > > To: AMBER Developers Mailing List <amber-developers.ambermd.org>
> > > Subject: Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs
> > >
> > > On Thu, Jun 14, 2018, Ross Walker wrote:
> > >>
> > >> I keep seeing failures with AMBER 18 when running GPU validation
> > >> tests.
> > >
> > > > Ross: I'm not used to looking at these sorts of logs. Can you summarize
> > > > a bit:
> > > >
> > > > 1. Does the problem ever happen in serial runs, or only in parallel?
> > > >
> > > > 2. Are you getting "just" crashes (illegal memory access/failed sync,
> > > > etc.), or do you get jobs that appear to finish OK but give the wrong
> > > > result? That is, are jobs that report Etot = -2707218.6220 really
> > > > supposed to be the same as the ones that report Etot = -2709883.4871?
> > >
> > > ...thx...dac
> > >
> > >
> > > _______________________________________________
> > > AMBER-Developers mailing list
> > > AMBER-Developers.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber-developers
> > > ------------------------------------------------------------
> > -----------------------
> > > This email message is for the sole use of the intended recipient(s) and
> > may contain
> > > confidential information. Any unauthorized review, use, disclosure or
> > distribution
> > > is prohibited. If you are not the intended recipient, please contact
> > the sender by
> > > reply email and destroy all copies of the original message.
> > > ------------------------------------------------------------
> > -----------------------
> > >
> > > _______________________________________________
> > > AMBER-Developers mailing list
> > > AMBER-Developers.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber-developers
> > >
> >
> > --
> > ____________________________________________________________________________
> >
> > Prof. Gerald MONARD
> > Directeur du mésocentre EXPLOR
> > Université de Lorraine
> > Boulevard des Aiguillettes B.P. 70239
> > F-54506 Vandoeuvre-les-Nancy, FRANCE
> >
> > e-mail : Gerald.Monard.univ-lorraine.fr
> > phone : +33 (0)372.745.279
> > mobile : +33 (0)678.006.443
> > web : http://www.monard.info
> >
> > ____________________________________________________________________________
> >
> >
Received on Fri Jun 15 2018 - 10:30:03 PDT