Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: David Cerutti <dscerutti.gmail.com>
Date: Fri, 15 Jun 2018 13:08:46 -0400

I am also able to reproduce this with a GTX-1080Ti. Haven't yet seen it
with a GP100 but I am still looking. I will run the memory checker to see
what might be the problem.

Dave


On Fri, Jun 15, 2018 at 3:07 AM, Gerald Monard <Gerald.Monard.univ-lorraine.fr> wrote:

> Hello,
>
> On P100, amber18 with gcc-5.4.0 and cuda-8.0, same behavior:
> 3.0: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.1: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.2: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.3: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.4: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.5: 3.6: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.7: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.8: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
> 3.9: Etot = -2707313.8447 EKtot = 663835.5000 EPtot = -3371149.3447
>
> cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
>
> Gerald.
>
>
> On 06/15/2018 04:01 AM, Ke Li wrote:
> > I can confirm that the same issue is observed, both with and without P2P,
> > on Tesla V100 with CUDA 9.2.88 + R396.26.
> >
> > The different values, Etot = -2707218.6220 and Etot = -2709883.4871, are
> > expected because P2P and non-P2P runs could generate different reductions.
> >
> > -----Original Message-----
> > From: David A Case <david.case.rutgers.edu>
> > Sent: Thursday, June 14, 2018 5:43 PM
> > To: AMBER Developers Mailing List <amber-developers.ambermd.org>
> > Subject: Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs
> >
> > On Thu, Jun 14, 2018, Ross Walker wrote:
> >>
> >> I keep seeing failures with AMBER 18 when running GPU validation
> >> tests.
> >
> > Ross: I'm not used to looking at these sorts of logs. Can you summarize
> > a bit:
> >
> > 1. Does the problem ever happen in serial runs, or only in parallel?
> >
> > 2. Are you getting "just" crashes (illegal memory access/failed sync,
> > etc.), or do you get jobs that appear to finish OK but give the wrong
> > result? That is, are jobs that report Etot = -2707218.6220 really supposed
> > to be the same as the ones that report Etot = -2709883.4871?
> >
> > ...thx...dac
> >
> >
> > _______________________________________________
> > AMBER-Developers mailing list
> > AMBER-Developers.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber-developers
> >
>
> --
> ____________________________________________________________________________
>
> Prof. Gerald MONARD
> Directeur du mésocentre EXPLOR
> Université de Lorraine
> Boulevard des Aiguillettes B.P. 70239
> F-54506 Vandoeuvre-les-Nancy, FRANCE
>
> e-mail : Gerald.Monard.univ-lorraine.fr
> phone : +33 (0)372.745.279
> mobile : +33 (0)678.006.443
> web : http://www.monard.info
>
> ____________________________________________________________________________
>
>
Received on Fri Jun 15 2018 - 10:30:02 PDT