Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: David Cerutti <dscerutti.gmail.com>
Date: Sat, 16 Jun 2018 19:01:47 -0400

Recovering from this odd summertime cold...

Having spoken to Ke and found that cuda-memcheck is not very useful for
pinpointing this at the system size we need in order to make the
problem appear, I've been backtracking this problem through the git
history. It seems that even as early as the end of April 2017, when I had
made no changes but merely reformatted, the code already shows this problem.
I will continue to backtrack to see if I can find the point at which this
started, but the problem seems to go pretty far back.
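
Since the crash is reported at roughly 1 in 10 runs, a single clean run
of an old commit says little about whether that commit is free of the bug.
A minimal sketch of how many clean runs per commit would be needed before
calling it "good", assuming crashes are independent from run to run and the
10% rate is about right (both are assumptions; runs_needed is a hypothetical
helper, not part of AMBER):

```python
def runs_needed(crash_rate: float, miss_prob: float) -> int:
    """Smallest n with (1 - crash_rate)**n <= miss_prob.

    crash_rate: assumed per-run probability of a crash on a buggy commit.
    miss_prob:  acceptable chance that a buggy commit survives all n runs
                and gets mislabeled as good.
    """
    if not (0.0 < crash_rate < 1.0 and 0.0 < miss_prob < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    survive = 1.0 - crash_rate   # chance that one run completes cleanly
    n = 1
    p_all_clean = survive        # chance that n runs in a row are all clean
    while p_all_clean > miss_prob:
        n += 1
        p_all_clean *= survive
    return n


if __name__ == "__main__":
    # Assumed 10% crash rate, taken from the "maybe only 1 in 10 runs" estimate.
    for miss in (0.05, 0.01):
        print(f"miss probability {miss}: {runs_needed(0.1, miss)} clean runs")
```

Under those assumptions, about 29 clean runs are needed to get the chance of
mislabeling a buggy commit below 5%, and 44 to get it below 1%.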

Dave


On Fri, Jun 15, 2018 at 8:11 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Dave,
>
> It happens on both serial (single GPU) and parallel (2 x GPU) runs. One of
> two things happens. It either runs to completion and gives the expected
> result or it crashes immediately after printing the Results header in mdout
> with a CUDA illegal memory access. It seems to be random whether it
> crashes or not and the crashes are maybe only 1 in 10 runs. If it doesn't
> crash it always seems to give the same result. It also doesn't crash with
> the small test or the very large test. Thus I think it is likely an array
> out of bounds issue that only occasionally walks on protected memory and
> depends on the simulation size / parameters. Valgrind might be able to
> track it down.
>
> I've tried it with multiple driver versions so it doesn't look like a
> driver issue and it happens on multiple machines so it isn't a hardware
> issue.
>
> All the best
> Ross
>
> > On Jun 14, 2018, at 20:43, David A Case <david.case.rutgers.edu> wrote:
> >
> > On Thu, Jun 14, 2018, Ross Walker wrote:
> >>
> >> I keep seeing failures with AMBER 18 when running GPU validation
> >> tests.
> >
> > Ross: I'm not used to looking at these sorts of logs. Can you summarize
> > a bit:
> >
> > 1. Does the problem ever happen in serial runs, or only in parallel?
> >
> > 2. Are you getting "just" crashes (illegal memory access/failed sync.
> > etc), or do you get jobs that appear to finish OK but give the wrong
> > result? That is, are jobs that report Etot = -2707218.6220 really
> > supposed to be the same as the ones that report Etot = -2709883.4871?
> >
> > ...thx...dac
> >
> >
> > _______________________________________________
> > AMBER-Developers mailing list
> > AMBER-Developers.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>
>
Received on Sat Jun 16 2018 - 16:30:02 PDT