Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: David Cerutti <dscerutti.gmail.com>
Date: Sun, 17 Jun 2018 15:45:49 -0400

I'd do that if I felt this was a needle-in-a-haystack kind of thing, but I
know enough about what was happening around the time of each commit that I
want it to be a manual process. Plus, the installation on the machines I
have forces me to make local changes to configure2 every time I check out
a commit, so I'm fine with just perusing the commit dates and comments to
see which one is of interest next. But thanks!

Dave


On Sun, Jun 17, 2018 at 3:31 PM, Jason Swails <jason.swails.gmail.com>
wrote:

> git bisect can automate the process and identify the bad commit in
> log2(N) steps.
>
> In case you hadn’t heard of that command before.
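>
> If it helps, here is a rough sketch of the kind of test script one could
> hand to "git bisect run" -- the build/run commands and input file names
> below are placeholders rather than your actual validation setup:
>
>     #!/usr/bin/env python3
>     # Hypothetical helper for "git bisect run": exit 0 = good commit,
>     # 1 = bad commit, 125 = skip (commit does not build).
>     # Build and run commands below are placeholders.
>     import subprocess
>     import sys
>
>     def sh(cmd):
>         return subprocess.run(cmd, shell=True).returncode
>
>     if sh("./configure -cuda gnu && make -j8 pmemd.cuda") != 0:
>         sys.exit(125)      # cannot build this commit; tell bisect to skip it
>
>     for _ in range(10):    # the crash shows up in roughly 1 run in 10
>         if sh("./pmemd.cuda -O -i mdin -p prmtop -c inpcrd") != 0:
>             sys.exit(1)    # reproduced the crash: bad commit
>     sys.exit(0)            # all runs clean: good commit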
>
> Good luck,
> Jason
>
> --
> Jason M. Swails
>
> > On Jun 17, 2018, at 3:11 PM, David Cerutti <dscerutti.gmail.com> wrote:
> >
> > Update: the code does NOT in fact show this problem as late as Jan. 26,
> > 2018. (My testing script had a bug and was always referencing the current
> > installation.) I am backtracking from the latest pmemd commits in March to
> > see when this happened, but it appears to be a relatively short window of
> > time that does not include the bulk of GTI or the things that I
> > contributed, so hopefully we can pin this down reasonably soon.
> >
> > My feeling is that, with the further performance enhancements and a fix
> > for this bug, it may be worthwhile to release this as a patch in the
> > coming weeks.
> >
> > Dave
> >
> >
> >> On Sat, Jun 16, 2018 at 7:01 PM, David Cerutti <dscerutti.gmail.com> wrote:
> >>
> >> Recovering from this odd summertime cold...
> >>
> >> Having spoken to Ke and found that cuda-memcheck is not very useful for
> >> pinpointing this with the size of system we need in order to make the
> >> problem appear, I've been backtracking this problem through the git
> >> history. It seems that even as early as the end of April 2017, when I had
> >> made no changes but merely reformatted, the code was already showing this
> >> problem. I will continue to backtrack to see if I can find the point at
> >> which this started, but the problem seems to go pretty far back.
> >>
> >> Dave
> >>
> >>
> >> On Fri, Jun 15, 2018 at 8:11 PM, Ross Walker <ross.rosswalker.co.uk>
> >> wrote:
> >>
> >>> Hi Dave,
> >>>
> >>> It happens on both serial (single-GPU) and parallel (2 x GPU) runs. One
> >>> of two things happens: it either runs to completion and gives the
> >>> expected result, or it crashes immediately after printing the Results
> >>> header in mdout with a CUDA illegal memory access. It seems to be random
> >>> whether it crashes or not, and the crashes occur in maybe only 1 in 10
> >>> runs. If it doesn't crash, it always seems to give the same result. It
> >>> also doesn't crash with the small test or the very large test. Thus I
> >>> think it is likely an array out-of-bounds issue that only occasionally
> >>> walks on protected memory and depends on the simulation size /
> >>> parameters. Valgrind might be able to track it down.
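> >>>
> >>> As a quick way to put a number on the crash rate, one can rerun the same
> >>> input many times and count the failures. A rough sketch in Python (the
> >>> command line and input files are placeholders for the actual test case):
> >>>
> >>>     #!/usr/bin/env python3
> >>>     # Rerun the same pmemd.cuda job repeatedly and tally how often the
> >>>     # illegal-memory-access crash appears. Command/inputs are placeholders.
> >>>     import subprocess
> >>>
> >>>     CMD = "./pmemd.cuda -O -i mdin -p prmtop -c inpcrd"
> >>>     N = 50
> >>>     crashes = sum(
> >>>         subprocess.run(CMD, shell=True).returncode != 0 for _ in range(N))
> >>>     print(f"{crashes} of {N} runs crashed")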
> >>>
> >>> I've tried it with multiple driver versions, so it doesn't look like a
> >>> driver issue, and it happens on multiple machines, so it isn't a
> >>> hardware issue.
> >>>
> >>> All the best
> >>> Ross
> >>>
> >>>> On Jun 14, 2018, at 20:43, David A Case <david.case.rutgers.edu> wrote:
> >>>>
> >>>> On Thu, Jun 14, 2018, Ross Walker wrote:
> >>>>>
> >>>>> I keep seeing failures with AMBER 18 when running GPU validation
> >>>>> tests.
> >>>>
> >>>> Ross: I'm not used to looking at these sorts of logs. Can you
> >>>> summarize a bit:
> >>>>
> >>>> 1. Does the problem ever happen in serial runs, or only in parallel?
> >>>>
> >>>> 2. Are you getting "just" crashes (illegal memory access, failed sync,
> >>>> etc.), or do you get jobs that appear to finish OK but give the wrong
> >>>> result? That is, are jobs that report Etot = -2707218.6220 really
> >>>> supposed to be the same as the ones that report Etot = -2709883.4871?
> >>>>
> >>>> ...thx...dac
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sun Jun 17 2018 - 13:00:02 PDT