Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: David Cerutti <dscerutti.gmail.com>
Date: Mon, 18 Jun 2018 10:18:06 -0400

I have narrowed the search to the period between February 4th (when
everything seems to work) and February 9th (when some bizarre indeterminacy
crept into the results). Something committed after that date seems to have
fixed the indeterminacy, so we got back on track, but it appears that at
some point between then and March 11th the bug that gives us these
occasional illegal memory accesses was introduced. (Bear in mind that the
illegal accesses are only the final symptom; the problem could be more
subtle when it first appears.) I can only say "it appears" because I can
run 200+ trials with the February 4th code and see no issues, whereas the
most recent code (last updated circa March 11th), as well as the current
GTI branch, tends to fail about once every 40 trials.

The changes since Feb. 4th include a tweak to the non-bonded erfc() spline
tabulation, which actually CORRECTS some illegal memory accesses that had
been happening in the code from 2017 until the start of this year but never
seemed to cause any problems (the results would have been multiplied by
zero, but the array I had allocated was nonetheless being overrun by a few
kB). The new method should be air-tight. To test this, I tried eliminating
all calls to the table by setting ntpr=1: with every non-bonded calculation
involving energy, the older fasterfc() method is called rather than the
table. We still see the bug.
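
For anyone who hasn't stared at that part of the code recently, here is a
rough sketch of the two paths in question. The names, table layout, and
indexing below are illustrative stand-ins, not the actual pmemd code; the
point is just that an unclamped segment index can read a few kB past the
end of a table without changing the answer whenever the looked-up value is
later multiplied by zero.

  // Hypothetical illustration only, not the pmemd source: a cubic-spline
  // lookup for erfc(beta*r)/r indexed by r^2.  Without the clamp, the
  // largest pair distances can read past the end of the table, and the
  // overrun stays silent as long as the result is scaled by zero.
  __device__ float ErfcSplineLookup(const float4 *tbl, int nsegments,
                                    float r2, float scale)
  {
    int idx = (int)(r2 * scale);         // segment index derived from r^2
    idx = min(idx, nsegments - 1);       // clamp: the "air-tight" guard
    float dr = r2 * scale - (float)idx;  // fractional position in the segment
    float4 c = tbl[idx];                 // {a, b, c, d} spline coefficients
    return ((c.x * dr + c.y) * dr + c.z) * dr + c.w;
  }

  // Stand-in for the analytic path taken whenever energies are requested
  // (every step when ntpr=1): evaluate erfc directly, never touching the
  // table.
  __device__ float ErfcDirect(float beta, float r)
  {
    return erfcf(beta * r) / r;
  }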

The other minor change is that I tacked the charge grid initialization (it
needs to be zeroed out at the beginning of each step) onto the bond work
units kernel rather than calling cuda_memset. This is the right idea, but
that kernel is already something of a register bottleneck, so I am planning
some additional "best practices" edits. In any case, I tried eliminating
this new feature and forced the code back to the earlier cuda_memset call.
Again, the bug still lurks in there somewhere.
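
To make the comparison concrete, here is a minimal sketch of the two
strategies under discussion; qgrid, nGridPts, and kBondWorkUnits are
placeholder names for this illustration, not the real pmemd identifiers.

  // Hypothetical illustration only, not the pmemd source.
  // Strategy A (the earlier approach): a separate asynchronous memset that
  // clears the charge grid at the start of each step.
  void ClearChargeGrid(float *qgrid, int nGridPts, cudaStream_t stream)
  {
    cudaMemsetAsync(qgrid, 0, nGridPts * sizeof(float), stream);
  }

  // Strategy B (the new approach): fold the zeroing into the front of an
  // existing kernel so no extra launch is needed.  The PME charge-spreading
  // kernel launched later in the same stream then sees a zeroed grid.  The
  // cost is extra work, and register pressure, in a kernel that is already
  // tight on registers.
  __global__ void kBondWorkUnits(float *qgrid, int nGridPts /*, ... */)
  {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nGridPts;
         i += gridDim.x * blockDim.x) {
      qgrid[i] = 0.0f;                   // grid-stride clear of the charge mesh
    }
    // ... bonded work units would follow here ...
  }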

In short, I spent a good amount of time trying to track this down over the
weekend, but I haven't found it yet. I can say with some confidence that it
came in just before the release, after the bulk of the past year's changes
and improvements were already in place, but I could use some additional
effort from those familiar with the code who were adding things during this
period.

Dave


On Sun, Jun 17, 2018 at 3:45 PM, David Cerutti <dscerutti.gmail.com> wrote:

> I'd do that if I felt this was a needle-in-a-haystack kind of thing, but
> I know enough about what was happening around the time of each commit that
> I want it to be a manual process. Plus, the installation on the machines I
> have forces me to make local changes to configure2 every time I check out
> a commit, so I'm fine with just perusing the commit dates and comments to
> see which one is of interest next. But, thanks!
>
> Dave
>
>
> On Sun, Jun 17, 2018 at 3:31 PM, Jason Swails <jason.swails.gmail.com>
> wrote:
>
>> git bisect can automate the process and identify the bad commit in log2
>> time.
>>
>> In case you hadn’t heard of that command before.
>>
>> Good luck,
>> Jason
>>
>> --
>> Jason M. Swails
>>
>> > On Jun 17, 2018, at 3:11 PM, David Cerutti <dscerutti.gmail.com> wrote:
>> >
>> > Update: the code does NOT in fact show this problem as late as Jan. 26th,
>> > 2018. (My testing script was buggy and always referenced the current
>> > installation.) I am backtracking from the latest pmemd commits in March to
>> > see when this happened, but it appears to be a relatively short window of
>> > time that does not include the bulk of GTI or the things that I
>> > contributed, so hopefully we can pin this down reasonably soon.
>> >
>> > My feeling is that, with the further performance enhancements and a bug
>> > fix for this, it may be worthwhile to release this as a patch in the
>> > coming weeks.
>> >
>> > Dave
>> >
>> >
>> >> On Sat, Jun 16, 2018 at 7:01 PM, David Cerutti <dscerutti.gmail.com>
>> >> wrote:
>> >>
>> >> Recovering from this odd summertime cold...
>> >>
>> >> Having spoken to Ke and finding that cuda-memcheck is not very useful
>> >> for pinpointing this with the size of system we need in order to make
>> >> the problem appear, I've been backtracking this problem through the git
>> >> history. It seems that even as early as the end of April 2017, when I
>> >> had made no changes but merely reformatted, the code is showing this
>> >> problem. I will continue to backtrack to see if I can find the point at
>> >> which this started, but the problem seems to go pretty far back.
>> >>
>> >> Dave
>> >>
>> >>
>> >> On Fri, Jun 15, 2018 at 8:11 PM, Ross Walker <ross.rosswalker.co.uk>
>> >> wrote:
>> >>
>> >>> Hi Dave,
>> >>>
>> >>> It happens on both serial (single GPU) and parallel (2 x GPU) runs. One
>> >>> of two things happens: it either runs to completion and gives the
>> >>> expected result, or it crashes immediately after printing the Results
>> >>> header in mdout with a CUDA illegal memory access. Whether it crashes
>> >>> appears to be random, and the crashes occur in maybe only 1 in 10 runs.
>> >>> If it doesn't crash, it always seems to give the same result. It also
>> >>> doesn't crash with the small test or the very large test. Thus I think
>> >>> it is likely an array out-of-bounds issue that only occasionally walks
>> >>> on protected memory and depends on the simulation size / parameters.
>> >>> Valgrind might be able to track it down.
>> >>>
>> >>> I've tried it with multiple driver versions, so it doesn't look like a
>> >>> driver issue, and it happens on multiple machines, so it isn't a
>> >>> hardware issue.
>> >>>
>> >>> All the best
>> >>> Ross
>> >>>
>> >>>> On Jun 14, 2018, at 20:43, David A Case <david.case.rutgers.edu>
>> >>>> wrote:
>> >>>>
>> >>>> On Thu, Jun 14, 2018, Ross Walker wrote:
>> >>>>>
>> >>>>> I keep seeing failures with AMBER 18 when running GPU validation
>> >>>>> tests.
>> >>>>
>> >>>> Ross: I'm not used to looking at these sorts of logs. Can you
>> >>>> summarize a bit:
>> >>>>
>> >>>> 1. Does the problem ever happen in serial runs, or only in parallel?
>> >>>>
>> >>>> 2. Are you getting "just" crashes (illegal memory access, failed sync,
>> >>>> etc.), or do you get jobs that appear to finish OK but give the wrong
>> >>>> result? That is, are jobs that report Etot = -2707218.6220 really
>> >>>> supposed to be the same as the ones that report Etot = -2709883.4871?
>> >>>>
>> >>>> ...thx...dac
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>>
>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Mon Jun 18 2018 - 07:30:03 PDT