Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: David Cerutti <dscerutti.gmail.com>
Date: Sun, 17 Jun 2018 15:11:19 -0400

Update: the code does NOT in fact show this problem as late as Jan. 26th,
2018. (My testing script was bugged and was always referencing the current
installation.) I am backtracking from the latest pmemd commits in March to
see when this happened, but it appears to be a relatively short window of
time that does not include the bulk of the GTI work or the things I
contributed, so hopefully we can pin this down reasonably soon.
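
For reference, the backtracking amounts to a loop like the sketch below (a
rough outline, not the actual test script; the commit list, build command,
and input files are placeholders). Each candidate commit gets its own
install prefix, so the test can no longer fall back silently to the current
installation, and each commit is run enough times that a clean pass means
something given the roughly 1-in-10 crash rate Ross described below.

import subprocess

CANDIDATES = ["<sha1>", "<sha2>"]     # e.g. git rev-list over the Jan 26 .. March pmemd commits
TRIALS = 20                           # crashes are ~1 in 10 runs, so a single run proves little
BUILD = "./build_amber.sh {prefix}"   # placeholder: configure + make install into {prefix}

def sh(cmd):
    """Run a shell command and capture its output."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

for commit in CANDIDATES:
    prefix = f"amber_{commit[:8]}"    # one install tree per commit
    sh(f"git checkout {commit}")
    if sh(BUILD.format(prefix=prefix)).returncode != 0:
        print(f"{commit}  build failed")
        continue
    crashes = 0
    for _ in range(TRIALS):
        job = sh(f"{prefix}/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout")
        # The CUDA runtime reports the fault as an "illegal memory access";
        # a nonzero exit code catches other failure modes.
        if job.returncode != 0 or "illegal memory access" in (job.stdout + job.stderr):
            crashes += 1
    print(f"{commit}  {crashes}/{TRIALS} runs crashed")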

My feeling is that, with the further performance enhancements and a fix for
this bug, it may be worthwhile to release a patch in the coming weeks.

Dave


On Sat, Jun 16, 2018 at 7:01 PM, David Cerutti <dscerutti.gmail.com> wrote:

> Recovering from this odd summertime cold...
>
> Having spoken to Ke and found that cuda-memcheck is not very useful for
> pinpointing this with the size of system we need in order to make the
> problem appear, I've been backtracking this problem through the git
> history. It seems that even as early as the end of April 2017, when I had
> made no changes but merely reformatted the code, it was already showing
> this problem. I will continue to backtrack to see if I can find the point
> at which this started, but the problem seems to go pretty far back.
>
> Dave
>
>
> On Fri, Jun 15, 2018 at 8:11 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
>> Hi Dave,
>>
>> It happens on both serial (single GPU) and parallel (2 x GPU) runs. One
>> of two things happens: it either runs to completion and gives the expected
>> result, or it crashes immediately after printing the Results header in
>> mdout with a CUDA illegal memory access. Whether it crashes appears to be
>> random, and crashes occur in maybe only 1 in 10 runs. If it doesn't crash
>> it always seems to give the same result. It also doesn't crash with the
>> small test or the very large test. Thus I think it is likely an array
>> out-of-bounds issue that only occasionally walks on protected memory and
>> depends on the simulation's size / parameters. Valgrind might be able to
>> track it down.
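
A quick way to quantify that behavior is a repeat-run check along these
lines (a rough sketch; the pmemd.cuda invocation and input file names are
placeholders): count the fraction of runs that die with the illegal-access
error and verify that every run that does finish reports the same final
Etot. One thing to keep in mind is that kernel faults in CUDA are reported
asynchronously, at the next synchronizing API call, so the crash surfacing
just after the Results header does not necessarily mean the faulting kernel
runs at that point.

import collections
import re
import subprocess

TRIALS = 50
CMD = "pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout"   # placeholder input names

outcomes = collections.Counter()
for _ in range(TRIALS):
    job = subprocess.run(CMD, shell=True, capture_output=True, text=True)
    if job.returncode != 0 or "illegal memory access" in (job.stdout + job.stderr):
        outcomes["crash"] += 1
        continue
    # Take the last Etot printed in mdout as the run's result.
    with open("mdout") as fh:
        etot = re.findall(r"Etot\s*=\s*(-?\d+\.\d+)", fh.read())[-1]
    outcomes[etot] += 1

# A healthy build should show a single Etot value and no "crash" entries.
print(outcomes)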
>>
>> I've tried it with multiple driver versions, so it doesn't look like a
>> driver issue, and it happens on multiple machines, so it isn't a hardware
>> issue.
>>
>> All the best
>> Ross
>>
>> > On Jun 14, 2018, at 20:43, David A Case <david.case.rutgers.edu> wrote:
>> >
>> > On Thu, Jun 14, 2018, Ross Walker wrote:
>> >>
>> >> I keep seeing failures with AMBER 18 when running GPU validation
>> >> tests.
>> >
>> > Ross: I'm not used to looking at these sorts of logs. Can you summarize
>> > a bit:
>> >
>> > 1. Does the problem ever happen in serial runs, or only in parallel?
>> >
>> > 2. Are you getting "just" crashes (illegal memory access, failed sync,
>> > etc.), or do you get jobs that appear to finish OK but give the wrong
>> > result? That is, are jobs that report Etot = -2707218.6220 really
>> > supposed to be the same as the ones that report Etot = -2709883.4871?
>> >
>> > ...thx...dac
>> >
>> >
>>
>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sun Jun 17 2018 - 12:30:02 PDT