Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: Jason Swails <jason.swails.gmail.com> Date: Sun, 17 Jun 2018 15:31:42 -0400

--
Jason M. Swails 
> On Jun 17, 2018, at 3:11 PM, David Cerutti <dscerutti.gmail.com> wrote:
> 
> Update: the code does NOT in fact show this problem as late as Jan. 26th,
> 2018. (My testing script had a bug and always referenced the current
> installation.) I am backtracking from the latest pmemd commits in March to
> see when this happened, but it appears to be a relatively short window of
> time that does not include the bulk of GTI or things that I contributed, so
> hopefully we can pin this down reasonably soon.
> 
> My feeling is that with the further performance enhancements and a bug fix
> to this, it may be worthwhile to release this as a patch in the coming
> weeks.
> 
> Dave
> 
> 
>> On Sat, Jun 16, 2018 at 7:01 PM, David Cerutti <dscerutti.gmail.com> wrote:
>> 
>> Recovering from this odd summertime cold...
>> 
>> Having spoken to Ke and finding that cuda-memcheck is not very useful for
>> pinpointing this with the size of system we need in order to make the
>> problem appear, I've been backtracking this problem through the git
>> history.  It seems that even as early as the end of April 2017, when I had
>> made no changes but merely reformatted, the code is showing this problem.
>> I will continue to backtrack to see if I can find the point at which this
>> started, but the problem seems to go pretty far back.
>> 
>> Dave
>> 
>> 
>> On Fri, Jun 15, 2018 at 8:11 PM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>> 
>>> Hi Dave,
>>> 
>>> It happens on both serial (single GPU) and parallel (2 x GPU) runs. One
>>> of two things happens. It either runs to completion and gives the expected
>>> result or it crashes immediately after printing the Results header in mdout
>>> with a CUDA illegal memory access. It seems to be random whether it
>>> crashes or not and the crashes are maybe only 1 in 10 runs. If it doesn't
>>> crash it always seems to give the same result. It also doesn't crash with
>>> the small test or the very large test. Thus I think it is likely an array
>>> out of bounds issue that only occasionally walks on protected memory and
>>> depends on the simulation size / parameters. Valgrind might be able to
>>> track it down.
>>> 
>>> I've tried it with multiple driver versions so it doesn't look like a
>>> driver issue and it happens on multiple machines so it isn't a hardware
>>> issue.
>>> 
>>> All the best
>>> Ross
>>> 
>>>> On Jun 14, 2018, at 20:43, David A Case <david.case.rutgers.edu> wrote:
>>>> 
>>>> On Thu, Jun 14, 2018, Ross Walker wrote:
>>>>> 
>>>>> I keep seeing failures with AMBER 18 when running GPU validation
>>>>> tests.
>>>> 
>>>> Ross: I'm not used to looking at these sorts of logs.  Can you summarize
>>>> a bit:
>>>> 
>>>> 1. Does the problem ever happen in serial runs, or only in parallel?
>>>> 
>>>> 2. Are you getting "just" crashes (illegal memory access, failed sync,
>>>> etc.), or do you get jobs that appear to finish OK but give the wrong
>>>> result?  That is, are jobs that report Etot = -2707218.6220 really
>>>> supposed to be the same as the ones that report Etot = -2709883.4871?
>>>> 
>>>> ...thx...dac
>>>> 
>>>> 
>>>> _______________________________________________
>>>> AMBER-Developers mailing list
>>>> AMBER-Developers.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber-developers