Re: [AMBER-Developers] Random crashes AMBER 18 on GPUs

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 15 Jun 2018 20:11:17 -0400

Hi Dave,

It happens on both serial (single GPU) and parallel (2 x GPU) runs. One of two things happens. It either runs to completion and gives the expected result, or it crashes immediately after printing the Results header in mdout with a CUDA illegal memory access. Whether it crashes seems to be random, and the crashes occur in maybe only 1 in 10 runs. When it doesn't crash it always seems to give the same result. It also doesn't crash with the small test or the very large test. So I think it is likely an array out-of-bounds issue that only occasionally walks on protected memory and depends on the simulation size / parameters. Valgrind might be able to track it down.

I've tried it with multiple driver versions, so it doesn't look like a driver issue, and it happens on multiple machines, so it isn't a hardware issue.
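
Since the crash only shows up in roughly 1 in 10 runs, one way to pin down the failure rate is to repeat the same job in a loop and count the bad runs. Below is a minimal sketch in Python, not a tested part of the AMBER test suite. The function name count_failures and the default marker string are hypothetical, and it assumes the error text reaches stdout or stderr and/or the process exits with a non-zero status. The command to run is whatever you normally use for the validation test.

```python
import subprocess
import sys


def count_failures(cmd, runs=20, marker="illegal memory access"):
    """Run `cmd` (a list of arguments) `runs` times.

    Returns the indices of runs that either exited with a non-zero
    status or printed `marker` (case-insensitive) to stdout/stderr.
    """
    failed = []
    for i in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        output = (proc.stdout + proc.stderr).lower()
        if proc.returncode != 0 or marker.lower() in output:
            failed.append(i)
    return failed
```

Calling count_failures with the argument list for one validation run and, say, runs=50 gives a rough crash frequency, which may help when comparing problem sizes or builds.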

All the best
Ross

> On Jun 14, 2018, at 20:43, David A Case <david.case.rutgers.edu> wrote:
>
> On Thu, Jun 14, 2018, Ross Walker wrote:
>>
>> I keep seeing failures with AMBER 18 when running GPU validation
>> tests.
>
> Ross: I'm not used to looking at these sorts of logs. Can you summarize
> a bit:
>
> 1. Does the problem ever happen in serial runs, or only in parallel?
>
> 2. Are you getting "just" crashes (illegal memory access/failed sync.
> etc), or do you get jobs that appear to finish OK but give the wrong
> result? That is, are jobs that report Etot = -2707218.6220 really
> supposed to be the same as the ones that report Etot = -2709883.4871?
>
> ...thx...dac
>
>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers


Received on Fri Jun 15 2018 - 17:30:03 PDT