git bisect can automate the process and identify the bad commit in log2(N)
steps, in case you hadn’t heard of that command before.
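
Roughly, with placeholders for whichever known-good and known-bad commits
you already have in hand:

    git bisect start
    git bisect bad  <known-bad-commit>    # e.g. one of the March pmemd commits
    git bisect good <known-good-commit>   # e.g. the clean Jan. 26 commit
    # git checks out a commit roughly halfway in between; rebuild pmemd.cuda,
    # run the failing case, and mark the result:
    git bisect good      # or: git bisect bad
    # repeat until git names the first bad commit, then clean up with:
    git bisect reset

If you can wrap the build and test in a script that exits 0 on success and
nonzero on failure ("./test_case.sh" here is just a placeholder name), then
"git bisect run ./test_case.sh" will do the marking for you.
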
Good luck,
Jason
--
Jason M. Swails
> On Jun 17, 2018, at 3:11 PM, David Cerutti <dscerutti.gmail.com> wrote:
>
> Update: the code does NOT in fact show this problem as late as Jan. 26th,
> 2018. (My testing script was bugged and kept referencing the current
> installation.) I am backtracking from the latest pmemd commits in March to
> see when this happened, but it appears to be a relatively short window of
> time that does not include the bulk of GTI or things that I contributed, so
> hopefully we can pin this down reasonably soon.
>
> My feeling is that with the further performance enhancements and a bug fix
> to this, it may be worthwhile to release this as a patch in the coming
> weeks.
>
> Dave
>
>
>> On Sat, Jun 16, 2018 at 7:01 PM, David Cerutti <dscerutti.gmail.com> wrote:
>>
>> Recovering from this odd summertime cold...
>>
>> Having spoken to Ke and finding that cuda-memcheck is not very useful for
>> pinpointing this with the size of system we need in order to make the
>> problem appear, I've been backtracking this problem through the git
>> history. It seems that even as early as the end of April 2017, when I had
>> made no changes but merely reformatted, the code was already showing this problem.
>> I will continue to backtrack to see if I can find the point at which this
>> started, but the problem seems to go pretty far back.
>>
>> Dave
>>
>>
>> On Fri, Jun 15, 2018 at 8:11 PM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>
>>> Hi Dave,
>>>
>>> It happens on both serial (single GPU) and parallel (2 x GPU) runs. One
>>> of two things happens. It either runs to completion and gives the expected
>>> result or it crashes immediately after printing the Results header in mdout
>>> with a CUDA illegal memory access. It seems to be random whether it
>>> crashes or not, and it crashes in maybe only 1 in 10 runs. If it doesn't
>>> crash it always seems to give the same result. It also doesn't crash with
>>> the small test or the very large test. Thus I think it is likely an array
>>> out of bounds issue that only occasionally walks on protected memory and
>>> depends on the simulation size / parameters. Valgrind might be able to
>>> track it down.
>>>
>>> I've tried it with multiple driver versions, so it doesn't look like a
>>> driver issue, and it happens on multiple machines, so it isn't a hardware
>>> issue.
>>>
>>> All the best
>>> Ross
>>>
>>>> On Jun 14, 2018, at 20:43, David A Case <david.case.rutgers.edu> wrote:
>>>>
>>>> On Thu, Jun 14, 2018, Ross Walker wrote:
>>>>>
>>>>> I keep seeing failures with AMBER 18 when running GPU validation
>>>>> tests.
>>>>
>>>> Ross: I'm not used to looking at these sorts of logs. Can you summarize
>>>> a bit:
>>>>
>>>> 1. Does the problem ever happen in serial runs, or only in parallel?
>>>>
>>>> 2. Are you getting "just" crashes (illegal memory access, failed sync,
>>>> etc.), or do you get jobs that appear to finish OK but give the wrong
>>>> result? That is, are jobs that report Etot = -2707218.6220 really
>>>> supposed to be the same as the ones that report Etot = -2709883.4871?
>>>>
>>>> ...thx...dac
>>>>
>>>>
>>>
>>>
>>>
>>
>>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sun Jun 17 2018 - 13:00:02 PDT