Re: [AMBER-Developers] dreaded unspecified launch failures from Scott Le Grand on 2013-03-22 (Amber Developers Archive Mar 2013)

From: Scott Le Grand <varelse2005.gmail.com>
Date: Fri, 22 Mar 2013 11:15:58 -0700

The constant PI is zero somehow in the main tree. This is some sort of
bizarro linker issue. It is not that way in mine. All else follows from
this.
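
A minimal sketch of a host-side sanity check that would flag a zeroed
constant like this at startup. It is an illustration only, not code from the
Amber tree; PI_CONST and constant_is_sane are hypothetical names, and the
tolerance is an assumed value.

```cpp
#include <cmath>

// Hypothetical stand-in for a shared PI constant; the name is an assumption
// and does not come from the Amber source.
static const double PI_CONST = 3.14159265358979323846;

// Returns true when `value` is finite and within `tol` of `expected`.
// A constant that arrives as zero (the symptom described above) fails this.
bool constant_is_sane(double value, double expected, double tol = 1e-12) {
    return std::isfinite(value) && std::fabs(value - expected) <= tol;
}

// Compares the constant against an independently computed reference,
// acos(-1), so a zeroed or corrupted copy is caught before any kernel
// launch that depends on it.
bool pi_constant_ok() {
    return constant_is_sane(PI_CONST, std::acos(-1.0));
}
```

Calling pi_constant_ok() once during initialization and aborting with a clear
message on failure would turn a silent bad value into an explicit error rather
than a later launch failure.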

On Fri, Mar 22, 2013 at 10:56 AM, Scott Le Grand <varelse2005.gmail.com> wrote:

> I see where it's dying - the Ewald sum in NTP - I'm bewildered as to why
> right now...
>
>
>
> On Fri, Mar 22, 2013 at 9:39 AM, Scott Le Grand <varelse2005.gmail.com> wrote:
>
>> I see it too, but my work tree doesn't repro.
>>
>> Will try to figure out what's different.
>>
>>
>>
>> On Fri, Mar 22, 2013 at 9:32 AM, David A Case <case.biomaps.rutgers.edu> wrote:
>>
>>> 1. I'm getting the dreaded launch failures on recent cuda builds, with the
>>> test suite. This is with the latest git repo, configured with "-cuda gnu"
>>> and:
>>>
>>> casegroup1% nvcc --version
>>> nvcc: NVIDIA (R) Cuda compiler driver
>>> Copyright (c) 2005-2012 NVIDIA Corporation
>>> Built on Thu_Apr__5_00:24:31_PDT_2012
>>> Cuda compilation tools, release 4.2, V0.2.1221
>>>
>>> There were about 8 launch failures, the first in the large_solute_count
>>> or dhfr directories, with ntb=2 generally set (so I'm thinking it is
>>> likely to be a problem with the new barostat codes, since all these
>>> errors occur in constant pressure runs).
>>>
>>> 2. If I rewind to commit 511ef9f0c8227e706c1be from March 8, all these go
>>> away. (There are 10 minor diffs, which I'm assuming have something to do
>>> with using a different GPU than the saved test results.)
>>>
>>> 3. Cruise control is not a lot of help here. For one thing, the
>>> "Test_cuda_parallel_gnu-4.4.6" build is actually not testing the cuda code
>>> at all, but appears to be running general Amber tests (not cuda tests).
>>>
>>> The "test_cuda_serial_gnu-4.4.6" build just says "352 tests experienced
>>> errors!", but the history doesn't (seem to?) go back far enough to figure
>>> out when things broke. There is a graph of testing times (which might be
>>> helpful), but the x-axis has no labels on it.
>>>
>>> The "test_cuda_parallel_intel-11.1.069" build doesn't help, since all the
>>> tests just say "no CUDA-capable device is detected".
>>>
>>> 4. Is anyone else seeing problems with cuda and the latest git? I can give
>>> more details than I did in part 1, above, but my guess is that everyone is
>>> likely to be seeing the same thing, by compiling with gnu and running the
>>> tests. But if it works for others, then I will know that my case is
>>> special, and I can spend more time trying to figure out the problem.
>>>
>>> 5. I'm wondering if we just have way too many things going on in cruise
>>> control(?) If lots of things are broken, maybe it gets too hard to fix and
>>> people get discouraged. Should we consider scaling back to a few tests,
>>> then slowly adding more as we get things to work?
>>>
>>> ...thx...dac
>>>
>>>
>>>
>>> _______________________________________________
>>> AMBER-Developers mailing list
>>> AMBER-Developers.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>>
>>
>>
>
Received on Fri Mar 22 2013 - 11:30:03 PDT