[AMBER-Developers] dreaded unspecified launch failures

From: David A Case <case.biomaps.rutgers.edu>
Date: Fri, 22 Mar 2013 12:32:45 -0400

1. I'm getting the dreaded launch failures on recent cuda builds when running
the test suite. This is with the latest git repo, configured with "-cuda gnu"
and:

casegroup1% nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Thu_Apr__5_00:24:31_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221

There were about 8 launch failures, the first in the large_solute_count
or dhfr directories, generally with ntb=2 set (so I'm thinking it is likely to
be a problem with the new barostat code, since all these errors occur in
constant pressure runs).

2. If I rewind to commit 511ef9f0c8227e706c1be from March 8, all these go
away. (There are 10 minor diffs, which I'm assuming have something to do with
using a different GPU than the one that produced the saved test results.)

3. Cruise control is not a lot of help here. For one thing, the
"Test_cuda_parallel_gnu-4.4.6" build is actually not testing the cuda code
at all, but appears to be running general Amber tests (not cuda tests).

The "test_cuda_serial_gnu-4.4.6" build just says "352 tests experienced
errors!", but the history doesn't (seem to?) go back far enough to figure
out when things broke. There is a graph of testing times (which might be
helpful), but the x-axis has no labels on it.

The "test_cuda_parallel_intel-11.1.069" build doesn't help, since all the
tests just say "no CUDA-capable device is detected".

4. Is anyone else seeing problems with cuda and the latest git? I can give
more details than I did in item 1, above, but my guess is that everyone is
likely to see the same thing when compiling with gnu and running the
tests. But if it works for others, then I will know that my case is special,
and I can spend more time trying to figure out the problem.

5. I'm wondering if we just have way too many things going on in cruise
control(?) If lots of things are broken, maybe it gets too hard to fix and
people get discouraged. Should we consider scaling back to a few tests, then
slowly adding more as we get things to work?
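
One way to narrow down the breakage described in item 2 would be a "git bisect" between the March 8 commit and the current head. The following is only a minimal sketch of a log check that could drive such a bisection. It assumes the test output is captured as text and that failing runs print the CUDA error string "unspecified launch failure"; the function names are hypothetical and not part of Amber.

```python
import re

# CUDA's message for cudaErrorLaunchFailure; assumed to appear verbatim
# in the captured test output.
LAUNCH_FAILURE = re.compile(r"unspecified launch failure", re.IGNORECASE)


def count_launch_failures(log_text):
    """Count lines in a captured test log that report a launch failure."""
    return sum(1 for line in log_text.splitlines() if LAUNCH_FAILURE.search(line))


def bisect_status(log_text):
    """Return an exit status in the convention `git bisect run` uses:
    0 marks the commit as good, 1 marks it as bad."""
    return 1 if count_launch_failures(log_text) > 0 else 0
```

A wrapper script could rebuild, run the constant pressure tests, pass the captured output to bisect_status, and exit with the result, so that "git bisect run" can walk the commits between 511ef9f0c8227e706c1be and the current head automatically.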

...thx...dac



_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Mar 22 2013 - 10:00:03 PDT