Re: [AMBER-Developers] dreaded unspecified launch failures

From: Scott Le Grand <varelse2005.gmail.com>
Date: Fri, 22 Mar 2013 11:45:27 -0700

This *is* why it's failing. I pass pi_vol_inv to the scalar sum as a
parameter, and computing it, not surprisingly, needs PI on the CPU.

Is this something needed for the windows build or is this someone's clever
idea of a code cleanup?
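A minimal sketch of one way to keep PI usable on both sides. It assumes pmemd.cuda-style PMEDouble/PMEFloat typedefs. PI and PI_F stay plain static const so CPU code can read them. Kernels get a separate __constant__ copy under a different name, cPI_F, which is hypothetical. pi_vol_inv is computed on the host and passed to the scalar sum by value. The helper host_pi_vol_inv is also hypothetical, and the 1/(PI * volume) form is assumed from the variable's name.

```cpp
#include <cassert>
#include <cmath>

typedef double PMEDouble;  // assumption: mirrors pmemd.cuda's precision typedefs
typedef float  PMEFloat;

#define PI_VAL 3.1415926535897932384626433832795

// Host-visible constants: ordinary namespace-scope objects, so CPU code
// (including .cpp files that include this header) reads the real value.
static const PMEDouble PI   = (PMEDouble) PI_VAL;
static const PMEFloat  PI_F = (PMEFloat)  PI_VAL;

#ifdef __CUDACC__
// Device-only copy under a separate, hypothetical name (cPI_F).
// Kernels read this one; host code never touches it.
static __constant__ PMEFloat cPI_F = (PMEFloat) PI_VAL;
#endif

// Hypothetical host helper: the reciprocal-space prefactor is computed on
// the CPU from the host-visible PI and then handed to the kernel by value.
static PMEDouble host_pi_vol_inv(PMEDouble volume)
{
    assert(volume > 0.0);
    PMEDouble pi_vol_inv = 1.0 / (PI * volume);
    // Catch a zero or garbage PI on the host before it reaches the GPU.
    assert(std::isfinite(pi_vol_inv));
    return pi_vol_inv;
}
```

If PI reads as zero on the host, as it did here, 1/(PI * volume) is infinite before it ever reaches the kernel. Giving the device copy its own name keeps host code from silently picking up the __constant__ symbol.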

On Fri, Mar 22, 2013 at 11:39 AM, Jason Swails <jason.swails.gmail.com> wrote:

> This was changed back in August of 2011 (git blame is a wonderful tool ;)).
>
> Did one of your recent fixes for NTP use PI for the first time on the CPU?
> That would explain why it only just started failing, maybe...
>
> On Fri, Mar 22, 2013 at 2:20 PM, Scott Le Grand <varelse2005.gmail.com> wrote:
>
> > And here's the problem:
> > static __constant__ const PMEDouble PI   = (PMEDouble) PI_VAL;
> > static __constant__ const PMEFloat  PI_F = (PMEFloat) PI_VAL;
> >
> >
> > This makes these values unusable from CPU code.
> >
> > Why were they changed from "const" to "__constant__"? This was not my
> > doing.
> >
> > Previously they were plain const:
> >
> > #define PI_VAL 3.1415926535897932384626433832795
> > static const PMEDouble PI   = (PMEDouble) PI_VAL;
> > static const PMEFloat  PI_F = (PMEFloat) PI_VAL;
> >
> > On Fri, Mar 22, 2013 at 11:15 AM, Scott Le Grand <varelse2005.gmail.com> wrote:
> >
> > > Somehow the constant PI is zero in the main tree. This is some sort
> > > of bizarro linker issue; it is not zero in my tree. All else follows
> > > from this.
> > >
> > >
> > >
> > >
> > > On Fri, Mar 22, 2013 at 10:56 AM, Scott Le Grand <varelse2005.gmail.com> wrote:
> > >
> > >> I see where it's dying: the Ewald sum in NTP. I'm bewildered as to
> > >> why right now...
> > >>
> > >>
> > >>
> > >> On Fri, Mar 22, 2013 at 9:39 AM, Scott Le Grand <varelse2005.gmail.com> wrote:
> > >>
> > >>> I see it too, but my work tree doesn't repro.
> > >>>
> > >>> Will try to figure out what's different.
> > >>>
> > >>>
> > >>>
> > >>> On Fri, Mar 22, 2013 at 9:32 AM, David A Case <case.biomaps.rutgers.edu> wrote:
> > >>>
> > >>>> 1. I'm getting the dreaded launch failures on recent cuda builds
> > >>>> with the test suite. This is with the latest git repo, configured
> > >>>> with "-cuda gnu" and:
> > >>>>
> > >>>> casegroup1% nvcc --version
> > >>>> nvcc: NVIDIA (R) Cuda compiler driver
> > >>>> Copyright (c) 2005-2012 NVIDIA Corporation
> > >>>> Built on Thu_Apr__5_00:24:31_PDT_2012
> > >>>> Cuda compilation tools, release 4.2, V0.2.1221
> > >>>>
> > >>>> There were about 8 launch failures, the first in the
> > >>>> large_solute_count or dhfr directories, with ntb=2 generally set (so
> > >>>> I'm thinking it is likely to be a problem with the new barostat
> > >>>> code, since all these errors occur in constant pressure runs).
> > >>>>
> > >>>> 2. If I rewind to commit 511ef9f0c8227e706c1be from March 8, all
> > >>>> these go away. (There are 10 minor diffs, which I'm assuming have
> > >>>> something to do with using a different GPU than the one used for
> > >>>> the saved test results.)
> > >>>>
> > >>>> 3. Cruise control is not a lot of help here. For one thing, the
> > >>>> "Test_cuda_parallel_gnu-4.4.6" build is actually not testing the
> > >>>> cuda code at all, but appears to be running general Amber tests
> > >>>> (not cuda tests).
> > >>>>
> > >>>> The "test_cuda_serial_gnu-4.4.6" build just says "352 tests
> > >>>> experienced errors!", but the history doesn't (seem to?) go back far
> > >>>> enough to figure out when things broke. There is a graph of testing
> > >>>> times (which might be helpful), but the x-axis has no labels on it.
> > >>>>
> > >>>> The "test_cuda_parallel_intel-11.1.069" build doesn't help, since
> > >>>> all the tests just say "no CUDA-capable device is detected".
> > >>>>
> > >>>> 4. Is anyone else seeing problems with cuda and the latest git? I
> > >>>> can give more details than I did in part 1, above, but my guess is
> > >>>> that everyone is likely to see the same thing by compiling with gnu
> > >>>> and running the tests. If it works for others, then I will know that
> > >>>> my case is special, and I can spend more time trying to figure out
> > >>>> the problem.
> > >>>>
> > >>>> 5. I'm wondering if we just have way too many things going on in
> > >>>> cruise control(?) If lots of things are broken, maybe it gets too
> > >>>> hard to fix and people get discouraged. Should we consider scaling
> > >>>> back to a few tests, then slowly adding more as we get things to
> > >>>> work?
> > >>>>
> > >>>> ...thx...dac
> > >>>>
> > >>>>
> > >>>>
> > >>>> _______________________________________________
> > >>>> AMBER-Developers mailing list
> > >>>> AMBER-Developers.ambermd.org
> > >>>> http://lists.ambermd.org/mailman/listinfo/amber-developers
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>
>
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Candidate
> 352-392-4032
>
Received on Fri Mar 22 2013 - 12:00:04 PDT