Re: [AMBER-Developers] Oops, I introduced a bug into pmemd.cuda_SPFP (master branch), but how shall I resolve it?

From: David Cerutti <dscerutti.gmail.com>
Date: Wed, 25 Oct 2017 16:16:38 -0400

It's possible the non-equilibrium MD would also be affected--I've fixed the
problem in my branch and am preparing the hotfix in master. It's kind of
remarkable that the damage from this was only one or two features of the
code, and features that I had not directly dealt with at all. In any case,
I'm also looking into what we can do to remedy the problem of so many false
positives for failures in the SPFP test suite. I ran both test suites
before committing--what happened is that there is a baseline of 40+ test
cases that "fail" in the SPFP mode, and in skimming those diffs I didn't
see the problem at first. It was only after doing about a half hour of
parsing through all diffs one by one, when looking for changes to dihedral
energies for another edit I've been doing, that I noticed the error in
constant pH simulations. It's reasonable to ask for that level of review
before commits to master, but on the other hand shouldn't we be able to
clamp down on the number of false positives?

Dave


On Wed, Oct 25, 2017 at 4:04 PM, Daniel Roe <daniel.r.roe.gmail.com> wrote:

> On Wed, Oct 25, 2017 at 1:09 PM, David Cerutti <dscerutti.gmail.com>
> wrote:
> > Beware, I've just realized that pmemd.cuda_SPFP in the master branch has
> a
> > bug that shows up in the constant pH routines (those results are bad).
> > DPFP is fine. I've determined what I think the problem--something I did
> > about two months ago with the intention of clarifying things, but I
> didn't
> > go far enough. It only shows up in the constant pH routines because
> things
> > are getting refreshed frequently, mixing situations where the intended
> > change takes effect and those where it does not.
>
> Yikes - glad you caught this! I can confirm that this is an issue; the
> CUDA explicit solvent constant pH test case energies are off by
> several thousand kcal/mol. Are you sure constant pH is the only thing
> affected? One of the non-eq. free energy test cases
> (test/cuda/nfe/pmd) also has an energy that is off by about 1 kcal/mol
> which seems like too much.
>
> Until we have working CI testing in Amber it is incumbent upon all
> developers to test test **test** the code before committing to master.
> I know that several developers (myself included) use the GIT master
> branch for production dynamics. I am in fact running some constant pH
> simulations now but fortunately I'm using the Amber 16 version (which
> I've just checked does pass the test case) so I don't need to throw
> out any results.
>
> Developers who are working with pmemd.cuda should always run the SPFP
> tests in addition to the default DPFP, especially since this is the
> code that users will actually be using 99% of the time. Maybe we
> should have different test outputs for every shader model type as
> well?
>
> Again, I think this confirms that we absolutely need CI testing. There
> is no way that a bug like this (that is detected by an existing test
> case) should have ever made it into the master branch. I can look into
> the possibility of hosting a server at NIH that will do this, although
> it may be tricky since there are a lot of hoops to jump through for
> setting up a public-facing web host.
>
> -Dan
>
> >
> > The change was this: there is an array of float2 (two fp32s) called
> > pbAtomChargeSPLJID. That is, "atom charge (single precision) PLUS
> Lennard
> > Jones type index." It's good to couple these things for memory access
> > efficiency, but it's bad because at some point we end up representing an
> > integer as a float, then casting it back to an integer, to index into
> > another array (Lennard Jones parameter tables). This is quite safe for
> the
> > Lennard Jones typing--integers in the range +/- 2^23 can be represented
> > exactly by fp32. But it's the whole issue of INTENT in the way that
> array
> > is handled--I made it so that the charge was loaded as a float (no
> change),
> > but the LJID was loaded as a __uint_as_float(), then read back as a
> > __float_as_uint() to explicitly preserve the bitwise representation.
> (This
> > can also be done with unions.) Speed is not an issue for this case--the
> > conversion back to integer happens outside a loop where the real work
> gets
> > done.
> >
> > But, before I fix this, I'd like to float (!) the question of what to do
> in
> > these situations. It looks clunky to explicitly load and unload an
> > unsigned int into the float2 data--but it serves as a reminder of what is
> > really happening and in some situations it will be faster.
> >
> > Dave
> > _______________________________________________
> > AMBER-Developers mailing list
> > AMBER-Developers.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>
>
>
> --
> -------------------------
> Daniel R. Roe
> Laboratory of Computational Biology
> National Institutes of Health, NHLBI
> 5635 Fishers Ln, Rm T900
> Rockville MD, 20852
> https://www.lobos.nih.gov/lcb
>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Oct 25 2017 - 13:30:05 PDT
Custom Search