Re: [AMBER-Developers] Please read: we need to look at pmemd.cuda pair list fidelity from David Cerutti on 2018-01-17 (Amber Developers Archive Jan 2018)

From: David Cerutti <dscerutti.gmail.com>
Date: Wed, 17 Jan 2018 12:08:56 -0500

Three line summary of the problem, from users' perspective, is given at the
end of my last message.

- The impact on simulations may be felt when users got into &ewald and
change skinnb to a small value (unlikely) AND had just the right (or wrong)
ratio of (box length) to (cut + skinnb) to create a problem (if you changed
skinnb I'd say you then had about the odds of a Russian Roulette player).
- If pmemd.cuda was used to equilibrate the volume, this problem could have
persisted until the simulation was checkpointed, after which the previous
conditions would have to hold for the problem to persist.
- Need to verify this, but longer cutoffs and tighter direct sum tolerances
should naturally guard against any problems like this because pair list
omissions at longer range will be less consequential.

Regarding the second bug illustrated in the image I sent out, I would
summarize:

- Do not run with boxes that are only two cells in any one direction until
we find out what's really going on and fix the non-bonded inner loop.
There is something truly strange going on and I am not sure yet what.

With regard to Dan's questions, I will send on the input files to repeat
these experiments this afternoon if people want to see the problems
themselves. Sander, pmemd CPU, and pmemd.CUDA set up the pair list
differently due to the way GPUs work. The current sander pair list is
inspired, if not copied wholly from, Bob Duke's pmemd code, and lays out a
much finer hash table. With pmemd there is a fine-grained aspect of the
hash table, but it's all based on hash cells that are assumed to be (cut +
skinnb) thick in all directions. The assumption is a good choice, we just
need to change the code so that it meets the assumption which is currently
does not. Input traps are not what will save us, edits to the neighbor
list setup on or about line 4200 of gpu.cpp (master branch) and the box
rescaling function on or about line 8400 of gpu.cpp are what is needed. I
think the fix for the second issue seen in the jumps on the plot on this
thread lies in kNLCPNE.h, but I'm well on my way to obliterating that
kernel and other aspects of the pair list building so hopefully there'll be
a whole new set of bugs soon, eh?

Dave

On Wed, Jan 17, 2018 at 10:42 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:

> Hi,
>
> I'm still not sure I understand what the problem is. Right now to me
> it sounds like what we are missing is a bad input trap here (you
> should never be setting the NB skin too small etc).
>
> That said, If I take the parameters you gave in your last example
> (cutoff 7.0 Ang., nonbond skin 0.5001, truncated octahedron with
> lengths = 33.81) and use sander's pairlist setup routine I get 11
> cells in each dimension with individual cell separations of 2.51 Ang..
> With the hardcoded cell search dimension of 3 (at least in sander, not
> sure about pmemd) that means the smallest so-called subcell cutoff is
> 7.53 Ang, which is larger than cut+skinnb, so no interactions should
> be omitted. Maybe I'm missing something though...
>
> > A lot of study, and posts to a subset of the developers, happened before
> I made this thread on the actual listserv.
>
> You've clearly done a lot of work on this which is great. However,
> considering how critical building the pairlist is, we should be
> absolutely certain what the issue is before we go changing test cases
> etc. Could you send me off-list one or two of the problem systems so I
> can run some tests as well? I feel like the more people we throw at
> this issue the quicker it can get fixed.
>
> Thanks!
>
> -Dan
>
>
> On Wed, Jan 17, 2018 at 8:25 AM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
> > Hi Dave,
> >
> > This is pretty dense so I didn't follow everything. Do you have a simple
> list of tests that show issues / a 3 line summary of all the issues and the
> scope of each?
> >
> > With regards to the very small systems - these have always been
> problematical and would not be recommended.
> > That said the code should quit rather than give incorrect answers if one
> is trying to use a box that is unreasonably small.
> > Should the limit not be the minimum box dimension needs to be greater
> than 2 x (cutoff + skinnb)? I believe the default
> > would be 2.0 for skinnb so in this case 20A making your box too small?
> >
> > What happens if you run with a 21A box? Do you see those jumps go away?
> If yes then we may just need a fix here to
> > limit such tiny boxes. I am assuming the jumps you are referring to are
> for small systems only?
> >
> > All the best
> > Ross
> >
> >
> >
> >> On Jan 16, 2018, at 22:45, David Cerutti <dscerutti.gmail.com> wrote:
> >>
> >> A lot of study, and posts to a subset of the developers, happened
> before I
> >> made this thread on the actual listserv. I was able to make a case, for
> >> example, where the box volume equilibrates under NPT and we have the
> >> problem of shrinking hash cells that eventually get too small for the
> >> cutoff. The problem was fixed for the case of rectangular unit cells,
> but
> >> the non-orthorhombic case isn't correctly covered. The simulation
> settles
> >> out to a volume where each unit cell is in fact less than the cutoff in
> >> thickness, irrespective of any pair list margin, but the code thinks
> it's
> >> much thicker than therefore keeps right on going. I'll make an NVE
> case to
> >> study the energy conservation effects when one of these things happens.
> >>
> >> In the meantime, I've been looking in to the other issue I raised.
> >> Apologies--I had kept the first issue off of the listserv until I had a
> >> better idea of what I was seeing, but I fired off the second issue
> without
> >> as much discussion. Nonetheless, it seems that there is something going
> >> on, although it may not quite be what I had thought. The attached plot
> >> shows the problem. I constructed a cubic box containing 216 TIP3P
> >> waters--familiar to anyone who's ever tiled such a thing using TIP3PBOX
> in
> >> tleap. It's about 18.6A on a side--just big enough to handle an 8A
> cutoff
> >> with a 1A margin. Equilibrating and then simulating that in the NVE
> >> ensemble, I see the energy increase as shown on the attached image. Red
> >> line = old code. Black line = new code. I need to run it further to
> >> verify that the new code isn't any more prone to this than the old code,
> >> but it doesn't appear that may changes to the precision model are doing
> >> anything detrimental. Furthermore, the occurrence and the size of the
> >> jumps are random. Something is weird. I've tried various permutations
> on
> >> this--pushing the cutoff up to 8.7A with a 0.6A skin, dialing up the
> >> reciprocal space settings and cutting the time step down to 0.5fs (from
> >> 1fs) to remove as much noise as possible. I have still seen one of
> these
> >> jumps in that situation, but it's too early as of yet to say whether the
> >> trend has been affected. If I switch to double precision, I can only go
> >> 1/10th as fast, and I haven't YET seen a jump, but I'm only out to 28ns
> so
> >> it might. Some of you may notice that, in between jumps, there is a
> slight
> >> DOWNWARD trend to the energy. If the energy isn't flat, errors of any
> sort
> >> should drive the energy UP, we thought, but there may be something about
> >> the fixed precision rounding that imparts an overall tendency to lose
> >> energy at a very slow rate (not sure on that yet, but the effect is
> real).
> >> I tried making a larger box, one which will have three hash cells in any
> >> direction, and so far I haven't seen any jumps in the energy but there
> is
> >> still that slight downward pressure on the overall energy. If I then
> take
> >> that larger box and increase the cutoff, so that there are now only
> 2x2x2
> >> hash cells once again (but each o them is highly oversized), I am not
> (yet)
> >> seeing any jumps of this sort, but that kind of jibes with my original
> >> hypothesis given what I know about how the pairlist works. You're
> going to
> >> see this problem when the box is really small (< 20 thick) and only two
> >> hash cells thick.
> >>
> >> By tomorrow I'll hopefully have more investigation of the first problem
> I
> >> brought up. For now, it looks like I wasn't crying wolf over the second
> >> one.
> >>
> >> Dave
> >>
> >>
> >> On Tue, Jan 16, 2018 at 10:57 AM, Daniel Roe <daniel.r.roe.gmail.com>
> wrote:
> >>
> >>> Hi,
> >>>
> >>> On Mon, Jan 15, 2018 at 10:11 PM, David Cerutti <dscerutti.gmail.com>
> >>> wrote:
> >>>> What's happening is that, for non-orthogonal unit cells (and the
> >>>> only case the tleap really knows how to construct is the truncated
> >>>> octahedron, translating it into a parallelepiped with all angles at
> >>>> 109.47), sander and pmemd do not know how to properly gauge the
> distance
> >>>> between the box faces, and therefore the number of hash cells to
> create
> >>>> when building their pair lists.
> >>>
> >>> I don't think you are right about this. Sander (and cpptraj and I
> >>> think pmemd as well) determines the number of grid cells in each
> >>> dimension using the length of the *reciprocal* cell vectors, which is
> >>> the inverse of the fractional coordinate system. Because this
> >>> determination is done in reciprocal (fractional) space, it doesn't
> >>> matter if the unit cell is orthorhombic or not. All of the "gridding"
> >>> during the pair list build is also done in fractional space, so I
> >>> think the pair list builds are fine. Maybe I'm not understanding what
> >>> you're saying the problem is though. Do you have a specific test case
> >>> where the pair list is not properly set up (i.e. an atom that is
> >>> within the cutoff of another atom is skipped)?
> >>>
> >>> -Dan
> >>>
> >>> --
> >>> -------------------------
> >>> Daniel R. Roe
> >>> Laboratory of Computational Biology
> >>> National Institutes of Health, NHLBI
> >>> 5635 Fishers Ln, Rm T900
> >>> Rockville MD, 20852
> >>> https://www.lobos.nih.gov/lcb
> >>>
> >>> _______________________________________________
> >>> AMBER-Developers mailing list
> >>> AMBER-Developers.ambermd.org
> >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>>
> >> <EnergyGain.tif>_______________________________________________
> >> AMBER-Developers mailing list
> >> AMBER-Developers.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >
> >
> > _______________________________________________
> > AMBER-Developers mailing list
> > AMBER-Developers.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>
>
>
> --
> -------------------------
> Daniel R. Roe
> Laboratory of Computational Biology
> National Institutes of Health, NHLBI
> 5635 Fishers Ln, Rm T900
> Rockville MD, 20852
> https://www.lobos.nih.gov/lcb
>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Jan 17 2018 - 09:30:02 PST