Re: [AMBER-Developers] Please read: we need to look at pmemd.cuda pair list fidelity

From: David Cerutti <dscerutti.gmail.com>
Date: Tue, 13 Feb 2018 12:34:59 -0500

Sorry this took so long. Here is the box of 980 waters whose real
thickness is just over 30A. The various gpu[ABCD] folders contain
different run scripts that give different results. gpuC is an example of
settings that will create a problem.

Dave


On Tue, Feb 13, 2018 at 8:49 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:

> Hi Dave,
>
> Can you please send those test systems and inputs so I can try to
> reproduce the pairlist problems? Thanks!
>
> -Dan
>
> On Wed, Jan 17, 2018 at 12:08 PM, David Cerutti <dscerutti.gmail.com>
> wrote:
> > Three line summary of the problem, from users' perspective, is given at
> the
> > end of my last message.
> >
> > - The impact on simulations may be felt when users got into &ewald and
> > change skinnb to a small value (unlikely) AND had just the right (or
> wrong)
> > ratio of (box length) to (cut + skinnb) to create a problem (if you
> changed
> > skinnb I'd say you then had about the odds of a Russian Roulette player).
> > - If pmemd.cuda was used to equilibrate the volume, this problem could
> have
> > persisted until the simulation was checkpointed, after which the previous
> > conditions would have to hold for the problem to persist.
> > - Need to verify this, but longer cutoffs and tighter direct sum
> tolerances
> > should naturally guard against any problems like this because pair list
> > omissions at longer range will be less consequential.
> >
> > Regarding the second bug illustrated in the image I sent out, I would
> > summarize:
> >
> > - Do not run with boxes that are only two cells in any one direction
> until
> > we find out what's really going on and fix the non-bonded inner loop.
> > There is something truly strange going on and I am not sure yet what.
> >
> > With regard to Dan's questions, I will send on the input files to repeat
> > these experiments this afternoon if people want to see the problems
> > themselves. Sander, pmemd CPU, and pmemd.CUDA set up the pair list
> > differently due to the way GPUs work. The current sander pair list is
> > inspired, if not copied wholly from, Bob Duke's pmemd code, and lays out
> a
> > much finer hash table. With pmemd there is a fine-grained aspect of the
> > hash table, but it's all based on hash cells that are assumed to be (cut
> +
> > skinnb) thick in all directions. The assumption is a good choice, we
> just
> > need to change the code so that it meets the assumption which is
> currently
> > does not. Input traps are not what will save us, edits to the neighbor
> > list setup on or about line 4200 of gpu.cpp (master branch) and the box
> > rescaling function on or about line 8400 of gpu.cpp are what is needed.
> I
> > think the fix for the second issue seen in the jumps on the plot on this
> > thread lies in kNLCPNE.h, but I'm well on my way to obliterating that
> > kernel and other aspects of the pair list building so hopefully there'll
> be
> > a whole new set of bugs soon, eh?
> >
> > Dave
> >
> >
> > On Wed, Jan 17, 2018 at 10:42 AM, Daniel Roe <daniel.r.roe.gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> I'm still not sure I understand what the problem is. Right now to me
> >> it sounds like what we are missing is a bad input trap here (you
> >> should never be setting the NB skin too small etc).
> >>
> >> That said, If I take the parameters you gave in your last example
> >> (cutoff 7.0 Ang., nonbond skin 0.5001, truncated octahedron with
> >> lengths = 33.81) and use sander's pairlist setup routine I get 11
> >> cells in each dimension with individual cell separations of 2.51 Ang..
> >> With the hardcoded cell search dimension of 3 (at least in sander, not
> >> sure about pmemd) that means the smallest so-called subcell cutoff is
> >> 7.53 Ang, which is larger than cut+skinnb, so no interactions should
> >> be omitted. Maybe I'm missing something though...
> >>
> >> > A lot of study, and posts to a subset of the developers, happened
> before
> >> I made this thread on the actual listserv.
> >>
> >> You've clearly done a lot of work on this which is great. However,
> >> considering how critical building the pairlist is, we should be
> >> absolutely certain what the issue is before we go changing test cases
> >> etc. Could you send me off-list one or two of the problem systems so I
> >> can run some tests as well? I feel like the more people we throw at
> >> this issue the quicker it can get fixed.
> >>
> >> Thanks!
> >>
> >> -Dan
> >>
> >>
> >> On Wed, Jan 17, 2018 at 8:25 AM, Ross Walker <ross.rosswalker.co.uk>
> >> wrote:
> >> > Hi Dave,
> >> >
> >> > This is pretty dense so I didn't follow everything. Do you have a
> simple
> >> list of tests that show issues / a 3 line summary of all the issues and
> the
> >> scope of each?
> >> >
> >> > With regards to the very small systems - these have always been
> >> problematical and would not be recommended.
> >> > That said the code should quit rather than give incorrect answers if
> one
> >> is trying to use a box that is unreasonably small.
> >> > Should the limit not be the minimum box dimension needs to be greater
> >> than 2 x (cutoff + skinnb)? I believe the default
> >> > would be 2.0 for skinnb so in this case 20A making your box too small?
> >> >
> >> > What happens if you run with a 21A box? Do you see those jumps go
> away?
> >> If yes then we may just need a fix here to
> >> > limit such tiny boxes. I am assuming the jumps you are referring to
> are
> >> for small systems only?
> >> >
> >> > All the best
> >> > Ross
> >> >
> >> >
> >> >
> >> >> On Jan 16, 2018, at 22:45, David Cerutti <dscerutti.gmail.com>
> wrote:
> >> >>
> >> >> A lot of study, and posts to a subset of the developers, happened
> >> before I
> >> >> made this thread on the actual listserv. I was able to make a case,
> for
> >> >> example, where the box volume equilibrates under NPT and we have the
> >> >> problem of shrinking hash cells that eventually get too small for the
> >> >> cutoff. The problem was fixed for the case of rectangular unit
> cells,
> >> but
> >> >> the non-orthorhombic case isn't correctly covered. The simulation
> >> settles
> >> >> out to a volume where each unit cell is in fact less than the cutoff
> in
> >> >> thickness, irrespective of any pair list margin, but the code thinks
> >> it's
> >> >> much thicker than therefore keeps right on going. I'll make an NVE
> >> case to
> >> >> study the energy conservation effects when one of these things
> happens.
> >> >>
> >> >> In the meantime, I've been looking in to the other issue I raised.
> >> >> Apologies--I had kept the first issue off of the listserv until I
> had a
> >> >> better idea of what I was seeing, but I fired off the second issue
> >> without
> >> >> as much discussion. Nonetheless, it seems that there is something
> going
> >> >> on, although it may not quite be what I had thought. The attached
> plot
> >> >> shows the problem. I constructed a cubic box containing 216 TIP3P
> >> >> waters--familiar to anyone who's ever tiled such a thing using
> TIP3PBOX
> >> in
> >> >> tleap. It's about 18.6A on a side--just big enough to handle an 8A
> >> cutoff
> >> >> with a 1A margin. Equilibrating and then simulating that in the NVE
> >> >> ensemble, I see the energy increase as shown on the attached image.
> Red
> >> >> line = old code. Black line = new code. I need to run it further to
> >> >> verify that the new code isn't any more prone to this than the old
> code,
> >> >> but it doesn't appear that may changes to the precision model are
> doing
> >> >> anything detrimental. Furthermore, the occurrence and the size of
> the
> >> >> jumps are random. Something is weird. I've tried various
> permutations
> >> on
> >> >> this--pushing the cutoff up to 8.7A with a 0.6A skin, dialing up the
> >> >> reciprocal space settings and cutting the time step down to 0.5fs
> (from
> >> >> 1fs) to remove as much noise as possible. I have still seen one of
> >> these
> >> >> jumps in that situation, but it's too early as of yet to say whether
> the
> >> >> trend has been affected. If I switch to double precision, I can
> only go
> >> >> 1/10th as fast, and I haven't YET seen a jump, but I'm only out to
> 28ns
> >> so
> >> >> it might. Some of you may notice that, in between jumps, there is a
> >> slight
> >> >> DOWNWARD trend to the energy. If the energy isn't flat, errors of
> any
> >> sort
> >> >> should drive the energy UP, we thought, but there may be something
> about
> >> >> the fixed precision rounding that imparts an overall tendency to lose
> >> >> energy at a very slow rate (not sure on that yet, but the effect is
> >> real).
> >> >> I tried making a larger box, one which will have three hash cells in
> any
> >> >> direction, and so far I haven't seen any jumps in the energy but
> there
> >> is
> >> >> still that slight downward pressure on the overall energy. If I then
> >> take
> >> >> that larger box and increase the cutoff, so that there are now only
> >> 2x2x2
> >> >> hash cells once again (but each o them is highly oversized), I am not
> >> (yet)
> >> >> seeing any jumps of this sort, but that kind of jibes with my
> original
> >> >> hypothesis given what I know about how the pairlist works. You're
> >> going to
> >> >> see this problem when the box is really small (< 20 thick) and only
> two
> >> >> hash cells thick.
> >> >>
> >> >> By tomorrow I'll hopefully have more investigation of the first
> problem
> >> I
> >> >> brought up. For now, it looks like I wasn't crying wolf over the
> second
> >> >> one.
> >> >>
> >> >> Dave
> >> >>
> >> >>
> >> >> On Tue, Jan 16, 2018 at 10:57 AM, Daniel Roe <daniel.r.roe.gmail.com
> >
> >> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> On Mon, Jan 15, 2018 at 10:11 PM, David Cerutti <
> dscerutti.gmail.com>
> >> >>> wrote:
> >> >>>> What's happening is that, for non-orthogonal unit cells (and the
> >> >>>> only case the tleap really knows how to construct is the truncated
> >> >>>> octahedron, translating it into a parallelepiped with all angles at
> >> >>>> 109.47), sander and pmemd do not know how to properly gauge the
> >> distance
> >> >>>> between the box faces, and therefore the number of hash cells to
> >> create
> >> >>>> when building their pair lists.
> >> >>>
> >> >>> I don't think you are right about this. Sander (and cpptraj and I
> >> >>> think pmemd as well) determines the number of grid cells in each
> >> >>> dimension using the length of the *reciprocal* cell vectors, which
> is
> >> >>> the inverse of the fractional coordinate system. Because this
> >> >>> determination is done in reciprocal (fractional) space, it doesn't
> >> >>> matter if the unit cell is orthorhombic or not. All of the
> "gridding"
> >> >>> during the pair list build is also done in fractional space, so I
> >> >>> think the pair list builds are fine. Maybe I'm not understanding
> what
> >> >>> you're saying the problem is though. Do you have a specific test
> case
> >> >>> where the pair list is not properly set up (i.e. an atom that is
> >> >>> within the cutoff of another atom is skipped)?
> >> >>>
> >> >>> -Dan
> >> >>>
> >> >>> --
> >> >>> -------------------------
> >> >>> Daniel R. Roe
> >> >>> Laboratory of Computational Biology
> >> >>> National Institutes of Health, NHLBI
> >> >>> 5635 Fishers Ln, Rm T900
> >> >>> Rockville MD, 20852
> >> >>> https://www.lobos.nih.gov/lcb
> >> >>>
> >> >>> _______________________________________________
> >> >>> AMBER-Developers mailing list
> >> >>> AMBER-Developers.ambermd.org
> >> >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >> >>>
> >> >> <EnergyGain.tif>_______________________________________________
> >> >> AMBER-Developers mailing list
> >> >> AMBER-Developers.ambermd.org
> >> >> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >> >
> >> >
> >> > _______________________________________________
> >> > AMBER-Developers mailing list
> >> > AMBER-Developers.ambermd.org
> >> > http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>
> >>
> >>
> >> --
> >> -------------------------
> >> Daniel R. Roe
> >> Laboratory of Computational Biology
> >> National Institutes of Health, NHLBI
> >> 5635 Fishers Ln, Rm T900
> >> Rockville MD, 20852
> >> https://www.lobos.nih.gov/lcb
> >>
> >> _______________________________________________
> >> AMBER-Developers mailing list
> >> AMBER-Developers.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>
> > _______________________________________________
> > AMBER-Developers mailing list
> > AMBER-Developers.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>
>
>
> --
> -------------------------
> Daniel R. Roe
> Laboratory of Computational Biology
> National Institutes of Health, NHLBI
> 5635 Fishers Ln, Rm T900
> Rockville MD, 20852
> https://www.lobos.nih.gov/lcb
>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>


_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers


Received on Tue Feb 13 2018 - 10:00:03 PST
Custom Search