Re: [AMBER-Developers] Please read: we need to look at pmemd.cuda pair list fidelity

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Tue, 13 Feb 2018 08:49:24 -0500

Hi Dave,

Can you please send those test systems and inputs so I can try to
reproduce the pairlist problems? Thanks!

-Dan

On Wed, Jan 17, 2018 at 12:08 PM, David Cerutti <dscerutti.gmail.com> wrote:
> Three line summary of the problem, from users' perspective, is given at the
> end of my last message.
>
> - The impact on simulations may be felt when users got into &ewald and
> change skinnb to a small value (unlikely) AND had just the right (or wrong)
> ratio of (box length) to (cut + skinnb) to create a problem (if you changed
> skinnb I'd say you then had about the odds of a Russian Roulette player).
> - If pmemd.cuda was used to equilibrate the volume, this problem could have
> persisted until the simulation was checkpointed, after which the previous
> conditions would have to hold for the problem to persist.
> - Need to verify this, but longer cutoffs and tighter direct sum tolerances
> should naturally guard against any problems like this because pair list
> omissions at longer range will be less consequential.
>
> Regarding the second bug illustrated in the image I sent out, I would
> summarize:
>
> - Do not run with boxes that are only two cells in any one direction until
> we find out what's really going on and fix the non-bonded inner loop.
> There is something truly strange going on and I am not sure yet what.
>
> With regard to Dan's questions, I will send on the input files to repeat
> these experiments this afternoon if people want to see the problems
> themselves. Sander, pmemd CPU, and pmemd.CUDA set up the pair list
> differently due to the way GPUs work. The current sander pair list is
> inspired, if not copied wholly from, Bob Duke's pmemd code, and lays out a
> much finer hash table. With pmemd there is a fine-grained aspect of the
> hash table, but it's all based on hash cells that are assumed to be (cut +
> skinnb) thick in all directions. The assumption is a good choice, we just
> need to change the code so that it meets the assumption which is currently
> does not. Input traps are not what will save us, edits to the neighbor
> list setup on or about line 4200 of gpu.cpp (master branch) and the box
> rescaling function on or about line 8400 of gpu.cpp are what is needed. I
> think the fix for the second issue seen in the jumps on the plot on this
> thread lies in kNLCPNE.h, but I'm well on my way to obliterating that
> kernel and other aspects of the pair list building so hopefully there'll be
> a whole new set of bugs soon, eh?
>
> Dave
>
>
> On Wed, Jan 17, 2018 at 10:42 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:
>
>> Hi,
>>
>> I'm still not sure I understand what the problem is. Right now to me
>> it sounds like what we are missing is a bad input trap here (you
>> should never be setting the NB skin too small etc).
>>
>> That said, If I take the parameters you gave in your last example
>> (cutoff 7.0 Ang., nonbond skin 0.5001, truncated octahedron with
>> lengths = 33.81) and use sander's pairlist setup routine I get 11
>> cells in each dimension with individual cell separations of 2.51 Ang..
>> With the hardcoded cell search dimension of 3 (at least in sander, not
>> sure about pmemd) that means the smallest so-called subcell cutoff is
>> 7.53 Ang, which is larger than cut+skinnb, so no interactions should
>> be omitted. Maybe I'm missing something though...
>>
>> > A lot of study, and posts to a subset of the developers, happened before
>> I made this thread on the actual listserv.
>>
>> You've clearly done a lot of work on this which is great. However,
>> considering how critical building the pairlist is, we should be
>> absolutely certain what the issue is before we go changing test cases
>> etc. Could you send me off-list one or two of the problem systems so I
>> can run some tests as well? I feel like the more people we throw at
>> this issue the quicker it can get fixed.
>>
>> Thanks!
>>
>> -Dan
>>
>>
>> On Wed, Jan 17, 2018 at 8:25 AM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>> > Hi Dave,
>> >
>> > This is pretty dense so I didn't follow everything. Do you have a simple
>> list of tests that show issues / a 3 line summary of all the issues and the
>> scope of each?
>> >
>> > With regards to the very small systems - these have always been
>> problematical and would not be recommended.
>> > That said the code should quit rather than give incorrect answers if one
>> is trying to use a box that is unreasonably small.
>> > Should the limit not be the minimum box dimension needs to be greater
>> than 2 x (cutoff + skinnb)? I believe the default
>> > would be 2.0 for skinnb so in this case 20A making your box too small?
>> >
>> > What happens if you run with a 21A box? Do you see those jumps go away?
>> If yes then we may just need a fix here to
>> > limit such tiny boxes. I am assuming the jumps you are referring to are
>> for small systems only?
>> >
>> > All the best
>> > Ross
>> >
>> >
>> >
>> >> On Jan 16, 2018, at 22:45, David Cerutti <dscerutti.gmail.com> wrote:
>> >>
>> >> A lot of study, and posts to a subset of the developers, happened
>> before I
>> >> made this thread on the actual listserv. I was able to make a case, for
>> >> example, where the box volume equilibrates under NPT and we have the
>> >> problem of shrinking hash cells that eventually get too small for the
>> >> cutoff. The problem was fixed for the case of rectangular unit cells,
>> but
>> >> the non-orthorhombic case isn't correctly covered. The simulation
>> settles
>> >> out to a volume where each unit cell is in fact less than the cutoff in
>> >> thickness, irrespective of any pair list margin, but the code thinks
>> it's
>> >> much thicker than therefore keeps right on going. I'll make an NVE
>> case to
>> >> study the energy conservation effects when one of these things happens.
>> >>
>> >> In the meantime, I've been looking in to the other issue I raised.
>> >> Apologies--I had kept the first issue off of the listserv until I had a
>> >> better idea of what I was seeing, but I fired off the second issue
>> without
>> >> as much discussion. Nonetheless, it seems that there is something going
>> >> on, although it may not quite be what I had thought. The attached plot
>> >> shows the problem. I constructed a cubic box containing 216 TIP3P
>> >> waters--familiar to anyone who's ever tiled such a thing using TIP3PBOX
>> in
>> >> tleap. It's about 18.6A on a side--just big enough to handle an 8A
>> cutoff
>> >> with a 1A margin. Equilibrating and then simulating that in the NVE
>> >> ensemble, I see the energy increase as shown on the attached image. Red
>> >> line = old code. Black line = new code. I need to run it further to
>> >> verify that the new code isn't any more prone to this than the old code,
>> >> but it doesn't appear that may changes to the precision model are doing
>> >> anything detrimental. Furthermore, the occurrence and the size of the
>> >> jumps are random. Something is weird. I've tried various permutations
>> on
>> >> this--pushing the cutoff up to 8.7A with a 0.6A skin, dialing up the
>> >> reciprocal space settings and cutting the time step down to 0.5fs (from
>> >> 1fs) to remove as much noise as possible. I have still seen one of
>> these
>> >> jumps in that situation, but it's too early as of yet to say whether the
>> >> trend has been affected. If I switch to double precision, I can only go
>> >> 1/10th as fast, and I haven't YET seen a jump, but I'm only out to 28ns
>> so
>> >> it might. Some of you may notice that, in between jumps, there is a
>> slight
>> >> DOWNWARD trend to the energy. If the energy isn't flat, errors of any
>> sort
>> >> should drive the energy UP, we thought, but there may be something about
>> >> the fixed precision rounding that imparts an overall tendency to lose
>> >> energy at a very slow rate (not sure on that yet, but the effect is
>> real).
>> >> I tried making a larger box, one which will have three hash cells in any
>> >> direction, and so far I haven't seen any jumps in the energy but there
>> is
>> >> still that slight downward pressure on the overall energy. If I then
>> take
>> >> that larger box and increase the cutoff, so that there are now only
>> 2x2x2
>> >> hash cells once again (but each o them is highly oversized), I am not
>> (yet)
>> >> seeing any jumps of this sort, but that kind of jibes with my original
>> >> hypothesis given what I know about how the pairlist works. You're
>> going to
>> >> see this problem when the box is really small (< 20 thick) and only two
>> >> hash cells thick.
>> >>
>> >> By tomorrow I'll hopefully have more investigation of the first problem
>> I
>> >> brought up. For now, it looks like I wasn't crying wolf over the second
>> >> one.
>> >>
>> >> Dave
>> >>
>> >>
>> >> On Tue, Jan 16, 2018 at 10:57 AM, Daniel Roe <daniel.r.roe.gmail.com>
>> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> On Mon, Jan 15, 2018 at 10:11 PM, David Cerutti <dscerutti.gmail.com>
>> >>> wrote:
>> >>>> What's happening is that, for non-orthogonal unit cells (and the
>> >>>> only case the tleap really knows how to construct is the truncated
>> >>>> octahedron, translating it into a parallelepiped with all angles at
>> >>>> 109.47), sander and pmemd do not know how to properly gauge the
>> distance
>> >>>> between the box faces, and therefore the number of hash cells to
>> create
>> >>>> when building their pair lists.
>> >>>
>> >>> I don't think you are right about this. Sander (and cpptraj and I
>> >>> think pmemd as well) determines the number of grid cells in each
>> >>> dimension using the length of the *reciprocal* cell vectors, which is
>> >>> the inverse of the fractional coordinate system. Because this
>> >>> determination is done in reciprocal (fractional) space, it doesn't
>> >>> matter if the unit cell is orthorhombic or not. All of the "gridding"
>> >>> during the pair list build is also done in fractional space, so I
>> >>> think the pair list builds are fine. Maybe I'm not understanding what
>> >>> you're saying the problem is though. Do you have a specific test case
>> >>> where the pair list is not properly set up (i.e. an atom that is
>> >>> within the cutoff of another atom is skipped)?
>> >>>
>> >>> -Dan
>> >>>
>> >>> --
>> >>> -------------------------
>> >>> Daniel R. Roe
>> >>> Laboratory of Computational Biology
>> >>> National Institutes of Health, NHLBI
>> >>> 5635 Fishers Ln, Rm T900
>> >>> Rockville MD, 20852
>> >>> https://www.lobos.nih.gov/lcb
>> >>>
>> >>> _______________________________________________
>> >>> AMBER-Developers mailing list
>> >>> AMBER-Developers.ambermd.org
>> >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>> >>>
>> >> <EnergyGain.tif>_______________________________________________
>> >> AMBER-Developers mailing list
>> >> AMBER-Developers.ambermd.org
>> >> http://lists.ambermd.org/mailman/listinfo/amber-developers
>> >
>> >
>> > _______________________________________________
>> > AMBER-Developers mailing list
>> > AMBER-Developers.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>>
>>
>>
>> --
>> -------------------------
>> Daniel R. Roe
>> Laboratory of Computational Biology
>> National Institutes of Health, NHLBI
>> 5635 Fishers Ln, Rm T900
>> Rockville MD, 20852
>> https://www.lobos.nih.gov/lcb
>>
>> _______________________________________________
>> AMBER-Developers mailing list
>> AMBER-Developers.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers



-- 
-------------------------
Daniel R. Roe
Laboratory of Computational Biology
National Institutes of Health, NHLBI
5635 Fishers Ln, Rm T900
Rockville MD, 20852
https://www.lobos.nih.gov/lcb
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Tue Feb 13 2018 - 06:00:05 PST
Custom Search