Re: [AMBER-Developers] Please read: we need to look at pmemd.cuda pair list fidelity

From: David Cerutti <dscerutti.gmail.com>
Date: Wed, 17 Jan 2018 04:43:54 -0500

Breathe a bit easier, Amberites. But we're not in the clear yet.

I'll be doing more investigation into both issues, but I've been able to
expose the vulnerability in an octahedral water cell. Using Amber16, I
constructed and more or less equilibrated a unit cell of 980 water
molecules, which gives me edge lengths of 33.81A. The magic number I wanted
when dividing the box length by 1.125 (the cut_factor for octahedral cells
defined in the Fortran code) was just over 30.0, and 33.81 / 1.125 gives me 30.05.
Bingo. Then, I defined a simulation with a cutoff of 7.0A, using fswitch
6.0 to tamp down on the vdW truncation problem (it smoothly switches the
force to zero to improve energy conservation--please someone correct me if
that assumption is wrong). I defined the pair list margin at 0.5001A, the
smallest that pmemd will accept. This gets me JUST under the threshold for
having a 4x4x4 hash cell table. If I increase the pair list margin at all,
I get a 3x3x3 hash cell table. By my calculations, the 4x4x4 hash cell
table is ACTUALLY violating the 7A cutoff slightly from the start (the code
thinks that it's got 30A of box thickness to work with, but in reality it's
only got 27.6), and any movement of particles before the next pair list
refresh only makes things worse. So, I then expect that running this
simulation with a pair list margin of 0.5001A is going to see it gain
energy due to incomplete interactions, while running it with, say, a 0.52A
pair list margin, which isn't supposed to change the potential at all, is
going to trigger the proper 3x3x3 hash cell table and conserve energy.
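
In case the arithmetic is easier to see in code, here is a quick sketch in
plain Python (nothing lifted from the pmemd source) of how the cell counts
fall out for this box. The sqrt(2/3) factor is just the face-to-face
spacing of a parallelepiped with all angles at 109.47 degrees, and I am
treating the cell count as floor(thickness / (cutoff + margin)), which is
consistent with where the 4-to-3 flip happens:

    import math

    box = 33.81          # edge length of the octahedral cell (A)
    cutoff = 7.0         # direct-space cutoff (A)

    # Thickness the Fortran code assumes: box / 1.125 (the cut_factor)
    assumed_thickness = box / 1.125              # ~30.05 A
    # True perpendicular distance between opposite faces of the cell
    true_thickness = box * math.sqrt(2.0 / 3.0)  # ~27.61 A

    for margin in (0.5001, 0.52):
        ncell = int(assumed_thickness // (cutoff + margin))
        width = true_thickness / ncell
        print("margin %.4f A -> %dx%dx%d cells, true cell width %.2f A"
              % (margin, ncell, ncell, ncell, width))

The first case comes out to 4x4x4 cells that are really only 6.9A thick;
the second comes out to 3x3x3 cells 9.2A thick.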

I'm using a 2fs time step to make this go faster, but all the water is
rigid in any case. Over the course of 10ns, I've watched the system with
the 0.5001A pair list margin and 4x4x4 hash cell table gain energy at a
steady rate, heating by about 1 degree per ns--pretty high. In contrast, the same
simulation run with a 0.5200A pair list margin and 3x3x3 hash cell table is
conserving energy reasonably well (still a gain, but only at about 3% of
the 4x4x4 simulation's rate).

The bug is real! But, I did some fairly aggressive things in this test to
expose it. One thing that may be saving us from much worse happening is
that the initial cell count seems to be calculated a bit differently than
the pair list margin itself. The initial cell count is not quite right,
but the error there is not as bad as the error in the pair list margin that
the GPU code then recalculates. Notice how the code tips from 4x4x4 to
3x3x3 when cutoff + margin crosses 7.502A--in reality it should have
crossed that mark at 6.9A, but even after that mistake the code then
recalculates the pair list margin to
be 1.29A (I reached into the code to pull out that number). So even though
it would've said "too much for 4x4x4 cells" if I upped the margin to 0.52A,
it goes back and calculates that it's somehow got 1.29A to work with. So
the initial misstep in determining the true cell count is only a fraction
as bad as the misstep of letting the pair list endure long after it's
expired. If one takes the default pair list margin, 2A, it's actually much
harder to hit problems. That pair list margin is a considerable fraction
of the usual cutoff--so it's hard to get into a situation where the
mistaken hash cell thickness calculation leaves the pair list incomplete
from the beginning. In the vast majority of these
runs, it's probably been the case that the pair list is running beyond its
expiration, but the random walks being done by all particles over greater
distances and the need for several particles to come together independently
to create a violation greatly reduce the chances of an actual pair list
omission. In the 3x3x3 cell case, the pair list margin is
huge (probably 4-5A), and it's still being overestimated, meaning that the
pair list is NOT being refreshed frequently enough, but the energy gain is
much lower because the frequency of actual pair list omissions is small.
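
For reference, the bookkeeping I mean is the usual Verlet-list rule, which
I'll sketch in plain Python below (this is the generic criterion, not the
exact pmemd.cuda implementation): the list is guaranteed valid only while
no particle has drifted more than half the margin since the last build, so
an inflated margin directly becomes a pair list that outlives its
expiration.

    import numpy as np

    def pair_list_expired(xyz_now, xyz_at_build, margin):
        """Generic Verlet-list test: once any particle has moved more than
        half the margin since the last build, two particles could have
        closed on each other by more than the full margin, and pairs now
        inside the cutoff may be missing from the list."""
        disp = np.linalg.norm(xyz_now - xyz_at_build, axis=1)
        return disp.max() > 0.5 * margin

    # If the effective margin is overestimated (1.29A instead of ~0.5A, or
    # a true margin that is actually negative), this test fires far too
    # late, and the cutoff is silently violated in the meantime.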

The only other situation that I bet can get us in trouble is when the
simulation starts out significantly bigger than it ends up, and people
think they're OK because pmemd didn't crash. It's times like this when
pmemd's calculated effective pair list margin can again get really small,
which can let the real margin dip into negative territory, and we're right
back to the 4x4x4 hash cell case above.

In summary:
- The bug is real and must be fixed.
- The impact on simulations may be felt when users went into &ewald and
changed skinnb to a small value (unlikely) AND had just the right (or wrong)
ratio of (box length) to (cut + skinnb) to create a problem (if you changed
skinnb I'd say you then had about the odds of a Russian Roulette player).
A quick screen for that ratio is sketched after this list.
- If pmemd.cuda was used to equilibrate the volume, this problem could have
persisted until the simulation was checkpointed, after which the conditions
above would have to hold for the problem to continue.
- Need to verify this, but longer cutoffs and tighter direct sum tolerances
should naturally guard against any problems like this because pair list
omissions at longer range will be less consequential.
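
For anyone who wants to eyeball whether an existing truncated octahedron
run sat in that danger zone, the quick screen I have in mind is below
(again just a sketch in Python; the function is mine, not anything in
pmemd, and it reuses the 1.125 and sqrt(2/3) factors from above):

    import math

    def octahedral_margin_check(box_edge, cutoff, skinnb=2.0):
        """Estimate the hash cell count and the margin actually available
        per cell, given that the cells are sized from an inflated box
        thickness (box_edge / 1.125)."""
        assumed = box_edge / 1.125              # thickness the code believes in
        true = box_edge * math.sqrt(2.0 / 3.0)  # actual face-to-face distance
        ncell = max(int(assumed // (cutoff + skinnb)), 1)
        real_margin = true / ncell - cutoff     # negative = cutoff violated outright
        return ncell, real_margin

    # The aggressive test above: 4 cells and a real margin of about -0.10A.
    print(octahedral_margin_check(33.81, 7.0, 0.5001))
    # The default 2A skin on the same box: 3 cells with about 2.2A to spare.
    print(octahedral_margin_check(33.81, 7.0, 2.0))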

Now, on to the second issue, which is again a niche sort of thing but
rather quixotic and therefore worrisome.

Dave


On Tue, Jan 16, 2018 at 10:45 PM, David Cerutti <dscerutti.gmail.com> wrote:

> A lot of study, and posts to a subset of the developers, happened before I
> made this thread on the actual listserv. I was able to make a case, for
> example, where the box volume equilibrates under NPT and we have the
> problem of shrinking hash cells that eventually get too small for the
> cutoff. The problem was fixed for the case of rectangular unit cells, but
> the non-orthorhombic case isn't correctly covered. The simulation settles
> out to a volume where each hash cell is in fact less than the cutoff in
> thickness, irrespective of any pair list margin, but the code thinks it's
> much thicker and therefore keeps right on going. I'll make an NVE case to
> study the energy conservation effects when one of these things happens.
>
> In the meantime, I've been looking into the other issue I raised.
> Apologies--I had kept the first issue off of the listserv until I had a
> better idea of what I was seeing, but I fired off the second issue without
> as much discussion. Nonetheless, it seems that there is something going
> on, although it may not quite be what I had thought. The attached plot
> shows the problem. I constructed a cubic box containing 216 TIP3P
> waters--familiar to anyone who's ever tiled such a thing using TIP3PBOX in
> tleap. It's about 18.6A on a side--just big enough to handle an 8A cutoff
> with a 1A margin. Equilibrating and then simulating that in the NVE
> ensemble, I see the energy increase as shown on the attached image. Red
> line = old code. Black line = new code. I need to run it further to
> verify that the new code isn't any more prone to this than the old code,
> but it doesn't appear that my changes to the precision model are doing
> anything detrimental. Furthermore, the occurrence and the size of the
> jumps are random. Something is weird. I've tried various permutations on
> this--pushing the cutoff up to 8.7A with a 0.6A skin, dialing up the
> reciprocal space settings and cutting the time step down to 0.5fs (from
> 1fs) to remove as much noise as possible. I have still seen one of these
> jumps in that situation, but it's too early yet to say whether the
> trend has been affected. If I switch to double precision, I can only go
> 1/10th as fast, and I haven't YET seen a jump, but I'm only out to 28ns so
> it might. Some of you may notice that, in between jumps, there is a slight
> DOWNWARD trend to the energy. If the energy isn't flat, errors of any sort
> should drive the energy UP, we thought, but there may be something about
> the fixed precision rounding that imparts an overall tendency to lose
> energy at a very slow rate (not sure on that yet, but the effect is real).
> I tried making a larger box, one which will have three hash cells in any
> direction, and so far I haven't seen any jumps in the energy but there is
> still that slight downward pressure on the overall energy. If I then take
> that larger box and increase the cutoff, so that there are now only 2x2x2
> hash cells once again (but each of them is highly oversized), I am not (yet)
> seeing any jumps of this sort, but that kind of jibes with my original
> hypothesis given what I know about how the pairlist works. You're going to
> see this problem when the box is really small (< 20A thick) and only two
> hash cells thick.
>
> By tomorrow I'll hopefully have more investigation of the first problem I
> brought up. For now, it looks like I wasn't crying wolf over the second
> one.
>
> Dave
>
>
> On Tue, Jan 16, 2018 at 10:57 AM, Daniel Roe <daniel.r.roe.gmail.com>
> wrote:
>
>> Hi,
>>
>> On Mon, Jan 15, 2018 at 10:11 PM, David Cerutti <dscerutti.gmail.com>
>> wrote:
>> > What's happening is that, for non-orthogonal unit cells (and the
>> > only case that tleap really knows how to construct is the truncated
>> > octahedron, translating it into a parallelepiped with all angles at
>> > 109.47), sander and pmemd do not know how to properly gauge the distance
>> > between the box faces, and therefore the number of hash cells to create
>> > when building their pair lists.
>>
>> I don't think you are right about this. Sander (and cpptraj and I
>> think pmemd as well) determines the number of grid cells in each
>> dimension using the length of the *reciprocal* cell vectors, which is
>> the inverse of the fractional coordinate system. Because this
>> determination is done in reciprocal (fractional) space, it doesn't
>> matter if the unit cell is orthorhombic or not. All of the "gridding"
>> during the pair list build is also done in fractional space, so I
>> think the pair list builds are fine. Maybe I'm not understanding what
>> you're saying the problem is though. Do you have a specific test case
>> where the pair list is not properly set up (i.e. an atom that is
>> within the cutoff of another atom is skipped)?
>>
>> -Dan
>>
>> --
>> -------------------------
>> Daniel R. Roe
>> Laboratory of Computational Biology
>> National Institutes of Health, NHLBI
>> 5635 Fishers Ln, Rm T900
>> Rockville MD, 20852
>> https://www.lobos.nih.gov/lcb
>>
>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Jan 17 2018 - 02:00:03 PST