Re: [AMBER-Developers] Zero VDW radii causing havoc. from Scott Le Grand on 2013-08-17 (Amber Developers Archive Aug 2013)

From: Scott Le Grand <varelse2005.gmail.com>
Date: Sat, 17 Aug 2013 11:45:16 -0700

And I have reproed this on the CPU...

vlimit exceeded for step 0; vmax = 4745.1074

NSTEP = 1 TIME(PS) = 10706.752 TEMP(K) = NaN PRESS
=********
Etot = NaN EKtot = NaN
EPtot = 236869.95967510639457
BOND = 8.22942084594198 ANGLE = 18.89573879915813
DIHED = 7.22106560685035
1-4 NB = 7.67373460336594 1-4 EEL = 253.62692486064071
VDWAALS = 246319.07640994922258
EELEC = -9744.76361955880566 EHBOND = 0.00000000000000
RESTRAINT = 0.00000000000000
EKCMT = 967.38657273648380 VIRIAL = -1491089.48552867863327
VOLUME = 24328.24718473819667
Density =
0.9956
Ewald error estimate: 0.9828E-04
------------------------------------------------------------------------------

Repro data now attached to bug 180. The reason you haven't seen this
before is that the CPU behavior is to seg fault. Don't tell me you've
never seen PMEMD seg fault :-)...

This is what the same $!#%ing looks like on the CPU:

slegrand.amber:~/pmemd/src/b1$ ../pmemd -O
Segmentation fault
slegrand.amber:~/pmemd/src/b1$

On Sat, Aug 17, 2013 at 10:30 AM, Scott Le Grand <varelse2005.gmail.com>wrote:

> PS one more possibility - netfrc matters for small systems running NTP but
> somewhere along the line its impact is lost in the noise of larger
> systems...
>
>
>
>
> On Sat, Aug 17, 2013 at 10:04 AM, Scott Le Grand <varelse2005.gmail.com>wrote:
>
>> So I have root caused why the nans appear: Fast Shake gets overwhelmed.
>>
>> At the point of failure, feed these numbers into fast shake:
>>
>> x0(1, atm_1) = 7.28912097579671414138;
>> x0(1, atm_2) = 7.72543530415168433478;
>> x0(1, atm_3) = 6.44526356757315177504;
>>
>> x0(2, atm_1) = 0.59867993601737978793;
>> x0(2, atm_2) = 28.86782004149711866603 - 29.0969602342;
>> x0(2, atm_3) = 0.53986248962842253718;
>>
>> x0(3, atm_1) = -14.70521127008194284258;
>> x0(3, atm_2) = -14.90664483562995812349;
>> x0(3, atm_3) = -15.15318116034974238460;
>>
>> x1(1, atm_1) = 7.37472831962579089549;
>> x1(1, atm_2) = 11.58546676297256716737;
>> x1(1, atm_3) = 6.48373270122986600228;
>>
>> x1(2, atm_1) = 0.50367624590281323549;
>> x1(2, atm_2) = 25.98966495033111101520 - 29.0969602342;
>> x1(2, atm_3) = 0.44969881379113374464;
>>
>> x1(3, atm_1) = -14.74963003569916608626;
>> x1(3, atm_2) = -15.53454241815419933914;
>> x1(3, atm_3) = -15.08733880938599369870;
>>
>> And what will happen is that
>>
>> al2be2 = 1.599492
>> gama = -1.454111
>>
>> Feed this into line 1491 of shake.F90:
>>
>> sinthe = (alpa*gama - beta * sqrt(al2be2 - gama * gama)) / al2be2
>>
>> and sqrt(al2be2 - gama * gama) becomes sqrt(-0.514948) and that's a NaN.
>> All else follows from this...
>>
>> As to why you haven't seen this elsewhere, some possibilities:
>>
>> 1. You actually have, simulations blow up all the time. How often have
>> you root caused the exact moment of failure rather than just run it again
>> when it's something this rare?
>> 2. Perhaps using vlimit is a sufficient band-aid to prevent this most of
>> the time.
>> 3. The bug exists in every way in the FORTRAN code, but its probability
>> of occurrence is higher when single-precision is in the mix in some subtle
>> way. Because no matter how you slice this, the FORTRAN code has a whole
>> bunch of sqrt operations here that are unguarded from when their operand
>> goes negative.
>> 4. I'm missing something in my shake code.
>> 5. pmemd.cuda is hosed somehow.
>>
>> I personally would lay even odds on all of these. The big mystery for me
>> is why this *seems* to happen more with small systems. It makes me wonder
>> if there is some bizarro boundary effect going on in PBC having to do with
>> the relative surface area of the boundary of the box relative to its volume.
>>
>> Ideas?
>>
>>
>>
>>
>> On Sat, Aug 17, 2013 at 9:12 AM, Carlos Simmerling <
>> carlos.simmerling.gmail.com> wrote:
>>
>>> I've never seen this in 20 years, except in free energy calculations
>>> where
>>> the oxygen is being made smaller. In my opinion its probably a symptom,
>>> not
>>> the problem, unless the simulations are really different from anything
>>> I've
>>> ever done.
>>> On Aug 16, 2013 10:39 PM, "Ross Walker" <ross.rosswalker.co.uk> wrote:
>>>
>>> > So that suggests to me that this could indeed be a bug related to
>>> having a
>>> > small water box - has anyone seen this happen with the CPU code? A
>>> repro
>>> > there would be very telling.
>>> >
>>> > Also, this is NPT yes? As in it is not some NVT water box at some
>>> > artificially high compression?
>>> >
>>> > All the best
>>> > Ross
>>> >
>>> >
>>> >
>>> >
>>> > On 8/16/13 7:33 PM, "B. Lachele Foley" <lfoley.ccrc.uga.edu> wrote:
>>> >
>>> > >I can only add the reminder that our main workaround has been to
>>> increase
>>> > >the size of the water box. That is, it disproportionately seems to
>>> > >affect smaller systems. The locals who run simulations of really
>>> large
>>> > >systems never seem to see this. Sure, they can't run as long, but
>>> there
>>> > >are many more interactions in each step. I suspect most lipid
>>> > >simulations are also much larger than some of the systems we run. I
>>> > >can't promise that this information is useful, but there it is.
>>> > >
>>> > >FWIW, we're going to re-run some simulations that failed, but with
>>> our HO
>>> > >vdW terms on the HW, just to see what happens.
>>> > >
>>> > >We haven't seen it in CPU. I realize that doesn't mean it couldn't
>>> > >happen there.
>>> > >
>>> > >:-) Lachele
>>> > >
>>> > >Dr. B. Lachele Foley
>>> > >Complex Carbohydrate Research Center
>>> > >The University of Georgia
>>> > >Athens, GA USA
>>> > >lfoley.uga.edu
>>> > >http://glycam.ccrc.uga.edu
>>> > >
>>> > >________________________________________
>>> > >From: Ross Walker [ross.rosswalker.co.uk]
>>> > >Sent: Friday, August 16, 2013 10:02 PM
>>> > >To: AMBER Developers Mailing List
>>> > >Subject: Re: [AMBER-Developers] Zero VDW radii causing havoc.
>>> > >
>>> > >Scott, I'm still not sure I am convinced on this being the issue. We
>>> have
>>> > >many many simulations done with TIP3P water - probably hundreds of
>>> > >milliseconds in aggregate, a lot with boosted potentials and other
>>> types
>>> > >of enhanced sampling and haven't seen this issue. Sure we've seen
>>> problems
>>> > >with zero VDW hydroxyl hydrogens before - but those are always when
>>> they
>>> > >are constrained to be close to each other or with massive attraction
>>> such
>>> > >as in some sugars and phosphates. I don't think I've ever seen
>>> problems
>>> > >with TIP3P before since the bonds are rigid and the H's are, I
>>> believe,
>>> > >inside the VDW sphere of the oxygen. Hence I would never expect to see
>>> > >this for two waters, for a water and a strained hydroxyl perhaps but
>>> even
>>> > >then unlikely.
>>> > >
>>> > >Are you certain this is a real problem and not a strange artifact,
>>> and or
>>> > >subtle bug? - Do we have any repro's for the CPU code?
>>> > >
>>> > >The only place I'd expect there could be problems with TIP3P is if it
>>> is
>>> > >run without shake. So something must be unique about the examples here
>>> > >that are showing the repro that doesn't normally occur in protein
>>> and/or
>>> > >dna simulations.
>>> > >
>>> > >Ps. we've never seen a problem with the lipid12 force field as far as
>>> I
>>> > >know and we have over 100 microsecs of simulation on these systems.
>>> > >
>>> > >Hence I am still a little suspicious...
>>> > >
>>> > >All the best
>>> > >Ross
>>> > >
>>> > >
>>> > >
>>> > >On 8/16/13 6:20 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
>>> > >
>>> > >>Yep, that's why I moved the bug over to pmemd and away from
>>> pmemd.cuda...
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >>On Fri, Aug 16, 2013 at 6:17 PM, B. Lachele Foley
>>> > >><lfoley.ccrc.uga.edu>wrote:
>>> > >>
>>> > >>> We only had one, the HO (Ho) term. It's been made non-zero which
>>> at
>>> > >>>least
>>> > >>> fixed the issue in the highly charged, perfectly aligned situation
>>> > >>>where we
>>> > >>> first saw it. BTW, we (specifically Matt) first observed it on
>>> CPU.
>>> > >>>
>>> > >>> Can changing the LJ A term like that potentially mess up other
>>> things
>>> > >>>like
>>> > >>> hydrogen bonding?
>>> > >>>
>>> > >>> :-) Lachele
>>> > >>>
>>> > >>> Dr. B. Lachele Foley
>>> > >>> Complex Carbohydrate Research Center
>>> > >>> The University of Georgia
>>> > >>> Athens, GA USA
>>> > >>> lfoley.uga.edu
>>> > >>> http://glycam.ccrc.uga.edu
>>> > >>>
>>> > >>> ________________________________________
>>> > >>> From: Scott Le Grand [varelse2005.gmail.com]
>>> > >>> Sent: Friday, August 16, 2013 9:11 PM
>>> > >>> To: AMBER Developers Mailing List
>>> > >>> Subject: Re: [AMBER-Developers] Zero VDW radii causing havoc.
>>> > >>>
>>> > >>> There are a heck of a lot more TIP3P waters running around your
>>> test
>>> > >>>case.
>>> > >>> Anything with an effective zero VDW radius is a problem.
>>> > >>>
>>> > >>> 20 years ago, I used to see these when I did global conformational
>>> > >>>search
>>> > >>> of proteins. And when it happened, energies would plummet into a
>>> deep
>>> > >>> abyss and hose the search process. The only thing different with
>>> MD is
>>> > >>> that the finite timestep messes up a water badly enough that the
>>> fast
>>> > >>>Shake
>>> > >>> recovers it into a bizarre position that now hits a bunch of other
>>> > >>>atoms
>>> > >>> and then all you-know-what breaks loose and we're off to
>>> NaNdyland...
>>> > >>>
>>> > >>> The more things change...
>>> > >>>
>>> > >>>
>>> > >>> On Fri, Aug 16, 2013 at 6:06 PM, B. Lachele Foley <
>>> lfoley.ccrc.uga.edu
>>> > >>> >wrote:
>>> > >>>
>>> > >>> > Jason said: "The problem-case hydrogens in the relevant
>>> > >>>carbohydrates
>>> > >>> > (and lipid?) force fields should be modified if they represent
>>> > >>>problems
>>> > >>> in
>>> > >>> > normal simulation. "
>>> > >>> >
>>> > >>> > FYI: Scott Le Grand, in the bug report Ross mentioned, just
>>> said it
>>> > >>>was
>>> > >>> > two TIP3P waters doing it.
>>> > >>> >
>>> > >>> > :-) Lachele
>>> > >>> >
>>> > >>> > Dr. B. Lachele Foley
>>> > >>> > Complex Carbohydrate Research Center
>>> > >>> > The University of Georgia
>>> > >>> > Athens, GA USA
>>> > >>> > lfoley.uga.edu
>>> > >>> > http://glycam.ccrc.uga.edu
>>> > >>> >
>>> > >>> > ________________________________________
>>> > >>> > From: Jason Swails [jason.swails.gmail.com]
>>> > >>> > Sent: Friday, August 16, 2013 3:28 PM
>>> > >>> > To: AMBER Developers Mailing List
>>> > >>> > Subject: Re: [AMBER-Developers] Zero VDW radii causing havoc.
>>> > >>> >
>>> > >>> > On Aug 16, 2013, at 11:39 AM, David A Case <
>>> case.biomaps.rutgers.edu
>>> > >
>>> > >>> > wrote:
>>> > >>> >
>>> > >>> > > On Fri, Aug 16, 2013, Ross Walker wrote:
>>> > >>> > >
>>> > >>> > >> So I am proposing that we have code in rdparm that detects
>>> zero
>>> > >>>VDW
>>> > >>> > radii,
>>> > >>> > >> prints a warning message about it and then adjusts these
>>> radii in
>>> > >>>the
>>> > >>> > code
>>> > >>> > >> to 0.0000001.
>>> > >>> > >
>>> > >>> > > I don't understand how/why this fixes any problem. First, you
>>> > >>>would
>>> > >>> have
>>> > >>> > > to get to distances comparable to the radius for such a term to
>>> > >>>have
>>> > >>> any
>>> > >>> > > effect. You would have already very much messed things up by
>>> that
>>> > >>> point.
>>> > >>> > > Second, atoms like OH have a zero epsilon as well as a zero
>>> radius,
>>> > >>>so
>>> > >>> > > just changing the radius would have no effect.
>>> > >>> >
>>> > >>> > I think Ross meant changing the A and B coefficients to 1E-6 or
>>> so,
>>> > >>>not
>>> > >>> > the pre-combined values.
>>> > >>> >
>>> > >>> > I'm with Dave on this though. I've run many microseconds of net
>>> > >>> simulation
>>> > >>> > and I've never seen a problem with zero vdW terms in HO atom
>>> types. I
>>> > >>> > realize there are some cases where this bug hits you, but as I
>>> recall
>>> > >>>it
>>> > >>> is
>>> > >>> > a strange case with carbohydrates. I'm wary about just making
>>> these
>>> > >>> changes
>>> > >>> > under the assumption it will have only remedial effects. The
>>> > >>>problem-case
>>> > >>> > hydrogens in the relevant carbohydrates (and lipid?) force fields
>>> > >>>should
>>> > >>> be
>>> > >>> > modified if they represent problems in normal simulation.
>>> > >>> >
>>> > >>> > I can also quite easily write a quick parmed action to implement
>>> > >>>these
>>> > >>> > changes without having to hard-code this new convention
>>> immediately
>>> > >>>into
>>> > >>> > every md engine we have.
>>> > >>> >
>>> > >>> > Also, I think that the amber codes should implement the Amber
>>> force
>>> > >>>field
>>> > >>> > exactly as it is defined in the prmtop. Implicitly changing
>>> > >>>parameters in
>>> > >>> > the MD codes is a bad idea, IMO. (How hard would it have been to
>>> > >>>validate
>>> > >>> > chamber if CHARMM took that approach?) So if we decide to
>>> implement
>>> > >>>this
>>> > >>> > change, it should be done in LEaP, not rdparm (but again, I
>>> think the
>>> > >>>LJ
>>> > >>> > terms should be derivable straight from the combining rules).
>>> > >>> >
>>> > >>> > > Note that Istvan's lmodprmtop hack directly changes the LJ A
>>> term
>>> > >>>from
>>> > >>> > > zero to 1000. This has the advantage of not assuming any
>>> > >>>particular
>>> > >>> > mixing
>>> > >>> > > rules, i.e. it still works when NBFIX like changes have been
>>> used.
>>> > >>> >
>>> > >>> > As an aside, should lmodprmtop functionality be incorporated into
>>> > >>>parmed
>>> > >>> > and/or cpptraj if it does simple things? (I'm not sure what else
>>> it
>>> > >>> does...)
>>> > >>> >
>>> > >>> > All the best,
>>> > >>> > Jason
>>> > >>> >
>>> > >>> > _______________________________________________
>>> > >>> > AMBER-Developers mailing list
>>> > >>> > AMBER-Developers.ambermd.org
>>> > >>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> > >>> >
>>> > >>> >
>>> > >>> >
>>> > >>> > _______________________________________________
>>> > >>> > AMBER-Developers mailing list
>>> > >>> > AMBER-Developers.ambermd.org
>>> > >>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> > >>> >
>>> > >>> _______________________________________________
>>> > >>> AMBER-Developers mailing list
>>> > >>> AMBER-Developers.ambermd.org
>>> > >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> _______________________________________________
>>> > >>> AMBER-Developers mailing list
>>> > >>> AMBER-Developers.ambermd.org
>>> > >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> > >>>
>>> > >>_______________________________________________
>>> > >>AMBER-Developers mailing list
>>> > >>AMBER-Developers.ambermd.org
>>> > >>http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> > >
>>> > >
>>> > >
>>> > >_______________________________________________
>>> > >AMBER-Developers mailing list
>>> > >AMBER-Developers.ambermd.org
>>> > >http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> > >
>>> > >
>>> > >
>>> > >_______________________________________________
>>> > >AMBER-Developers mailing list
>>> > >AMBER-Developers.ambermd.org
>>> > >http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > AMBER-Developers mailing list
>>> > AMBER-Developers.ambermd.org
>>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>>> >
>>> _______________________________________________
>>> AMBER-Developers mailing list
>>> AMBER-Developers.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>>
>>
>>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sat Aug 17 2013 - 12:00:03 PDT