Re: [AMBER-Developers] Zero VDW radii causing havoc.

From: Scott Le Grand <varelse2005.gmail.com>
Date: Sat, 17 Aug 2013 12:13:38 -0700

I am building just such a trajectory now, but the catch is that I can only
do it with restrt snapshots today, which I'm about to upload.

On the GPU, the previous two frames are sufficient to cause the blow-up to
occur immediately (step 484372 and 484373) because you're seg-faulting on
the negative square root in Fast Shake. Steps 484371 and 484370 do not
blow up because the heat bath appears to be sufficient to prevent the
ensuing collapse. On the CPU, only restrt484370 escapes the collapse but
check out what happens to the simulation temperature:


 NSTEP = 1 TIME(PS) = 10706.742 TEMP(K) = 328.82 PRESS =
6541.8
 Etot = -5427.37807722944581 EKtot = 1587.52086597083712
EPtot = -7014.89894320028270
 BOND = 11.38406747053108 ANGLE = 18.81709177178301
DIHED = 7.19017644294600
 1-4 NB = 7.66086106809208 1-4 EEL = 255.31785298099652
VDWAALS = 1689.55404001700549
 EELEC = -9004.82303295163729 EHBOND = 0.00000000000000
RESTRAINT = 0.00000000000000
 EKCMT = 749.31401051199646 VIRIAL = -2687.14512591492621
VOLUME = 24329.66160424941336
                                                    Density =
0.9955
 Ewald error estimate: 0.9386E-04
 ------------------------------------------------------------------------------


 NSTEP = 2 TIME(PS) = 10706.744 TEMP(K) = *498.38!!!!* PRESS
= 1767.3
 Etot = -5141.52026163739356 EKtot = 2406.10580973120068
EPtot = -7547.62607136859424
 BOND = 11.62265478788508 ANGLE = 18.81523817167137
DIHED = 7.15938405355967
 1-4 NB = 7.60072657947897 1-4 EEL = 255.06327276872423
VDWAALS = 1128.67474435666304
 EELEC = -8976.56209208657674 EHBOND = 0.00000000000000
RESTRAINT = 0.00000000000000
 EKCMT = 1593.52713719002031 VIRIAL = 664.87852338194170
VOLUME = 24336.75904564626762
                                                    Density =
0.9952
 Ewald error estimate: 0.1583E-04
 ------------------------------------------------------------------------------



On Sat, Aug 17, 2013 at 12:03 PM, B. Lachele Foley <lfoley.ccrc.uga.edu>wrote:

> Did you make those input files or did one of us send them to you? In
> particular, I wonder where the inpcrd came from. If it is from just before
> the blowup, then, yes, totally, it would happen on CPU or anywhere.
>
> I can build something like the system in your attachment but with our Ho
> vdW on the HW waters. Will do that asap.
>
> Would it be possible for you to save a short trajectory of maybe 50 frames
> from just before a blowup. Also let me know which prmtop it goes with (or
> just include it). If saving a trajectory is hard, then just dump the
> coordinates in proper order, and I can reformat. I'd like to take a look
> at the system.
>
> :-) Lachele
>
> Dr. B. Lachele Foley
> Complex Carbohydrate Research Center
> The University of Georgia
> Athens, GA USA
> lfoley.uga.edu
> http://glycam.ccrc.uga.edu
>
> ________________________________________
> From: Scott Le Grand [varelse2005.gmail.com]
> Sent: Saturday, August 17, 2013 2:45 PM
> To: AMBER Developers Mailing List
> Subject: Re: [AMBER-Developers] Zero VDW radii causing havoc.
>
> And I have reproed this on the CPU...
>
> vlimit exceeded for step 0; vmax = 4745.1074
>
> NSTEP = 1 TIME(PS) = 10706.752 TEMP(K) = NaN PRESS
> =********
> Etot = NaN EKtot = NaN
> EPtot = 236869.95967510639457
> BOND = 8.22942084594198 ANGLE = 18.89573879915813
> DIHED = 7.22106560685035
> 1-4 NB = 7.67373460336594 1-4 EEL = 253.62692486064071
> VDWAALS = 246319.07640994922258
> EELEC = -9744.76361955880566 EHBOND = 0.00000000000000
> RESTRAINT = 0.00000000000000
> EKCMT = 967.38657273648380 VIRIAL = -1491089.48552867863327
> VOLUME = 24328.24718473819667
> Density =
> 0.9956
> Ewald error estimate: 0.9828E-04
>
> ------------------------------------------------------------------------------
>
> Repro data now attached to bug 180. The reason you haven't seen this
> before is that the CPU behavior is to seg fault. Don't tell me you've
> never seen PMEMD seg fault :-)...
>
> This is what the same $!#%ing looks like on the CPU:
>
> slegrand.amber:~/pmemd/src/b1$ ../pmemd -O
> Segmentation fault
> slegrand.amber:~/pmemd/src/b1$
>
>
>
>
>
> On Sat, Aug 17, 2013 at 10:30 AM, Scott Le Grand <varelse2005.gmail.com
> >wrote:
>
> > PS one more possibility - netfrc matters for small systems running NTP
> but
> > somewhere along the line its impact is lost in the noise of larger
> > systems...
> >
> >
> >
> >
> > On Sat, Aug 17, 2013 at 10:04 AM, Scott Le Grand <varelse2005.gmail.com
> >wrote:
> >
> >> So I have root caused why the nans appear: Fast Shake gets overwhelmed.
> >>
> >> At the point of failure, feed these numbers into fast shake:
> >>
> >> x0(1, atm_1) = 7.28912097579671414138;
> >> x0(1, atm_2) = 7.72543530415168433478;
> >> x0(1, atm_3) = 6.44526356757315177504;
> >>
> >> x0(2, atm_1) = 0.59867993601737978793;
> >> x0(2, atm_2) = 28.86782004149711866603 - 29.0969602342;
> >> x0(2, atm_3) = 0.53986248962842253718;
> >>
> >> x0(3, atm_1) = -14.70521127008194284258;
> >> x0(3, atm_2) = -14.90664483562995812349;
> >> x0(3, atm_3) = -15.15318116034974238460;
> >>
> >> x1(1, atm_1) = 7.37472831962579089549;
> >> x1(1, atm_2) = 11.58546676297256716737;
> >> x1(1, atm_3) = 6.48373270122986600228;
> >>
> >> x1(2, atm_1) = 0.50367624590281323549;
> >> x1(2, atm_2) = 25.98966495033111101520 - 29.0969602342;
> >> x1(2, atm_3) = 0.44969881379113374464;
> >>
> >> x1(3, atm_1) = -14.74963003569916608626;
> >> x1(3, atm_2) = -15.53454241815419933914;
> >> x1(3, atm_3) = -15.08733880938599369870;
> >>
> >> And what will happen is that
> >>
> >> al2be2 = 1.599492
> >> gama = -1.454111
> >>
> >> Feed this into line 1491 of shake.F90:
> >>
> >> sinthe = (alpa*gama - beta * sqrt(al2be2 - gama * gama)) / al2be2
> >>
> >> and sqrt(al2be2 - gama * gama) becomes sqrt(-0.514948) and that's a NaN.
> >> All else follows from this...
> >>
> >> As to why you haven't seen this elsewhere, some possibilities:
> >>
> >> 1. You actually have, simulations blow up all the time. How often have
> >> you root caused the exact moment of failure rather than just run it
> again
> >> when it's something this rare?
> >> 2. Perhaps using vlimit is a sufficient band-aid to prevent this most of
> >> the time.
> >> 3. The bug exists in every way in the FORTRAN code, but its probability
> >> of occurrence is higher when single-precision is in the mix in some
> subtle
> >> way. Because no matter how you slice this, the FORTRAN code has a whole
> >> bunch of sqrt operations here that are unguarded from when their operand
> >> goes negative.
> >> 4. I'm missing something in my shake code.
> >> 5. pmemd.cuda is hosed somehow.
> >>
> >> I personally would lay even odds on all of these. The big mystery for
> me
> >> is why this *seems* to happen more with small systems. It makes me
> wonder
> >> if there is some bizarro boundary effect going on in PBC having to do
> with
> >> the relative surface area of the boundary of the box relative to its
> volume.
> >>
> >> Ideas?
> >>
> >>
> >>
> >>
> >> On Sat, Aug 17, 2013 at 9:12 AM, Carlos Simmerling <
> >> carlos.simmerling.gmail.com> wrote:
> >>
> >>> I've never seen this in 20 years, except in free energy calculations
> >>> where
> >>> the oxygen is being made smaller. In my opinion its probably a symptom,
> >>> not
> >>> the problem, unless the simulations are really different from anything
> >>> I've
> >>> ever done.
> >>> On Aug 16, 2013 10:39 PM, "Ross Walker" <ross.rosswalker.co.uk> wrote:
> >>>
> >>> > So that suggests to me that this could indeed be a bug related to
> >>> having a
> >>> > small water box - has anyone seen this happen with the CPU code? A
> >>> repro
> >>> > there would be very telling.
> >>> >
> >>> > Also, this is NPT yes? As in it is not some NVT water box at some
> >>> > artificially high compression?
> >>> >
> >>> > All the best
> >>> > Ross
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On 8/16/13 7:33 PM, "B. Lachele Foley" <lfoley.ccrc.uga.edu> wrote:
> >>> >
> >>> > >I can only add the reminder that our main workaround has been to
> >>> increase
> >>> > >the size of the water box. That is, it disproportionately seems to
> >>> > >affect smaller systems. The locals who run simulations of really
> >>> large
> >>> > >systems never seem to see this. Sure, they can't run as long, but
> >>> there
> >>> > >are many more interactions in each step. I suspect most lipid
> >>> > >simulations are also much larger than some of the systems we run. I
> >>> > >can't promise that this information is useful, but there it is.
> >>> > >
> >>> > >FWIW, we're going to re-run some simulations that failed, but with
> >>> our HO
> >>> > >vdW terms on the HW, just to see what happens.
> >>> > >
> >>> > >We haven't seen it in CPU. I realize that doesn't mean it couldn't
> >>> > >happen there.
> >>> > >
> >>> > >:-) Lachele
> >>> > >
> >>> > >Dr. B. Lachele Foley
> >>> > >Complex Carbohydrate Research Center
> >>> > >The University of Georgia
> >>> > >Athens, GA USA
> >>> > >lfoley.uga.edu
> >>> > >http://glycam.ccrc.uga.edu
> >>> > >
> >>> > >________________________________________
> >>> > >From: Ross Walker [ross.rosswalker.co.uk]
> >>> > >Sent: Friday, August 16, 2013 10:02 PM
> >>> > >To: AMBER Developers Mailing List
> >>> > >Subject: Re: [AMBER-Developers] Zero VDW radii causing havoc.
> >>> > >
> >>> > >Scott, I'm still not sure I am convinced on this being the issue. We
> >>> have
> >>> > >many many simulations done with TIP3P water - probably hundreds of
> >>> > >milliseconds in aggregate, a lot with boosted potentials and other
> >>> types
> >>> > >of enhanced sampling and haven't seen this issue. Sure we've seen
> >>> problems
> >>> > >with zero VDW hydroxyl hydrogens before - but those are always when
> >>> they
> >>> > >are constrained to be close to each other or with massive attraction
> >>> such
> >>> > >as in some sugars and phosphates. I don't think I've ever seen
> >>> problems
> >>> > >with TIP3P before since the bonds are rigid and the H's are, I
> >>> believe,
> >>> > >inside the VDW sphere of the oxygen. Hence I would never expect to
> see
> >>> > >this for two waters, for a water and a strained hydroxyl perhaps but
> >>> even
> >>> > >then unlikely.
> >>> > >
> >>> > >Are you certain this is a real problem and not a strange artifact,
> >>> and or
> >>> > >subtle bug? - Do we have any repro's for the CPU code?
> >>> > >
> >>> > >The only place I'd expect there could be problems with TIP3P is if
> it
> >>> is
> >>> > >run without shake. So something must be unique about the examples
> here
> >>> > >that are showing the repro that doesn't normally occur in protein
> >>> and/or
> >>> > >dna simulations.
> >>> > >
> >>> > >Ps. we've never seen a problem with the lipid12 force field as far
> as
> >>> I
> >>> > >know and we have over 100 microsecs of simulation on these systems.
> >>> > >
> >>> > >Hence I am still a little suspicious...
> >>> > >
> >>> > >All the best
> >>> > >Ross
> >>> > >
> >>> > >
> >>> > >
> >>> > >On 8/16/13 6:20 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
> >>> > >
> >>> > >>Yep, that's why I moved the bug over to pmemd and away from
> >>> pmemd.cuda...
> >>> > >>
> >>> > >>
> >>> > >>
> >>> > >>
> >>> > >>On Fri, Aug 16, 2013 at 6:17 PM, B. Lachele Foley
> >>> > >><lfoley.ccrc.uga.edu>wrote:
> >>> > >>
> >>> > >>> We only had one, the HO (Ho) term. It's been made non-zero which
> >>> at
> >>> > >>>least
> >>> > >>> fixed the issue in the highly charged, perfectly aligned
> situation
> >>> > >>>where we
> >>> > >>> first saw it. BTW, we (specifically Matt) first observed it on
> >>> CPU.
> >>> > >>>
> >>> > >>> Can changing the LJ A term like that potentially mess up other
> >>> things
> >>> > >>>like
> >>> > >>> hydrogen bonding?
> >>> > >>>
> >>> > >>> :-) Lachele
> >>> > >>>
> >>> > >>> Dr. B. Lachele Foley
> >>> > >>> Complex Carbohydrate Research Center
> >>> > >>> The University of Georgia
> >>> > >>> Athens, GA USA
> >>> > >>> lfoley.uga.edu
> >>> > >>> http://glycam.ccrc.uga.edu
> >>> > >>>
> >>> > >>> ________________________________________
> >>> > >>> From: Scott Le Grand [varelse2005.gmail.com]
> >>> > >>> Sent: Friday, August 16, 2013 9:11 PM
> >>> > >>> To: AMBER Developers Mailing List
> >>> > >>> Subject: Re: [AMBER-Developers] Zero VDW radii causing havoc.
> >>> > >>>
> >>> > >>> There are a heck of a lot more TIP3P waters running around your
> >>> test
> >>> > >>>case.
> >>> > >>> Anything with an effective zero VDW radius is a problem.
> >>> > >>>
> >>> > >>> 20 years ago, I used to see these when I did global
> conformational
> >>> > >>>search
> >>> > >>> of proteins. And when it happened, energies would plummet into a
> >>> deep
> >>> > >>> abyss and hose the search process. The only thing different with
> >>> MD is
> >>> > >>> that the finite timestep messes up a water badly enough that the
> >>> fast
> >>> > >>>Shake
> >>> > >>> recovers it into a bizarre position that now hits a bunch of
> other
> >>> > >>>atoms
> >>> > >>> and then all you-know-what breaks loose and we're off to
> >>> NaNdyland...
> >>> > >>>
> >>> > >>> The more things change...
> >>> > >>>
> >>> > >>>
> >>> > >>> On Fri, Aug 16, 2013 at 6:06 PM, B. Lachele Foley <
> >>> lfoley.ccrc.uga.edu
> >>> > >>> >wrote:
> >>> > >>>
> >>> > >>> > Jason said: "The problem-case hydrogens in the relevant
> >>> > >>>carbohydrates
> >>> > >>> > (and lipid?) force fields should be modified if they represent
> >>> > >>>problems
> >>> > >>> in
> >>> > >>> > normal simulation. "
> >>> > >>> >
> >>> > >>> > FYI: Scott Le Grand, in the bug report Ross mentioned, just
> >>> said it
> >>> > >>>was
> >>> > >>> > two TIP3P waters doing it.
> >>> > >>> >
> >>> > >>> > :-) Lachele
> >>> > >>> >
> >>> > >>> > Dr. B. Lachele Foley
> >>> > >>> > Complex Carbohydrate Research Center
> >>> > >>> > The University of Georgia
> >>> > >>> > Athens, GA USA
> >>> > >>> > lfoley.uga.edu
> >>> > >>> > http://glycam.ccrc.uga.edu
> >>> > >>> >
> >>> > >>> > ________________________________________
> >>> > >>> > From: Jason Swails [jason.swails.gmail.com]
> >>> > >>> > Sent: Friday, August 16, 2013 3:28 PM
> >>> > >>> > To: AMBER Developers Mailing List
> >>> > >>> > Subject: Re: [AMBER-Developers] Zero VDW radii causing havoc.
> >>> > >>> >
> >>> > >>> > On Aug 16, 2013, at 11:39 AM, David A Case <
> >>> case.biomaps.rutgers.edu
> >>> > >
> >>> > >>> > wrote:
> >>> > >>> >
> >>> > >>> > > On Fri, Aug 16, 2013, Ross Walker wrote:
> >>> > >>> > >
> >>> > >>> > >> So I am proposing that we have code in rdparm that detects
> >>> zero
> >>> > >>>VDW
> >>> > >>> > radii,
> >>> > >>> > >> prints a warning message about it and then adjusts these
> >>> radii in
> >>> > >>>the
> >>> > >>> > code
> >>> > >>> > >> to 0.0000001.
> >>> > >>> > >
> >>> > >>> > > I don't understand how/why this fixes any problem. First,
> you
> >>> > >>>would
> >>> > >>> have
> >>> > >>> > > to get to distances comparable to the radius for such a term
> to
> >>> > >>>have
> >>> > >>> any
> >>> > >>> > > effect. You would have already very much messed things up by
> >>> that
> >>> > >>> point.
> >>> > >>> > > Second, atoms like OH have a zero epsilon as well as a zero
> >>> radius,
> >>> > >>>so
> >>> > >>> > > just changing the radius would have no effect.
> >>> > >>> >
> >>> > >>> > I think Ross meant changing the A and B coefficients to 1E-6 or
> >>> so,
> >>> > >>>not
> >>> > >>> > the pre-combined values.
> >>> > >>> >
> >>> > >>> > I'm with Dave on this though. I've run many microseconds of net
> >>> > >>> simulation
> >>> > >>> > and I've never seen a problem with zero vdW terms in HO atom
> >>> types. I
> >>> > >>> > realize there are some cases where this bug hits you, but as I
> >>> recall
> >>> > >>>it
> >>> > >>> is
> >>> > >>> > a strange case with carbohydrates. I'm wary about just making
> >>> these
> >>> > >>> changes
> >>> > >>> > under the assumption it will have only remedial effects. The
> >>> > >>>problem-case
> >>> > >>> > hydrogens in the relevant carbohydrates (and lipid?) force
> fields
> >>> > >>>should
> >>> > >>> be
> >>> > >>> > modified if they represent problems in normal simulation.
> >>> > >>> >
> >>> > >>> > I can also quite easily write a quick parmed action to
> implement
> >>> > >>>these
> >>> > >>> > changes without having to hard-code this new convention
> >>> immediately
> >>> > >>>into
> >>> > >>> > every md engine we have.
> >>> > >>> >
> >>> > >>> > Also, I think that the amber codes should implement the Amber
> >>> force
> >>> > >>>field
> >>> > >>> > exactly as it is defined in the prmtop. Implicitly changing
> >>> > >>>parameters in
> >>> > >>> > the MD codes is a bad idea, IMO. (How hard would it have been
> to
> >>> > >>>validate
> >>> > >>> > chamber if CHARMM took that approach?) So if we decide to
> >>> implement
> >>> > >>>this
> >>> > >>> > change, it should be done in LEaP, not rdparm (but again, I
> >>> think the
> >>> > >>>LJ
> >>> > >>> > terms should be derivable straight from the combining rules).
> >>> > >>> >
> >>> > >>> > > Note that Istvan's lmodprmtop hack directly changes the LJ A
> >>> term
> >>> > >>>from
> >>> > >>> > > zero to 1000. This has the advantage of not assuming any
> >>> > >>>particular
> >>> > >>> > mixing
> >>> > >>> > > rules, i.e. it still works when NBFIX like changes have been
> >>> used.
> >>> > >>> >
> >>> > >>> > As an aside, should lmodprmtop functionality be incorporated
> into
> >>> > >>>parmed
> >>> > >>> > and/or cpptraj if it does simple things? (I'm not sure what
> else
> >>> it
> >>> > >>> does...)
> >>> > >>> >
> >>> > >>> > All the best,
> >>> > >>> > Jason
> >>> > >>> >
> >>> > >>> > _______________________________________________
> >>> > >>> > AMBER-Developers mailing list
> >>> > >>> > AMBER-Developers.ambermd.org
> >>> > >>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> > >>> >
> >>> > >>> >
> >>> > >>> >
> >>> > >>> > _______________________________________________
> >>> > >>> > AMBER-Developers mailing list
> >>> > >>> > AMBER-Developers.ambermd.org
> >>> > >>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> > >>> >
> >>> > >>> _______________________________________________
> >>> > >>> AMBER-Developers mailing list
> >>> > >>> AMBER-Developers.ambermd.org
> >>> > >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> > >>>
> >>> > >>>
> >>> > >>>
> >>> > >>> _______________________________________________
> >>> > >>> AMBER-Developers mailing list
> >>> > >>> AMBER-Developers.ambermd.org
> >>> > >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> > >>>
> >>> > >>_______________________________________________
> >>> > >>AMBER-Developers mailing list
> >>> > >>AMBER-Developers.ambermd.org
> >>> > >>http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> > >
> >>> > >
> >>> > >
> >>> > >_______________________________________________
> >>> > >AMBER-Developers mailing list
> >>> > >AMBER-Developers.ambermd.org
> >>> > >http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> > >
> >>> > >
> >>> > >
> >>> > >_______________________________________________
> >>> > >AMBER-Developers mailing list
> >>> > >AMBER-Developers.ambermd.org
> >>> > >http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> >
> >>> >
> >>> >
> >>> > _______________________________________________
> >>> > AMBER-Developers mailing list
> >>> > AMBER-Developers.ambermd.org
> >>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>> >
> >>> _______________________________________________
> >>> AMBER-Developers mailing list
> >>> AMBER-Developers.ambermd.org
> >>> http://lists.ambermd.org/mailman/listinfo/amber-developers
> >>>
> >>
> >>
> >
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
>
>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sat Aug 17 2013 - 12:30:03 PDT
Custom Search