Thank you, Josh and Ross, for your responses, advice, and suggestions. I
definitely appreciate your time and wisdom.
This is what I gathered:
(1) atm_frc is allocated,
(2) values from atm_frc are uploaded to the GPU to populate the pForce array
during initial setup,
(3) whenever gpu_download_frc is called (mostly for MPI), atm_frc is
repopulated with the actual forces from the pForce array.
*gpu_download_frc is not called in my runs
But it appears that the garbage values in the atm_frc array get overwritten
by the actual forces during the gpu_download_frc call, so leaving atm_frc
uninitialized does not affect the calculations. Therefore it is not really a
bug. I still think it would be good practice to initialize the array, if
only to get rid of the overflow note.
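For concreteness, here is the pattern I have in mind, as a standalone sketch
(only atm_frc is a real pmemd name; the program and the array size are
illustrative, not the actual setup code in pmemd.F90):

    ! Sketch only: zero the force array right after allocation, so nothing
    ! downstream (e.g. the first gpu_upload_frc) ever reads stale memory.
    program frc_init_sketch
      implicit none
      integer, parameter :: atm_cnt = 8          ! illustrative size
      double precision, allocatable :: atm_frc(:,:)
      allocate(atm_frc(3, atm_cnt))
      atm_frc(:,:) = 0.d0                        ! the proposed one-line fix
      print *, 'max |frc| after init:', maxval(abs(atm_frc))
    end program frc_init_sketch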
I ran the Amber tests after initializing the atm_frc array to zero and they
all passed. Also, the garbage values in the atm_frc array are values that
were previously stored at that memory location, not values from the current
calculation.
Respectfully,
Kellon
On Thu, Nov 8, 2018 at 10:24 AM Ross Walker <ross.rosswalker.co.uk> wrote:
> I am not sure that's a bug. The atm_frc array needs to be initialized to
> zero for MPI because the code does an mpi_reduce into that array: not every
> element will be touched by every core, and the reduce does an addition to
> the existing element rather than an assignment. When running in serial,
> values are assigned to each atm_frc element rather than added/reduced, and
> thus it should not be necessary to initialize the array.
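>
> As a standalone sketch of that pattern (only atm_frc is a real pmemd name,
> and I am using a plain mpi_allreduce here rather than whatever pmemd
> actually calls): each rank fills only its own atoms, so the untouched
> elements must already be zero or the sum picks up garbage.
>
>     ! Sketch: zero the local buffer, fill only the owned elements,
>     ! then sum the partial forces across all ranks.
>     program frc_reduce_sketch
>       use mpi
>       implicit none
>       integer, parameter :: atm_cnt = 8
>       double precision :: loc_frc(3, atm_cnt), atm_frc(3, atm_cnt)
>       integer :: rank, ierr
>       call mpi_init(ierr)
>       call mpi_comm_rank(mpi_comm_world, rank, ierr)
>       loc_frc(:,:) = 0.d0                   ! required before the reduce
>       if (rank < atm_cnt) loc_frc(:, rank + 1) = 1.d0  ! stand-in forces
>       call mpi_allreduce(loc_frc, atm_frc, 3 * atm_cnt, &
>                          mpi_double_precision, mpi_sum, &
>                          mpi_comm_world, ierr)
>       call mpi_finalize(ierr)
>     end program frc_reduce_sketch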
>
> I would trace through the code and find out exactly where values are
> stored into the atm_frc array in serial to determine if the initialization
> is needed.
>
> All the best
> Ross
>
> > On Nov 8, 2018, at 00:16, Josh Berryman <the.real.josh.berryman.gmail.com> wrote:
> >
> > >>> I looked at the code and found that atm_frc is only initialized to zero
> > >>> when MPI is called (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
> > >>> parallel.F90:2146: atm_frc(:,:) = 0.d0).
> > Well, that looks like a classic memory-management bug that you should
> > commit a fix for.
> >
> > If you are mailing the developers' list, then I guess that means you have
> > gitlab access?
> >
> > Before committing, maybe send an email to whoever else has been committing
> > the most to pmemd.F90 recently, but basically it looks as if you have
> > sorted this out yourself.
> >
> > Josh
> >
> > On Wed, 7 Nov 2018 at 23:07, Kellon Belfon <kellonbelfon.gmail.com> wrote:
> >
> >> Follow-up to my previous email.
> >>
> >> I set the atm_frc array to zero to test whether the large values in the
> >> atm_frc array are garbage/old values.
> >>
> >> To do this, at line 414 in pmemd.F90 I added the following line:
> >> atm_frc(:,:) = 0.d0
> >>
> >> Then I ran the calculation in three sets of ten runs and did not get the
> >> overflow error anymore. I think this suggests that the large values in the
> >> atm_frc array might be garbage values, since atm_frc is not initialized
> >> to zero.
> >> I looked at the code and found that atm_frc is only initialized to zero
> >> when MPI is used (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
> >> parallel.F90:2146: atm_frc(:,:) = 0.d0).
> >>
> >> Also, I compared the mdout files with and without the atm_frc(:,:) = 0.d0
> >> at line 414 in pmemd.F90, and the results are the same for the system I
> >> tested. I am not sure whether this will break anything else; I can run the
> >> Amber standard test cases if that would help.
> >>
> >> Respectfully,
> >>
> >> Kellon
> >>
> >> On Wed, Nov 7, 2018 at 1:40 PM Kellon Belfon <kellonbelfon.gmail.com> wrote:
> >>
> >>> Hi Josh,
> >>>
> >>> Thanks for the response. I think in this case it is not because of my
> >>> system. We have tested other systems in our lab, and all of them give
> >>> the same overflow notes sometimes. This happens during production runs
> >>> too, and the behavior is random.
> >>>
> >>> The overflow only occurs at initialization, when the force array should
> >>> be zero. Displaying the values shows that the array is mostly zero except
> >>> for a sporadic large number in between. This is not always on the same
> >>> atom, and ten runs in triplicate showed the value is sometimes zero and
> >>> sometimes this large number. I am thinking these might just be garbage
> >>> values, or old values from previous use of that memory. Is this normal
> >>> behavior of the code, or have you seen this issue before?
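> >>>
> >>> As a standalone illustration of what I mean (not pmemd code): a freshly
> >>> allocated array just holds whatever bytes were last stored at that
> >>> address, so a program like this can print zeros on one run and huge
> >>> stale values on the next.
> >>>
> >>>     ! Sketch: allocate() does not initialize, so the contents are
> >>>     ! undefined until something assigns to them.
> >>>     program uninit_sketch
> >>>       implicit none
> >>>       double precision, allocatable :: frc(:,:)
> >>>       allocate(frc(3, 4))
> >>>       print *, frc   ! may be zeros, may be stale garbage
> >>>     end program uninit_sketch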
> >>>
> >>> Thanks,
> >>>
> >>> Kellon
> >>>
> >>>
> >>> On Wed, Nov 7, 2018 at 3:13 AM Josh Berryman <the.real.josh.berryman.gmail.com> wrote:
> >>>
> >>>> Hi Kellon, if you are getting forces in the region of 1e29 then your
> >>>> system has severe steric clashes in it: the main problem with
> >>>> Lennard-Jones for non-bonded interactions is that it diverges quickly
> >>>> for close approach of atoms. Probably the answer in that case is to run
> >>>> for 1-10 ps with very high Langevin coupling and a small timestep on the
> >>>> CPU, or to use the xmin option to pre-stabilise your system (again on
> >>>> the CPU, where the 64-bit datatype for floats will give you more
> >>>> headroom against overflows).
> >>>>
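> >>>> Something like the following as a first relaxation input, say. This is
> >>>> only a sketch, so check the values against your own system; the flags
> >>>> are standard &cntrl options, with a 0.5 fs step and an unusually strong
> >>>> Langevin collision frequency.
> >>>>
> >>>>     short CPU relaxation: 5 ps, small step, heavy Langevin damping
> >>>>      &cntrl
> >>>>        imin = 0, nstlim = 10000, dt = 0.0005,
> >>>>        ntt = 3, gamma_ln = 50.0, temp0 = 300.0,
> >>>>        ntc = 2, ntf = 2, cut = 8.0, ig = -1,
> >>>>      /
> >>>>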
> >>>> Google suggests that (as you say you have seen yourself) IEEE_DENORMAL
> >>>> is not a problem: it describes cases where a number comes close enough
> >>>> to zero that maybe it should just be rounded to zero anyway.
> >>>>
> >>>> Josh
> >>>>
> >>>> On Tue, 6 Nov 2018 at 20:17, Kellon Belfon <kellonbelfon.gmail.com> wrote:
> >>>>
> >>>>> Hi Everyone,
> >>>>>
> >>>>> We recently upgraded the compiler on our cluster (GNU 4.8.4 to 7.3.0)
> >>>>> and started getting the following note for our GPU calculations in
> >>>>> Amber:
> >>>>> Note: The following floating-point exceptions are signalling:
> >>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>
> >>>>> From this previous post (http://archive.ambermd.org/201804/0130.html),
> >>>>> the response was pretty much: do not worry about them. Does this apply
> >>>>> to IEEE_OVERFLOW_FLAG as well?
> >>>>> Also, running the calculation with pmemd.cuda_DPFP does not produce the
> >>>>> underflow note. I was thinking maybe it comes from mixing floats and
> >>>>> doubles?
> >>>>>
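> >>>>> A tiny standalone Fortran test (nothing Amber-specific, just the
> >>>>> standard ieee_exceptions module) reproduces the note: any subnormal
> >>>>> result raises the underflow/denormal flags, and gfortran reports
> >>>>> whatever flags are still set when the program exits.
> >>>>>
> >>>>>     ! Sketch: produce a subnormal value, then query the flag behind
> >>>>>     ! gfortran's exit-time note.
> >>>>>     program underflow_sketch
> >>>>>       use, intrinsic :: ieee_exceptions
> >>>>>       implicit none
> >>>>>       real :: x, half
> >>>>>       logical :: raised
> >>>>>       half = 0.5
> >>>>>       x = tiny(1.0) * half        ! subnormal single-precision result
> >>>>>       call ieee_get_flag(ieee_underflow, raised)
> >>>>>       print *, 'underflow signalling:', raised, ' x =', x
> >>>>>     end program underflow_sketch
> >>>>>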
> >>>>> We are getting the overflow note for some of our calculations, but it
> >>>>> does not affect the results. I also used cuda-gdb with the -G -g
> >>>>> -ffpe-trap=overflow flags to stop the code where the overflow occurs. I
> >>>>> found that the overflow occurs in gpu_upload_frc(), during the first
> >>>>> upload of the forces as the system is being initialized on the GPU.
> >>>>> Further debugging showed the note occurs when the atm_frc array has
> >>>>> values that are not zero but instead a large number (atm_frc[i][1] =
> >>>>> -1.5739204096161189e+29). I think this large number causes the overflow
> >>>>> note, since the calculation fails right after (it fails because I set
> >>>>> the ffpe-trap).
> >>>>>
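> >>>>> For reference, this is the trap behavior I am relying on: compiled with
> >>>>> gfortran -g -ffpe-trap=overflow, a standalone program like this dies
> >>>>> with SIGFPE and a backtrace at the overflowing line, which is how the
> >>>>> backtraces below were obtained (sketch, not pmemd code).
> >>>>>
> >>>>>     ! Sketch: an overflowing multiply raises SIGFPE once trapping
> >>>>>     ! is enabled at compile time.
> >>>>>     program overflow_sketch
> >>>>>       implicit none
> >>>>>       real :: big, two
> >>>>>       big = huge(1.0)
> >>>>>       two = 2.0
> >>>>>       big = big * two              ! overflows under the trap
> >>>>>       print *, big
> >>>>>     end program overflow_sketch
> >>>>>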
> >>>>> I then ran the same calculation ten times (with the ffpe-trap, so any
> >>>>> overflow makes the calculation fail) and got the overflow 4 out of 10
> >>>>> times. Then I repeated another three sets of ten runs (2/10, 5/10, 1/10
> >>>>> overflow). It seems like an unpredictable note. For really small
> >>>>> numbers, multiplying by the force scale does the trick, but these large
> >>>>> numbers cause the overflow note, and the behavior is unpredictable.
> >>>>> Does anyone have any advice on this? Should we just ignore it, since
> >>>>> the results are okay?
> >>>>>
> >>>>> Below are the results for one of the trials:
> >>>>> *Run 1:*
> >>>>> Program received signal SIGFPE: Floating-point exception - erroneous
> >>>>> arithmetic operation.
> >>>>> Backtrace for this error:
> >>>>> #0 0x7fb6597912da in ???
> >>>>> #1 0x7fb659790503 in ???
> >>>>> #2 0x7fb658a84f1f in ???
> >>>>> #3 0x556a6ac73f76 in gpu_upload_frc_
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> >>>>> #4 0x556a6abeb897 in pmemd
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>> #5 0x556a6abecbb3 in main
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>> run_direct.sh: line 19: 12686 Floating point exception(core dumped)
> >>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>> *Run 2: *
> >>>>> Program received signal SIGFPE: Floating-point exception - erroneous
> >>>>> arithmetic operation.
> >>>>> Backtrace for this error:
> >>>>> #0 0x7f8780b802da in ???
> >>>>> #1 0x7f8780b7f503 in ???
> >>>>> #2 0x7f877fe73f1f in ???
> >>>>> #3 0x557f1d950f76 in gpu_upload_frc_
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> >>>>> #4 0x557f1d8c8897 in pmemd
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>> #5 0x557f1d8c9bb3 in main
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>> run_direct.sh: line 20: 12691 Floating point exception(core dumped)
> >>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>> *Run 3:*
> >>>>> Note: The following floating-point exceptions are signalling:
> >>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>> *Run 4:*
> >>>>> Note: The following floating-point exceptions are signalling:
> >>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>> *Run 5:*
> >>>>> Note: The following floating-point exceptions are signalling:
> >>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>> *Run 6:*
> >>>>> Program received signal SIGFPE: Floating-point exception - erroneous
> >>>>> arithmetic operation.
> >>>>> Backtrace for this error:
> >>>>> #0 0x7fa7cca442da in ???
> >>>>> #1 0x7fa7cca43503 in ???
> >>>>> #2 0x7fa7cbd37f1f in ???
> >>>>> #3 0x559568d8cf76 in gpu_upload_frc_
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> >>>>> #4 0x559568d04897 in pmemd
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>> #5 0x559568d05bb3 in main
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>> run_direct.sh: line 24: 12708 Floating point exception(core dumped)
> >>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>> *Run 7: *
> >>>>> Note: The following floating-point exceptions are signalling:
> >>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>> *Run 8:*
> >>>>> Program received signal SIGFPE: Floating-point exception - erroneous
> >>>>> arithmetic operation.
> >>>>> Backtrace for this error:
> >>>>> #0 0x7f32888232da in ???
> >>>>> #1 0x7f3288822503 in ???
> >>>>> #2 0x7f3287b16f1f in ???
> >>>>> #3 0x55ae7dec2f13 in gpu_upload_frc_
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
> >>>>> #4 0x55ae7de3a897 in pmemd
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>> #5 0x55ae7de3bbb3 in main
> >>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>> run_direct.sh: line 26: 12717 Floating point exception(core dumped)
> >>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>> *Run 9:*
> >>>>> Note: The following floating-point exceptions are signalling:
> >>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>> *Run 10:*
> >>>>> Note: The following floating-point exceptions are signalling:
> >>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>
> >>>>> Thank you!
> >>>>>
> >>>>> Respectfully,
> >>>>>
> >>>>> Kellon
--
Kellon A. A. Belfon, Graduate Student
Carlos Simmerling Laboratory
The Laufer Center for Physical and Quantitative Biology
The Department of Chemistry, Stony Brook University
Stony Brook, New York 11794
Phone: (347) 546-4237  Email: kellon.belfon.stonybrook.edu
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers