Re: [AMBER-Developers] IEEE_OVERFLOW_FLAG

From: Kellon Belfon <kellonbelfon.gmail.com>
Date: Sat, 10 Nov 2018 15:22:29 -0500

Hi Ross,

You raised a very good point about the performance hit if we were initializing
every step. I double-checked, and as far as I can tell, gpu_upload_frc is only
called once during setup. I will go ahead and commit this soon.

Thank you.

Respectfully,

Kellon

On Fri, Nov 9, 2018 at 8:23 PM Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Kellon,
>
> I agree; I don't see any issues with initializing the array during setup.
> There is a performance hit to initializing arrays when one doesn't need to,
> but if it only happens once during startup it should not be an issue. It
> would only be a problem if such initialization occurred every step. The IEEE
> warning is innocuous, but if zeroing the array at startup immediately after
> allocating it (i.e. doing it always, not just when running with MPI) gets
> rid of that warning, then it has my vote.
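>
> Roughly the placement being discussed, as a sketch rather than the actual
> pmemd source:
>
>   program init_once_sketch
>     ! Sketch only: pay the zeroing cost once, right after allocation during
>     ! setup, rather than anywhere inside the per-step loop.
>     implicit none
>     integer, parameter :: natom = 1000     ! stand-in for the real atom count
>     double precision, allocatable :: atm_frc(:,:)
>     allocate(atm_frc(3, natom))            ! done once during setup
>     atm_frc(:,:) = 0.d0                    ! unconditional, not only for MPI
>     print *, 'zeroed', size(atm_frc), 'elements once at startup'
>   end program init_once_sketch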
>
> All the best
> Ross
>
> > On Nov 9, 2018, at 14:28, Kellon Belfon <kellonbelfon.gmail.com> wrote:
> >
> > Thank you, Josh and Ross, for your responses, advice, and suggestions. I
> > definitely appreciate your time and wisdom.
> >
> > This is what I gathered:
> > (1) atm_frc is allocated,
> > (2) values from atm_frc are uploaded to the GPU to populate the pForce
> > array during initial setup,
> > (3) whenever gpu_download_frc is called (mostly for MPI), atm_frc is
> > repopulated with the actual forces from the pForce array.
> > *gpu_download_frc is not called in my runs.
> >
> > But it appears that the garbage values in the atm_frc array get
> > overwritten by the actual forces during the download_frc call, so leaving
> > atm_frc uninitialized will not affect the calculations. Therefore it is not
> > really a bug. I was thinking it might just be a good approach to initialize
> > the array to get rid of the overflow note.
> >
> > I ran the Amber tests after initializing the atm_frc array to zero, and
> > they all passed. Also, the garbage values in the atm_frc array are values
> > that were previously stored in that memory, not values from the current
> > calculations.
> >
> > Respectfully,
> >
> > Kellon
> >
> > On Thu, Nov 8, 2018 at 10:24 AM Ross Walker <ross.rosswalker.co.uk>
> wrote:
> >
> >> I am not sure that's a bug. The atm_frc array needs to be initialized to
> >> zero for MPI because the code does an mpi_reduce into that array: not
> >> every element is touched by every core, and the reduction adds to the
> >> existing element rather than assigning to it. When running in serial,
> >> values are assigned to each atm_frc element rather than added/reduced, and
> >> thus it should not be necessary to initialize the array.
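> >>
> >> A toy example of what I mean (standalone, with made-up names, not the
> >> pmemd reduction code):
> >>
> >>   program reduce_demo
> >>     use mpi
> >>     implicit none
> >>     integer, parameter :: natom = 8
> >>     double precision :: frc_local(3, natom), frc_total(3, natom)
> >>     integer :: rank, nproc, ierr, i, lo, hi
> >>     call mpi_init(ierr)
> >>     call mpi_comm_rank(mpi_comm_world, rank, ierr)
> >>     call mpi_comm_size(mpi_comm_world, nproc, ierr)
> >>     ! Without this, elements this rank never touches hold whatever was in
> >>     ! memory, and the sum-reduction folds that garbage into the total.
> >>     frc_local(:,:) = 0.d0
> >>     lo = rank * natom / nproc + 1        ! this rank's slice of atoms
> >>     hi = (rank + 1) * natom / nproc
> >>     do i = lo, hi
> >>       frc_local(:, i) = 1.d0             ! stand-in for a real force term
> >>     end do
> >>     call mpi_reduce(frc_local, frc_total, 3*natom, mpi_double_precision, &
> >>                     mpi_sum, 0, mpi_comm_world, ierr)
> >>     if (rank == 0) print *, frc_total(1, :)
> >>     call mpi_finalize(ierr)
> >>   end program reduce_demo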
> >>
> >> I would trace through the code and find out exactly where values are
> >> stored into the atm_frc array in serial to determine if the
> >> initialization is needed.
> >>
> >> All the best
> >> Ross
> >>
> >>> On Nov 8, 2018, at 00:16, Josh Berryman <
> >> the.real.josh.berryman.gmail.com> wrote:
> >>>
> >>>>> I looked at the code and found that atm_frc is only initialized to
> >>>>> zero when MPI is called (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
> >>>>> parallel.F90:2146: atm_frc(:,:) = 0.d0).
> >>>
> >>> Well, that looks like a classic memory-management bug that you should
> >>> commit a fix for.
> >>>
> >>> If you are mailing the developers' list, then I guess that means you
> >>> have gitlab access?
> >>>
> >>> Before committing, maybe send an email to whoever else has been
> >>> committing the most to pmemd.F90 recently, but basically it looks as if
> >>> you have sorted this out yourself.
> >>>
> >>> Josh
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, 7 Nov 2018 at 23:07, Kellon Belfon <kellonbelfon.gmail.com>
> >> wrote:
> >>>
> >>>> Follow-up to my previous email.
> >>>>
> >>>> I set the atm_frc array to zero to test whether the large values in the
> >>>> atm_frc array are garbage/old values.
> >>>>
> >>>> To do this, at line 414 in pmemd.F90 I added the following line:
> >>>> atm_frc(:,:) = 0.d0
> >>>>
> >>>> Then I ran the calculations in triplicate (10 runs each) and did not
> >>>> get the overflow error anymore. I think this suggests that the large
> >>>> values in the atm_frc array might be garbage values, since atm_frc is
> >>>> not initialized to zero.
> >>>> I looked at the code and found that atm_frc is only initialized to zero
> >>>> when MPI is called (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
> >>>> parallel.F90:2146: atm_frc(:,:) = 0.d0).
> >>>>
> >>>> Also, I compared the mdout files with and without the atm_frc(:,:) =
> >>>> 0.d0 at line 414 in pmemd.F90, and the results are the same for the
> >>>> system I tested. I am not sure whether this will break anything else. I
> >>>> can run the Amber standard test cases if that would help.
> >>>>
> >>>> Respectfully,
> >>>>
> >>>> Kellon
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Nov 7, 2018 at 1:40 PM Kellon Belfon <kellonbelfon.gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Josh,
> >>>>>
> >>>>> Thanks for the response. I think in this case it is not caused by my
> >>>>> system. We have tested other systems in our lab, and all of them give
> >>>>> the same overflow notes sometimes. This happens during production runs
> >>>>> too, and the behavior is random.
> >>>>>
> >>>>> The overflow only occurs at initialization, when the force array
> >>>>> should be zero. Displaying the values shows that the array is mostly
> >>>>> zero, except for a sporadic large number here and there. It is not
> >>>>> always the same atom, and doing ten runs in triplicate showed that the
> >>>>> value is sometimes zero and sometimes this large number. I am thinking
> >>>>> these might just be garbage values, or old values from previous use of
> >>>>> that memory. Is this normal behavior of the code, or have you seen this
> >>>>> issue before?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Kellon
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 7, 2018 at 3:13 AM Josh Berryman <
> >>>>> the.real.josh.berryman.gmail.com> wrote:
> >>>>>
> >>>>>> Hi Kellon, if you are getting forces in the region of 1e29 then your
> >>>>>> system has severe steric clashes in it: the main problem with
> >>>>>> Lennard-Jones for non-bonded interactions is that it diverges quickly
> >>>>>> for close approach of atoms. Probably the answer in that case is to
> >>>>>> run for 1-10 ps with very high Langevin coupling and a small timestep
> >>>>>> on the CPU, or to use the xmin option to pre-stabilise your system
> >>>>>> (again on the CPU, where the 64-bit datatype for floats will give you
> >>>>>> more headroom against overflows).
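> >>>>>>
> >>>>>> For the short CPU run, something along these lines would do it
> >>>>>> (illustrative mdin settings only, tune for your own system; the
> >>>>>> minimisation route would use imin = 1 instead):
> >>>>>>
> >>>>>>   short CPU relaxation: strong Langevin coupling, small timestep
> >>>>>>    &cntrl
> >>>>>>     imin = 0, nstlim = 5000, dt = 0.0005,
> >>>>>>     ntt = 3, gamma_ln = 50.0, temp0 = 300.0,
> >>>>>>     irest = 0, ntx = 1,
> >>>>>>     ntb = 1, cut = 8.0, ntpr = 100,
> >>>>>>    /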
> >>>>>>
> >>>>>> Google suggests that (as you say you have seen yourself)
> >>>>>> IEEE_DENORMAL is not a problem; it describes cases where a number
> >>>>>> comes close enough to zero that it may as well just be rounded to zero
> >>>>>> anyway.
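> >>>>>>
> >>>>>> A quick way to see one, as a standalone toy with nothing to do with
> >>>>>> pmemd (on x86 with gfortran this typically produces the same note at
> >>>>>> exit):
> >>>>>>
> >>>>>>   program denorm_note
> >>>>>>     implicit none
> >>>>>>     double precision :: x
> >>>>>>     x = tiny(1.d0) / 3.d0   ! result falls below the normal range
> >>>>>>     print *, x, x * 2.d0    ! subnormal value used as an operand, too
> >>>>>>   end program denorm_note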
> >>>>>>
> >>>>>> Josh
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, 6 Nov 2018 at 20:17, Kellon Belfon <kellonbelfon.gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Everyone,
> >>>>>>>
> >>>>>>> We recently upgraded our compiler (gnu 4.8.4 to 7.3.0) on our
> >>>>>>> cluster and started getting the following note for our GPU
> >>>>>>> calculations in Amber:
> >>>>>>> Note: The following floating-point exceptions are signalling:
> >>>>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>>>
> >>>>>>> From this previous post
> >>>>>>> (http://archive.ambermd.org/201804/0130.html), the response was
> >>>>>>> pretty much "do not worry about them." Does this apply to
> >>>>>>> IEEE_OVERFLOW_FLAG as well?
> >>>>>>> Also, running the calculation with pmemd.cuda_DPFP does not produce
> >>>>>>> the underflow note. I was thinking maybe it comes from mixing floats
> >>>>>>> and doubles?
> >>>>>>>
> >>>>>>> We are getting the overflow note for some of our calculations, but
> >>>>>>> it does not affect the results. I also used cuda-gdb with the -G -g
> >>>>>>> -ffpe-trap=overflow flags to stop the code where the overflow occurs.
> >>>>>>> I found that the overflow occurs in gpu_upload_frc(), during the
> >>>>>>> first upload of the forces as the system is being initialized on the
> >>>>>>> GPU. Further debugging showed that the note occurs when the atm_frc
> >>>>>>> array has values that are not zero but instead a large number
> >>>>>>> (atm_frc[i][1] = -1.5739204096161189e+29). I think this large number
> >>>>>>> causes the overflow note, since the calculation fails right after (it
> >>>>>>> fails because I set the ffpe-trap).
> >>>>>>>
> >>>>>>> I then ran the same calculation ten times (with the ffpe-trap, so if
> >>>>>>> there is an overflow the calculation fails) and got the overflow 4
> >>>>>>> out of 10 times. Then I repeated another 10 runs in triplicate (2/10,
> >>>>>>> 5/10, 1/10 overflows). It seems like an unpredictable note. For
> >>>>>>> really small numbers, multiplying by the forcescale does the trick,
> >>>>>>> but for these large numbers it causes the overflow note, and the
> >>>>>>> behavior is unpredictable. Does anyone have any advice on this?
> >>>>>>> Should we just ignore it, since the results are okay?
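> >>>>>>>
> >>>>>>> To make the mechanism concrete, here is a standalone toy (not the
> >>>>>>> pmemd code; the 1.d12 factor is just a stand-in for the real
> >>>>>>> forcescale constant) that trips the same trap:
> >>>>>>>
> >>>>>>>   program overflow_demo
> >>>>>>>     ! compile with: gfortran -ffpe-trap=overflow overflow_demo.f90
> >>>>>>>     implicit none
> >>>>>>>     double precision :: garbage_frc
> >>>>>>>     real :: narrowed
> >>>>>>>     garbage_frc = -1.5739204096161189d+29  ! stale value seen above
> >>>>>>>     ! scaling then narrowing to single precision exceeds huge(0.0)
> >>>>>>>     narrowed = real(garbage_frc * 1.d12)
> >>>>>>>     print *, narrowed
> >>>>>>>   end program overflow_demo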
> >>>>>>>
> >>>>>>> Below are the results for one of the trials:
> >>>>>>> *Run 1:*
> >>>>>>> Program received signal SIGFPE: Floating-point exception -
> erroneous
> >>>>>>> arithmetic operation.
> >>>>>>> Backtrace for this error:
> >>>>>>> #0 0x7fb6597912da in ???
> >>>>>>> #1 0x7fb659790503 in ???
> >>>>>>> #2 0x7fb658a84f1f in ???
> >>>>>>> #3 0x556a6ac73f76 in gpu_upload_frc_
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> >>>>>>> #4 0x556a6abeb897 in pmemd
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>>>> #5 0x556a6abecbb3 in main
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>>>> run_direct.sh: line 19: 12686 Floating point exception(core dumped)
> >>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>>>> *Run 2: *
> >>>>>>> Program received signal SIGFPE: Floating-point exception -
> erroneous
> >>>>>>> arithmetic operation.
> >>>>>>> Backtrace for this error:
> >>>>>>> #0 0x7f8780b802da in ???
> >>>>>>> #1 0x7f8780b7f503 in ???
> >>>>>>> #2 0x7f877fe73f1f in ???
> >>>>>>> #3 0x557f1d950f76 in gpu_upload_frc_
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> >>>>>>> #4 0x557f1d8c8897 in pmemd
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>>>> #5 0x557f1d8c9bb3 in main
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>>>> run_direct.sh: line 20: 12691 Floating point exception(core dumped)
> >>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>>>> *Run 3:*
> >>>>>>> Note: The following floating-point exceptions are signalling:
> >>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>>> *Run 4:*
> >>>>>>> Note: The following floating-point exceptions are signalling:
> >>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>>> *Run 5:*
> >>>>>>> Note: The following floating-point exceptions are signalling:
> >>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>>> *Run 6:*
> >>>>>>> Program received signal SIGFPE: Floating-point exception -
> erroneous
> >>>>>>> arithmetic operation.
> >>>>>>> Backtrace for this error:
> >>>>>>> #0 0x7fa7cca442da in ???
> >>>>>>> #1 0x7fa7cca43503 in ???
> >>>>>>> #2 0x7fa7cbd37f1f in ???
> >>>>>>> #3 0x559568d8cf76 in gpu_upload_frc_
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> >>>>>>> #4 0x559568d04897 in pmemd
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>>>> #5 0x559568d05bb3 in main
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>>>> run_direct.sh: line 24: 12708 Floating point exception(core dumped)
> >>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>>>> *Run 7: *
> >>>>>>> Note: The following floating-point exceptions are signalling:
> >>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>>> *Run 8:*
> >>>>>>> Program received signal SIGFPE: Floating-point exception -
> erroneous
> >>>>>>> arithmetic operation.
> >>>>>>> Backtrace for this error:
> >>>>>>> #0 0x7f32888232da in ???
> >>>>>>> #1 0x7f3288822503 in ???
> >>>>>>> #2 0x7f3287b16f1f in ???
> >>>>>>> #3 0x55ae7dec2f13 in gpu_upload_frc_
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
> >>>>>>> #4 0x55ae7de3a897 in pmemd
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> >>>>>>> #5 0x55ae7de3bbb3 in main
> >>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> >>>>>>> run_direct.sh: line 26: 12717 Floating point exception(core dumped)
> >>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> >>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> >>>>>>> *Run 9:*
> >>>>>>> Note: The following floating-point exceptions are signalling:
> >>>>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>>> *Run 10:*
> >>>>>>> Note: The following floating-point exceptions are signalling:
> >>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >>>>>>>
> >>>>>>> Thank you!
> >>>>>>>
> >>>>>>> Respectfully,
> >>>>>>>
> >>>>>>> Kellon
> >>
> >
> >
>
>
>


-- 
Kellon A. A. Belfon, Graduate Student
Carlos Simmerling Laboratory
The Laufer Center for Physical and Quantitative Biology
The Department of Chemistry, Stony Brook University
Stony Brook, New York 11794
Phone: (347) 546-4237  Email: kellon.belfon.stonybrook.edu
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sat Nov 10 2018 - 12:30:03 PST