Re: [AMBER-Developers] IEEE_OVERFLOW_FLAG

From: Kellon Belfon <kellonbelfon.gmail.com>
Date: Wed, 7 Nov 2018 13:40:10 -0500

Hi Josh,

Thanks for the response. I don't think it is caused by my system in this
case. We have tested other systems in our lab and all of them give the same
overflow notes occasionally. This also happens during production runs, and
the behavior is random.

The overflow only occurs at initialization, when the force array should be
zero. Printing the values shows that the array is mostly zero, except for a
sporadic large number here and there. It is not always on the same atom, and
doing ten runs in triplicate showed that the value is sometimes zero and
sometimes this large number. I am thinking these might just be garbage
values, i.e. old contents left over from whatever previously used that
memory. Is this normal behavior of the code, or have you seen this issue
before?
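
To make that hypothesis concrete, here is a minimal CUDA/C++ sketch (the
function name and buffer layout are guesses for illustration, not pmemd's
actual code) of how explicitly zero-filling the force buffer before the
first upload would rule out stale heap contents ever reaching the GPU:

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <vector>

    // Hypothetical helper: zero the host staging buffer and the device
    // buffer before the first force upload.
    void upload_initial_forces(double* d_frc, size_t natoms)
    {
        // 3 components per atom, explicitly zero-initialized on the host.
        std::vector<double> h_frc(3 * natoms, 0.0);

        // A freshly malloc'ed (uninitialized) staging buffer could instead
        // carry whatever was previously in that memory, e.g. a stray ~1e29.
        cudaMemcpy(d_frc, h_frc.data(), h_frc.size() * sizeof(double),
                   cudaMemcpyHostToDevice);

        // Belt and braces: clear the device-side buffer as well.
        cudaMemset(d_frc, 0, 3 * natoms * sizeof(double));
    }

    int main()
    {
        const size_t natoms = 1000;
        double* d_frc = nullptr;
        cudaMalloc(&d_frc, 3 * natoms * sizeof(double));
        upload_initial_forces(d_frc, natoms);
        cudaFree(d_frc);
        return 0;
    }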

Thanks,

Kellon


On Wed, Nov 7, 2018 at 3:13 AM Josh Berryman <
the.real.josh.berryman.gmail.com> wrote:

> Hi Kellon, if you are getting forces in the region of 1e29 then your system
> has severe steric clashes in it: the main problem with the Lennard-Jones
> non-bonded interaction is that it diverges rapidly when atoms approach each
> other too closely. The answer in that case is probably to run for 1-10 ps
> with very strong Langevin coupling and a small timestep on the CPU, or to
> use the xmin option to pre-stabilise your system (again on the CPU, where
> the 64-bit floating-point datatype gives you more headroom against
> overflows).
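>
> To see how steep that divergence is, recall the Lennard-Jones form (written
> here in LaTeX just for reference):
>
>     V(r) = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12}
>            - \left(\frac{\sigma}{r}\right)^{6} \right]
>
> so halving an already-close contact distance inflates the repulsive term by
> a factor of 2^12, roughly 4000, which is how a badly clashed structure ends
> up with forces many orders of magnitude above normal.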
>
> Google suggests that (as you say you have seen yourself) IEEE_DENORMAL is
> not a problem: it describes cases where a number comes close enough to zero
> that it could arguably just be rounded to zero anyway.
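>
> As a concrete aside (a standalone C++ snippet, nothing to do with pmemd
> itself): a subnormal float is simply a value closer to zero than FLT_MIN,
> and flushing it to zero loses essentially nothing:
>
>     #include <cfloat>
>     #include <cstdio>
>
>     int main()
>     {
>         float tiny = FLT_MIN;       // smallest normal float, ~1.18e-38
>         float sub  = tiny / 16.0f;  // subnormal: nonzero, reduced precision
>         std::printf("normal min: %g\nsubnormal:  %g\n", tiny, sub);
>         return 0;
>     }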
>
> Josh
>
>
>
>
> On Tue, 6 Nov 2018 at 20:17, Kellon Belfon <kellonbelfon.gmail.com> wrote:
>
> > Hi Everyone,
> >
> > We recently upgraded our compiler (GNU 4.8.4 to 7.3.0) on our cluster and
> > started getting the following note for our GPU calculations in Amber:
> > Note: The following floating-point exceptions are signalling:
> > IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >
> > From this previous post (http://archive.ambermd.org/201804/0130.html), the
> > response was essentially not to worry about them. Does the same apply to
> > IEEE_OVERFLOW_FLAG?
> > Also, running the calculation with pmemd.cuda_DPFP does not produce the
> > underflow note. I was thinking it might come from mixing floats and
> > doubles?
> >
> > We are getting the overflow note for some of our calculations, but it does
> > not affect the results. I also used cuda-gdb, with the -G -g
> > -ffpe-trap=overflow flags, to stop the code where the overflow occurs. I
> > found that the overflow happens in gpu_upload_frc(), during the first
> > upload of the forces as the system is being initialized on the GPU.
> > Further debugging showed that the note appears when the atm_frc array has
> > values that are not zero but instead a large number (atm_frc[i][1] =
> > -1.5739204096161189e+29). I think this large number causes the overflow
> > note, since the calculation fails right after it (the calculation fails
> > because I set the ffpe-trap).
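> >
> > (For reference: the same trap-on-overflow behaviour that the
> > -ffpe-trap=overflow flag gives on the Fortran side can be reproduced from
> > C/C++ on glibc with feenableexcept(); the snippet below is a standalone
> > sketch, not pmemd code.)
> >
> >     // feenableexcept() is a glibc extension; g++ on Linux defines
> >     // _GNU_SOURCE by default, which exposes it in <fenv.h>.
> >     #include <fenv.h>
> >     #include <cstdio>
> >
> >     int main()
> >     {
> >         feenableexcept(FE_OVERFLOW);        // unmask overflow -> SIGFPE
> >
> >         volatile float big  = 3.0e38f;      // near FLT_MAX (~3.4e38)
> >         volatile float boom = big * 10.0f;  // overflows: process traps here
> >         std::printf("%g\n", boom);          // never reached
> >         return 0;
> >     }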
> >
> > I then ran the same calculation ten times (with the ffpe-trap, so the
> > calculation fails if there is an overflow) and got the overflow 4 times
> > out of 10. Then I repeated three more sets of ten runs (2/10, 5/10, 1/10
> > overflows). It seems like an unpredictable note. For really small numbers,
> > multiplying by the forcescale does the trick, but for these large numbers
> > it causes the overflow note, and the behavior is unpredictable. Does anyone
> > have any advice on this? Should we just ignore it, since the results are
> > okay?
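> >
> > (As a rough standalone illustration of why a ~1e29 entry trips the flag
> > while an ordinary force does not, take a fixed-point force scale of about
> > 2^40 purely for illustration; it may not match pmemd's actual SPFP
> > constants.)
> >
> >     #include <cmath>
> >     #include <cstdio>
> >
> >     int main()
> >     {
> >         // ~2^40: an illustrative stand-in for the real force scale.
> >         volatile float scale   = 1.0995116e+12f;
> >         volatile float typical = 250.0f;                   // ordinary force
> >         volatile float garbage = -1.5739204096161189e+29f; // value seen above
> >
> >         float ok   = typical * scale;  // ~2.7e14: well inside float range
> >         float boom = garbage * scale;  // ~-1.7e41 exceeds FLT_MAX: -inf,
> >                                        // IEEE overflow is signalled
> >
> >         std::printf("ok = %g, boom = %g, isinf = %d\n",
> >                     ok, boom, (int)std::isinf(boom));
> >         return 0;
> >     }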
> >
> > Below are the results for one of the trials:
> > *Run 1:*
> > Program received signal SIGFPE: Floating-point exception - erroneous
> > arithmetic operation.
> > Backtrace for this error:
> > #0 0x7fb6597912da in ???
> > #1 0x7fb659790503 in ???
> > #2 0x7fb658a84f1f in ???
> > #3 0x556a6ac73f76 in gpu_upload_frc_
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> > #4 0x556a6abeb897 in pmemd
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> > #5 0x556a6abecbb3 in main
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> > run_direct.sh: line 19: 12686 Floating point exception(core dumped)
> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> > *Run 2: *
> > Program received signal SIGFPE: Floating-point exception - erroneous
> > arithmetic operation.
> > Backtrace for this error:
> > #0 0x7f8780b802da in ???
> > #1 0x7f8780b7f503 in ???
> > #2 0x7f877fe73f1f in ???
> > #3 0x557f1d950f76 in gpu_upload_frc_
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> > #4 0x557f1d8c8897 in pmemd
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> > #5 0x557f1d8c9bb3 in main
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> > run_direct.sh: line 20: 12691 Floating point exception(core dumped)
> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> > *Run 3:*
> > Note: The following floating-point exceptions are signalling:
> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> > *Run 4:*
> > Note: The following floating-point exceptions are signalling:
> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> > *Run 5:*
> > Note: The following floating-point exceptions are signalling:
> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> > *Run 6:*
> > Program received signal SIGFPE: Floating-point exception - erroneous
> > arithmetic operation.
> > Backtrace for this error:
> > #0 0x7fa7cca442da in ???
> > #1 0x7fa7cca43503 in ???
> > #2 0x7fa7cbd37f1f in ???
> > #3 0x559568d8cf76 in gpu_upload_frc_
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> > #4 0x559568d04897 in pmemd
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> > #5 0x559568d05bb3 in main
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> > run_direct.sh: line 24: 12708 Floating point exception(core dumped)
> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> > *Run 7: *
> > Note: The following floating-point exceptions are signalling:
> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> > *Run 8:*
> > Program received signal SIGFPE: Floating-point exception - erroneous
> > arithmetic operation.
> > Backtrace for this error:
> > #0 0x7f32888232da in ???
> > #1 0x7f3288822503 in ???
> > #2 0x7f3287b16f1f in ???
> > #3 0x55ae7dec2f13 in gpu_upload_frc_
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
> > #4 0x55ae7de3a897 in pmemd
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> > #5 0x55ae7de3bbb3 in main
> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> > run_direct.sh: line 26: 12717 Floating point exception(core dumped)
> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> > *Run 9:*
> > Note: The following floating-point exceptions are signalling:
> > IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> > *Run 10:*
> > Note: The following floating-point exceptions are signalling:
> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> >
> > Thank you!
> >
> > Respectfully,
> >
> > Kellon
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers