Re: [AMBER-Developers] IEEE_OVERFLOW_FLAG

From: Josh Berryman <the.real.josh.berryman.gmail.com>
Date: Wed, 7 Nov 2018 09:12:56 +0100

Hi Kellon, if you are getting forces in the region of 1e29 then your system
has severe steric clashes in it: the main problem with Lennard-Jones for
non-bonded interactions is that it diverges quickly as atoms approach each
other closely. Probably the answer in that case is to run for 1-10 ps with
very high Langevin coupling and a small timestep on the CPU, or to use the
xmin option to pre-stabilise your system (again on the CPU, where the 64-bit
floating-point datatype will give you more headroom against overflows).

Google suggests that (as you say you have seen yourself) IEEE_DENORMAL is
not a problem: it flags cases where a number comes so close to zero that it
can only be stored with reduced precision (a denormal, or subnormal, number)
and maybe should just be rounded to zero anyway.
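
As a small illustration of what a denormal is, the sketch below uses Python
(whose floats are 64-bit doubles) to build a value below the smallest normal
double and to show that halving the smallest representable value gives
exactly zero. The helper is_subnormal is a name I made up for this example.

```python
import sys

smallest_normal = sys.float_info.min      # smallest normal double, about 2.2e-308
subnormal = smallest_normal / 2.0 ** 20   # below the normal range, so stored as a subnormal
smallest_subnormal = 5e-324               # smallest positive double

def is_subnormal(x):
    """True if x is nonzero but smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min

print(is_subnormal(smallest_normal))   # a normal number
print(is_subnormal(subnormal))         # a subnormal number
print(smallest_subnormal / 2.0)        # nothing smaller is representable
```

Values in this range are tiny compared with anything physically meaningful
in a force calculation, which is why the flag can usually be ignored.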

Josh




On Tue, 6 Nov 2018 at 20:17, Kellon Belfon <kellonbelfon.gmail.com> wrote:

> Hi Everyone,
>
> We recently upgraded our compiler (gnu 4.8.4 to 7.3.0) on our cluster and
> started getting the following note for our gpu calculations in Amber.
> Note: The following floating-point exceptions are signalling:
> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>
> From this previous post (http://archive.ambermd.org/201804/0130.html), the
> response was pretty much do not worry about them. Does this apply to
> IEEE_OVERFLOW_FLAG as well?
> Also running the calculation with pmemd.cuda_DPFP does not produce the
> underflow note. I was thinking maybe it is from mixing floats and doubles?
>
> We are getting the overflow note for some of our calculations, but it does
> not affect the results. I also used cuda-gdb using -G -g
> -ffpe-trap=overflow flags, to stop the code where the overflow occurs. I
> found that the overflow occurs in gpu_upload_frc(), during the first upload
> of the forces as the system is being initialized on the GPU.
> Further debugging showed the note occurs when the atm_frc array has values
> that are not zero but instead a large number (atm_frc[i][1] =
> -1.5739204096161189e+29). I think this large number causes the overflow
> note since the calculation fails right after (The calculation fails because
> I set the ffpe-trap).
>
> I then ran the same calculation ten times (with the ffpe-trap set, so the
> calculation fails if there is an overflow) and got the overflow in 4 out
> of 10 runs. I then repeated this in triplicate, ten runs each (2/10, 5/10
> and 1/10 overflows). It seems to be an unpredictable note. For really small
> numbers, multiplying by the forcescale does the trick, but for these large
> numbers it is causing the overflow note and the behavior is unpredictable.
> Does anyone have any advice on this? Should we just ignore it, since the
> results are okay?
>
> Below are the results for one of the trials:
> *Run 1:*
> Program received signal SIGFPE: Floating-point exception - erroneous
> arithmetic operation.
> Backtrace for this error:
> #0 0x7fb6597912da in ???
> #1 0x7fb659790503 in ???
> #2 0x7fb658a84f1f in ???
> #3 0x556a6ac73f76 in gpu_upload_frc_
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> #4 0x556a6abeb897 in pmemd
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> #5 0x556a6abecbb3 in main
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> run_direct.sh: line 19: 12686 Floating point exception(core dumped)
> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> *Run 2:*
> Program received signal SIGFPE: Floating-point exception - erroneous
> arithmetic operation.
> Backtrace for this error:
> #0 0x7f8780b802da in ???
> #1 0x7f8780b7f503 in ???
> #2 0x7f877fe73f1f in ???
> #3 0x557f1d950f76 in gpu_upload_frc_
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> #4 0x557f1d8c8897 in pmemd
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> #5 0x557f1d8c9bb3 in main
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> run_direct.sh: line 20: 12691 Floating point exception(core dumped)
> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> *Run 3:*
> Note: The following floating-point exceptions are signalling:
> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> *Run 4:*
> Note: The following floating-point exceptions are signalling:
> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> *Run 5:*
> Note: The following floating-point exceptions are signalling:
> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> *Run 6:*
> Program received signal SIGFPE: Floating-point exception - erroneous
> arithmetic operation.
> Backtrace for this error:
> #0 0x7fa7cca442da in ???
> #1 0x7fa7cca43503 in ???
> #2 0x7fa7cbd37f1f in ???
> #3 0x559568d8cf76 in gpu_upload_frc_
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
> #4 0x559568d04897 in pmemd
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> #5 0x559568d05bb3 in main
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> run_direct.sh: line 24: 12708 Floating point exception(core dumped)
> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> *Run 7:*
> Note: The following floating-point exceptions are signalling:
> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> *Run 8:*
> Program received signal SIGFPE: Floating-point exception - erroneous
> arithmetic operation.
> Backtrace for this error:
> #0 0x7f32888232da in ???
> #1 0x7f3288822503 in ???
> #2 0x7f3287b16f1f in ???
> #3 0x55ae7dec2f13 in gpu_upload_frc_
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
> #4 0x55ae7de3a897 in pmemd
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
> #5 0x55ae7de3bbb3 in main
> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
> run_direct.sh: line 26: 12717 Floating point exception(core dumped)
> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
> *Run 9:*
> Note: The following floating-point exceptions are signalling:
> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> *Run 10:*
> Note: The following floating-point exceptions are signalling:
> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>
> Thank you!
>
> Respectfully,
>
> Kellon
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
Received on Wed Nov 07 2018 - 00:30:02 PST