[AMBER-Developers] IEEE_OVERFLOW_FLAG

From: Kellon Belfon <kellonbelfon.gmail.com>
Date: Tue, 6 Nov 2018 14:16:43 -0500

Hi Everyone,

We recently upgraded the compiler on our cluster (GNU 4.8.4 to 7.3.0) and
started getting the following note for our GPU calculations in Amber:
Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

From this previous post (http://archive.ambermd.org/201804/0130.html), the
response was essentially not to worry about them. Does this apply to
IEEE_OVERFLOW_FLAG as well?
Also, running the calculation with pmemd.cuda_DPFP does not produce the
underflow note. I was thinking it may come from mixing floats and doubles?

We are getting the overflow note for some of our calculations, but it does
not affect the results. I also ran under cuda-gdb, with the -G -g
-ffpe-trap=overflow flags, to stop the code where the overflow occurs. I
found that the overflow occurs in gpu_upload_frc(), during the first upload
of the forces as the system is being initialized on the GPU.
Further debugging showed that the note occurs when the atm_frc array holds
values that are not zero but instead very large (atm_frc[i][1] =
-1.5739204096161189e+29). I think this large number causes the overflow
note, since the calculation fails right after (it fails because I set the
ffpe-trap).
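
To illustrate why a value of this size is a problem, here is a minimal C++
sketch. It assumes the upload multiplies each force component by a force
scale and stores the result either in a signed 64-bit fixed-point
accumulator or in single precision. The scale of 2^40 and all names below
(kForceScale, fits_int64, fits_float, to_fixed) are assumptions for
illustration, not taken from gpu.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical fixed-point scale (2^40); the value used by pmemd.cuda may differ.
constexpr double kForceScale = 1099511627776.0;

// True if f * scale can be stored in a signed 64-bit fixed-point accumulator.
bool fits_int64(double f, double scale = kForceScale) {
    const double scaled = f * scale;
    // 2^63 is exactly representable as a double; the valid range is [-2^63, 2^63).
    constexpr double kLimit = 9223372036854775808.0;
    return std::isfinite(scaled) && scaled >= -kLimit && scaled < kLimit;
}

// True if f * scale stays within the finite range of single precision.
bool fits_float(double f, double scale = kForceScale) {
    const double scaled = f * scale;
    return std::isfinite(scaled) &&
           std::fabs(scaled) <= static_cast<double>(std::numeric_limits<float>::max());
}

// Scale and convert, refusing values that would not fit (truncates toward zero).
std::int64_t to_fixed(double f, double scale = kForceScale) {
    assert(fits_int64(f, scale) && "scaled force is out of range");
    return static_cast<std::int64_t>(f * scale);
}
```

Under these assumptions, -1.5739204096161189e+29 times 2^40 is roughly
-1.7e41, which is outside both the 64-bit integer range and the finite range
of a float, while a force of ordinary magnitude passes both checks.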

I then ran the same calculation ten times (with the ffpe-trap set, so the
calculation fails if there is an overflow) and got the overflow in 4 out of
10 runs. I then repeated the set of 10 runs three more times (2/10, 5/10 and
1/10 overflowed). The note seems to appear unpredictably. For really small
numbers, multiplying by the forcescale does the trick, but for these large
numbers it causes the overflow note and the behavior is unpredictable. Does
anyone have any advice on this? Should we just ignore it, since the results
are okay?
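
One way to narrow this down without relying on the trap would be to scan the
host force array just before the upload and report any entries that are
non-finite or implausibly large. The following C++ sketch shows the idea. The
container type, the Suspect struct, the function name and the bound of 1.0e6
are all hypothetical and would need adapting to the actual atm_frc layout and
units.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical plausibility bound; choose one suited to the system and units.
constexpr double kMaxPlausibleForce = 1.0e6;

// One offending entry: atom index, component (0, 1 or 2) and the stored value.
struct Suspect {
    std::size_t atom;
    int component;
    double value;
};

// Return every force component that is non-finite or larger in magnitude than bound.
std::vector<Suspect> find_suspect_forces(
        const std::vector<std::array<double, 3>>& frc,
        double bound = kMaxPlausibleForce) {
    assert(bound > 0.0);
    std::vector<Suspect> out;
    for (std::size_t i = 0; i < frc.size(); ++i) {
        for (int c = 0; c < 3; ++c) {
            const double v = frc[i][c];
            if (!std::isfinite(v) || std::fabs(v) > bound) {
                out.push_back({i, c, v});
            }
        }
    }
    return out;
}
```

If the list of suspects changes from run to run with identical input, that
would be consistent with the 4/10, 2/10, 5/10 and 1/10 pattern above, but it
would not by itself identify where the values come from.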

Below are the results for one of the trials:
*Run 1:*
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
#0 0x7fb6597912da in ???
#1 0x7fb659790503 in ???
#2 0x7fb658a84f1f in ???
#3 0x556a6ac73f76 in gpu_upload_frc_
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
#4 0x556a6abeb897 in pmemd
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
#5 0x556a6abecbb3 in main
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
run_direct.sh: line 19: 12686 Floating point exception(core dumped)
${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
*Run 2:*
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
#0 0x7f8780b802da in ???
#1 0x7f8780b7f503 in ???
#2 0x7f877fe73f1f in ???
#3 0x557f1d950f76 in gpu_upload_frc_
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
#4 0x557f1d8c8897 in pmemd
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
#5 0x557f1d8c9bb3 in main
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
run_direct.sh: line 20: 12691 Floating point exception(core dumped)
${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
*Run 3:*
Note: The following floating-point exceptions are signalling:
IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
*Run 4:*
Note: The following floating-point exceptions are signalling:
IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
*Run 5:*
Note: The following floating-point exceptions are signalling:
IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
*Run 6:*
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
#0 0x7fa7cca442da in ???
#1 0x7fa7cca43503 in ???
#2 0x7fa7cbd37f1f in ???
#3 0x559568d8cf76 in gpu_upload_frc_
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
#4 0x559568d04897 in pmemd
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
#5 0x559568d05bb3 in main
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
run_direct.sh: line 24: 12708 Floating point exception(core dumped)
${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
*Run 7:*
Note: The following floating-point exceptions are signalling:
IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
*Run 8:*
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
#0 0x7f32888232da in ???
#1 0x7f3288822503 in ???
#2 0x7f3287b16f1f in ???
#3 0x55ae7dec2f13 in gpu_upload_frc_
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
#4 0x55ae7de3a897 in pmemd
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
#5 0x55ae7de3bbb3 in main
    at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
run_direct.sh: line 26: 12717 Floating point exception(core dumped)
${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
*Run 9:*
Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
*Run 10:*
Note: The following floating-point exceptions are signalling:
IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

Thank you!

Respectfully,

Kellon
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Tue Nov 06 2018 - 11:30:03 PST