Re: [AMBER-Developers] IEEE_OVERFLOW_FLAG

From: Kellon Belfon <kellonbelfon.gmail.com>
Date: Wed, 7 Nov 2018 17:07:06 -0500

Following up on my previous email.

I set the atm_frc array to zero to test whether the large values in the
atm_frc array are garbage/stale values.

To do this, I added the following line at line 414 in pmemd.F90:
atm_frc(:,:) = 0.d0

Then I ran the calculation in three sets of ten runs and no longer got the
overflow error. I think this suggests that the large values in the
atm_frc array might indeed be garbage values, since atm_frc is not
initialized to zero.
Looking at the code, I found that atm_frc is only initialized to zero
in the MPI code paths (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
parallel.F90:2146: atm_frc(:,:) = 0.d0).
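
For what it's worth, here is a minimal standalone sketch (not pmemd code) of
what I think is happening: an allocatable array is undefined after allocate()
until it is assigned, so it can hold whatever a previous owner left in that
memory.

program frc_init_sketch
  implicit none
  double precision, allocatable :: frc(:,:)
  allocate(frc(3, 1000))
  ! frc is undefined here; printing may show zeros or garbage, and the
  ! result can differ from run to run.
  print *, 'before: maxval(abs(frc)) = ', maxval(abs(frc))
  frc(:,:) = 0.d0   ! explicit zeroing, as added at pmemd.F90 line 414
  print *, 'after : maxval(abs(frc)) = ', maxval(abs(frc))
end program frc_init_sketch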

Also, I compared the mdout files with and without the atm_frc(:,:) = 0.d0
at line 414 in pmemd.F90, and the results are identical for the system I
tested. I am not sure whether the change breaks anything else; I can run
the Amber standard test suite if that would help.

Respectfully,

Kellon

On Wed, Nov 7, 2018 at 1:40 PM Kellon Belfon <kellonbelfon.gmail.com> wrote:

> Hi Josh,
>
> Thanks for the response. I think in this case it is not because of my
> system. We have tested other systems in our lab, and all of them sometimes
> give the same overflow notes. This happens during production runs too, and
> the behavior is random.
>
> The overflow only occurs at initialization, when the force array should be
> zero. Displaying the values shows that the array is mostly zero except for a
> sporadic large number here and there. It is not always on the same atom, and
> three sets of ten runs showed the value is sometimes zero and sometimes this
> large number. I am thinking these might just be garbage or stale values left
> over from a previous use of that memory. Is this normal behavior of the
> code, or have you ever seen this issue before?
>
> Thanks,
>
> Kellon
>
>
> On Wed, Nov 7, 2018 at 3:13 AM Josh Berryman <the.real.josh.berryman.gmail.com> wrote:
>
>> Hi Kellon, if you are getting forces in the region of 1e29 then your system
>> has severe steric clashes in it: the main problem with Lennard-Jones for
>> non-bonded interactions is that it diverges quickly as atoms approach each
>> other closely. The answer in that case is probably to run for 1-10 ps with
>> very high Langevin coupling and a small timestep on the CPU, or to use the
>> xmin option to pre-stabilise your system (again on the CPU, where the 64-bit
>> datatype for floats gives you more headroom against overflows).
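>>
>> As a starting point, something along these lines (a sketch only, with
>> illustrative values; adjust temp0, box and cutoff settings for your
>> system):
>>
>> short heavily-damped CPU equilibration (illustrative values)
>>  &cntrl
>>    ntt = 3, gamma_ln = 50.0,    ! Langevin, very high collision frequency (ps^-1)
>>    dt = 0.0005, nstlim = 2000,  ! 0.5 fs timestep, ~1 ps total
>>    temp0 = 300.0,
>>    ntb = 1, cut = 8.0,
>>  /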
>>
>> Google suggests (as you say you have seen yourself) that IEEE_DENORMAL is
>> not a problem: it describes cases where a number comes close enough to zero
>> that it could arguably just be rounded to zero anyway.
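>>
>> You can reproduce that note with a few lines (a sketch; compile with
>> gfortran at -O0 so the operation is not folded away at compile time):
>> multiplying the smallest normal single-precision number by a half gives
>> a denormal, and at exit the program should print the familiar
>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL note.
>>
>> program denormal_sketch
>>   implicit none
>>   real :: s, half
>>   half = 0.5
>>   s = tiny(1.0) * half   ! below tiny(1.0) ~ 1.18e-38: a denormal
>>   print *, 's = ', s
>> end program denormal_sketch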
>>
>> Josh
>>
>>
>>
>>
>> On Tue, 6 Nov 2018 at 20:17, Kellon Belfon <kellonbelfon.gmail.com>
>> wrote:
>>
>> > Hi Everyone,
>> >
>> > We recently upgraded the compiler on our cluster (GNU 4.8.4 to 7.3.0) and
>> > started getting the following note for our GPU calculations in Amber:
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> >
>> > From this previous post (http://archive.ambermd.org/201804/0130.html),
>> > the response was essentially not to worry about them. Does this apply to
>> > IEEE_OVERFLOW_FLAG as well?
>> > Also, running the calculation with pmemd.cuda_DPFP does not produce the
>> > underflow note. I was thinking maybe it comes from mixing floats and
>> > doubles?
>> >
>> > We are getting the overflow note for some of our calculations, but it does
>> > not affect the results. I also used cuda-gdb, with the -G -g
>> > -ffpe-trap=overflow flags, to stop the code where the overflow occurs. I
>> > found that the overflow occurs in gpu_upload_frc(), during the first upload
>> > of the forces as the system is being initialized on the GPU.
>> > Further debugging showed that the note occurs when the atm_frc array has
>> > values that are not zero but instead a large number (atm_frc[i][1] =
>> > -1.5739204096161189e+29). I think this large number causes the overflow
>> > note, since the calculation fails right after (it fails because I set the
>> > ffpe-trap).
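>> >
>> > To make the suspected mechanism concrete, here is a minimal sketch (not
>> > pmemd code; the 2**40 factor is only an illustrative stand-in for the
>> > SPFP fixed-point force scaling): -1.57e+29 itself fits in single
>> > precision, but once scaled it exceeds huge(0.0) ~ 3.4e+38, so the
>> > conversion overflows to -Infinity and IEEE_OVERFLOW signals. Compile at
>> > -O0 so the conversion happens at run time; with -ffpe-trap=overflow the
>> > same line aborts with SIGFPE instead of just setting the flag.
>> >
>> > program overflow_sketch
>> >   use, intrinsic :: ieee_arithmetic
>> >   implicit none
>> >   double precision :: frc
>> >   real :: s
>> >   logical :: flag
>> >   frc = -1.5739204096161189d+29 * 2.d0**40   ! ~ -1.7e+41 in double
>> >   s = real(frc)                              ! overflows single precision
>> >   call ieee_get_flag(ieee_overflow, flag)
>> >   print *, 's = ', s, '  ieee_overflow = ', flag
>> > end program overflow_sketch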
>> >
>> > I then ran the same calculation ten times (with the ffpe-trap, so any
>> > overflow makes the calculation fail) and got the overflow 4 times out of
>> > 10. I then repeated another three sets of ten runs (2/10, 5/10, 1/10
>> > overflows). It seems like an unpredictable note. For really small numbers,
>> > multiplying by the forcescale does the trick, but for these large numbers
>> > it causes the overflow note and the behavior is unpredictable. Does anyone
>> > have any advice on this? Should we just ignore it, since the results are
>> > okay?
>> >
>> > Below are the results for one of the trials:
>> > *Run 1:*
>> > Program received signal SIGFPE: Floating-point exception - erroneous
>> > arithmetic operation.
>> > Backtrace for this error:
>> > #0 0x7fb6597912da in ???
>> > #1 0x7fb659790503 in ???
>> > #2 0x7fb658a84f1f in ???
>> > #3 0x556a6ac73f76 in gpu_upload_frc_
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>> > #4 0x556a6abeb897 in pmemd
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>> > #5 0x556a6abecbb3 in main
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>> > run_direct.sh: line 19: 12686 Floating point exception(core dumped)
>> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>> > *Run 2: *
>> > Program received signal SIGFPE: Floating-point exception - erroneous
>> > arithmetic operation.
>> > Backtrace for this error:
>> > #0 0x7f8780b802da in ???
>> > #1 0x7f8780b7f503 in ???
>> > #2 0x7f877fe73f1f in ???
>> > #3 0x557f1d950f76 in gpu_upload_frc_
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>> > #4 0x557f1d8c8897 in pmemd
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>> > #5 0x557f1d8c9bb3 in main
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>> > run_direct.sh: line 20: 12691 Floating point exception(core dumped)
>> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>> > *Run 3:*
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> > *Run 4:*
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> > *Run 5:*
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> > *Run 6:*
>> > Program received signal SIGFPE: Floating-point exception - erroneous
>> > arithmetic operation.
>> > Backtrace for this error:
>> > #0 0x7fa7cca442da in ???
>> > #1 0x7fa7cca43503 in ???
>> > #2 0x7fa7cbd37f1f in ???
>> > #3 0x559568d8cf76 in gpu_upload_frc_
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>> > #4 0x559568d04897 in pmemd
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>> > #5 0x559568d05bb3 in main
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>> > run_direct.sh: line 24: 12708 Floating point exception(core dumped)
>> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>> > *Run 7: *
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> > *Run 8:*
>> > Program received signal SIGFPE: Floating-point exception - erroneous
>> > arithmetic operation.
>> > Backtrace for this error:
>> > #0 0x7f32888232da in ???
>> > #1 0x7f3288822503 in ???
>> > #2 0x7f3287b16f1f in ???
>> > #3 0x55ae7dec2f13 in gpu_upload_frc_
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
>> > #4 0x55ae7de3a897 in pmemd
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>> > #5 0x55ae7de3bbb3 in main
>> > at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>> > run_direct.sh: line 26: 12717 Floating point exception(core dumped)
>> > ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>> > ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>> > *Run 9:*
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> > *Run 10:*
>> > Note: The following floating-point exceptions are signalling:
>> > IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>> >
>> > Thank you!
>> >
>> > Respectfully,
>> >
>> > Kellon

-- 
Kellon A. A. Belfon, Graduate Student
Carlos Simmerling Laboratory
The Laufer Center for Physical and Quantitative Biology
The Department of Chemistry, Stony Brook University
Stony Brook, New York 11794
Phone: (347) 546-4237  Email: kellon.belfon.stonybrook.edu
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Nov 07 2018 - 14:30:03 PST