Re: [AMBER-Developers] IEEE_OVERFLOW_FLAG

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 8 Nov 2018 07:22:35 -0800

I am not sure that's a bug. The atm_frc array needs to be initialized to zero for MPI because an mpi_reduce is done into that array: not every element will be touched by every core, and each contribution is added to the existing element rather than assigned. When running in serial, values are assigned to each atm_frc element rather than added/reduced, and thus it should not be necessary to initialize the array.
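A rough sketch of the distinction (the subroutine and variable names here are
hypothetical, not the actual pmemd source; whether the real code uses
mpi_reduce or mpi_allreduce, the point is the same):

    ! Illustrative sketch only, not the pmemd source.
    subroutine accumulate_forces_mpi(atm_frc, atm_cnt, pair_frc, my_pairs, my_pair_cnt)
      use mpi
      implicit none
      integer, intent(in)             :: atm_cnt, my_pair_cnt
      integer, intent(in)             :: my_pairs(2, my_pair_cnt)
      double precision, intent(in)    :: pair_frc(3, my_pair_cnt)
      double precision, intent(inout) :: atm_frc(3, atm_cnt)
      integer :: n, i, j, ierr

      ! Each rank only *adds* its own contributions, so elements it never
      ! touches must already be zero before the reduction sums all ranks.
      atm_frc(:,:) = 0.d0
      do n = 1, my_pair_cnt
        i = my_pairs(1, n)
        j = my_pairs(2, n)
        atm_frc(:, i) = atm_frc(:, i) + pair_frc(:, n)
        atm_frc(:, j) = atm_frc(:, j) - pair_frc(:, n)
      end do
      call mpi_allreduce(mpi_in_place, atm_frc, 3 * atm_cnt, &
                         mpi_double_precision, mpi_sum, mpi_comm_world, ierr)
    end subroutine accumulate_forces_mpi

In serial, by contrast, the force routines assign every element of atm_frc
outright, so nothing stale should survive without the explicit zeroing.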

I would trace through the code and find out exactly where values are stored into the atm_frc array in serial to determine if the initialization is needed.

All the best
Ross

> On Nov 8, 2018, at 00:16, Josh Berryman <the.real.josh.berryman.gmail.com> wrote:
>
>> I looked at the code and found that atm_frc is only initialized to zero
>> when MPI is called (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
>> parallel.F90:2146: atm_frc(:,:) = 0.d0).
> Well that looks like a classic memory management bug that you should commit
> a fix for.
>
> If you are mailing the developer's list then I guess that means you have
> gitlab access?
>
> Before committing, maybe send an email to whoever else has been committing
> the most to pmemd.F90 recently, but basically it looks as if you have
> sorted this out yourself.
>
> Josh
>
>
>
>
>
> On Wed, 7 Nov 2018 at 23:07, Kellon Belfon <kellonbelfon.gmail.com> wrote:
>
>> Follow up to my previous email.
>>
>> I set the atm_frc array to zero to test whether the large values in the
>> atm_frc array are garbage/old values.
>>
>> To do this, at line 414 in pmemd.F90 I added the following line
>> atm_frc(:,:) = 0.d0
>>
>> Then I ran the calculations ten times in triplicate and did not get the
>> overflow error anymore. I think this suggests that the large values in the
>> atm_frc array might be garbage values, since atm_frc is not initialized to
>> zero.
>> I looked at the code and found that atm_frc is only initialized to zero
>> when MPI is called (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
>> parallel.F90:2146: atm_frc(:,:) = 0.d0).
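>>
>> Schematically, what I did was (the surrounding call is a placeholder here,
>> not the actual pmemd.F90 code around line 414):
>>
>>     ! Placeholder context only, not the real pmemd.F90 source.
>>     atm_frc(:,:) = 0.d0            ! the line I added at pmemd.F90:414,
>>                                    ! before the forces are first uploaded
>>     call gpu_upload_frc(atm_frc)   ! first upload during GPU initialization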
>>
>> Also, I compared the mdout files with and without the atm_frc(:,:) = 0.d0
>> at line 414 in pmemd.F90, and the results are the same for the system I
>> tested. I am not sure if this will break anything else. I can run the Amber
>> standard test cases if that would help.
>>
>> Respectfully,
>>
>> Kellon
>>
>>
>>
>>
>>
>> On Wed, Nov 7, 2018 at 1:40 PM Kellon Belfon <kellonbelfon.gmail.com>
>> wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for the response. I think in this case it is not because of my
>>> system. We have tested other systems in our lab, and all of these
>>> systems give the same overflow notes sometimes. This happens during
>>> production runs too, and the behavior is random.
>>>
>>> The overflow only occurs at initialization, when the force array should
>>> be zero. Displaying the values shows that the array is mostly zero
>>> except for a sporadic large number in between. This is not always on
>>> the same atom, and doing ten runs in triplicate showed the value is
>>> sometimes zero and sometimes this large number. I am thinking this
>>> might just be garbage values, or old values from previous use of that
>>> memory. Is this normal behavior of the code, or have you seen this
>>> issue before?
>>>
>>> Thanks,
>>>
>>> Kellon
>>>
>>>
>>> On Wed, Nov 7, 2018 at 3:13 AM Josh Berryman <
>>> the.real.josh.berryman.gmail.com> wrote:
>>>
>>>> Hi Kellon, if you are getting forces in the region of 1e29 then your
>>>> system has severe steric clashes in it: the main problem with
>>>> Lennard-Jones for non-bonded interactions is that it diverges quickly
>>>> for close approaches of atoms. Probably the answer in that case is to
>>>> run for 1-10 ps with very high Langevin coupling and a small timestep
>>>> on the CPU, or to use the xmin option to pre-stabilise your system
>>>> (again on the CPU, where the 64-bit datatype for floats will give you
>>>> more headroom against overflows).
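>>>>
>>>> Something along those lines for the CPU relaxation run, as a rough sketch
>>>> (the values below are only illustrative and should be tuned per system):
>>>>
>>>>  &cntrl
>>>>    irest=0, ntx=1,            ! read coordinates only
>>>>    nstlim=20000, dt=0.0005,   ! ~10 ps at a 0.5 fs timestep
>>>>    ntt=3, gamma_ln=50.0,      ! Langevin thermostat, strong coupling
>>>>    tempi=10.0, temp0=300.0,
>>>>    ntb=1, cut=9.0,
>>>>    ntpr=500, ntwr=5000, ntwx=0,
>>>>  /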
>>>>
>>>> Google suggests that (as you say you have seen yourself) IEEE_DENORMAL
>>>> is not a problem; it describes cases where a number comes close enough
>>>> to zero that it may as well just be rounded to zero anyway.
>>>>
>>>> Josh
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 6 Nov 2018 at 20:17, Kellon Belfon <kellonbelfon.gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> We recently upgraded our compiler (GNU 4.8.4 to 7.3.0) on our cluster
>>>>> and started getting the following note for our GPU calculations in
>>>>> Amber:
>>>>> Note: The following floating-point exceptions are signalling:
>>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>
>>>>> From this previous post (http://archive.ambermd.org/201804/0130.html),
>>>>> the response was pretty much "do not worry about them." Does this
>>>>> apply to IEEE_OVERFLOW_FLAG as well?
>>>>> Also, running the calculation with pmemd.cuda_DPFP does not produce
>>>>> the underflow note; I was thinking maybe it comes from mixing floats
>>>>> and doubles?
>>>>>
>>>>> We are getting the overflow note for some of our calculations, but it
>>>>> does not affect the results. I also used cuda-gdb, with the -G -g
>>>>> -ffpe-trap=overflow flags, to stop the code where the overflow occurs.
>>>>> I found that the overflow occurs in gpu_upload_frc(), during the first
>>>>> upload of the forces as the system is being initialized on the GPU.
>>>>> Further debugging showed that the note occurs when the atm_frc array
>>>>> has values that are not zero but instead a large number (atm_frc[i][1]
>>>>> = -1.5739204096161189e+29). I think this large number causes the
>>>>> overflow note, since the calculation fails right after (it fails
>>>>> because I set the ffpe-trap).
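>>>>>
>>>>> For reference, the kind of check I used to spot these values, roughly
>>>>> (a sketch with made-up names, not the actual code in pmemd):
>>>>>
>>>>>     ! Sketch only: scan a force array for suspiciously large entries
>>>>>     ! before it is handed to the GPU upload; names are hypothetical.
>>>>>     subroutine check_frc(atm_frc, atm_cnt)
>>>>>       implicit none
>>>>>       integer, intent(in)          :: atm_cnt
>>>>>       double precision, intent(in) :: atm_frc(3, atm_cnt)
>>>>>       integer :: i, j
>>>>>       do i = 1, atm_cnt
>>>>>         do j = 1, 3
>>>>>           if (abs(atm_frc(j, i)) .gt. 1.0d20) then
>>>>>             write(*, '(a,i8,a,i2,a,es23.16)') 'suspect force: atom ', &
>>>>>               i, ' component ', j, ' = ', atm_frc(j, i)
>>>>>           end if
>>>>>         end do
>>>>>       end do
>>>>>     end subroutine check_frc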
>>>>>
>>>>> I then ran the same calculation ten times (with the ffpe-trap, so if
>>>>> there is an overflow the calculation will fail) and got the overflow
>>>>> 4 out of 10 times. Then I repeated another 10 times in triplicate
>>>>> (2/10, 5/10, 1/10 overflow). It seems like an unpredictable note. For
>>>>> really small numbers, multiplying by the forcescale does the trick,
>>>>> but for these large numbers it causes the overflow note and the
>>>>> behavior is unpredictable. Does anyone have any advice on this?
>>>>> Should we just ignore it, since the results are okay?
>>>>>
>>>>> Below are the results for one of the trials:
>>>>> *Run 1:*
>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>> arithmetic operation.
>>>>> Backtrace for this error:
>>>>> #0 0x7fb6597912da in ???
>>>>> #1 0x7fb659790503 in ???
>>>>> #2 0x7fb658a84f1f in ???
>>>>> #3 0x556a6ac73f76 in gpu_upload_frc_
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>>>>> #4 0x556a6abeb897 in pmemd
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>> #5 0x556a6abecbb3 in main
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>> run_direct.sh: line 19: 12686 Floating point exception(core dumped)
>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>> *Run 2: *
>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>> arithmetic operation.
>>>>> Backtrace for this error:
>>>>> #0 0x7f8780b802da in ???
>>>>> #1 0x7f8780b7f503 in ???
>>>>> #2 0x7f877fe73f1f in ???
>>>>> #3 0x557f1d950f76 in gpu_upload_frc_
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>>>>> #4 0x557f1d8c8897 in pmemd
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>> #5 0x557f1d8c9bb3 in main
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>> run_direct.sh: line 20: 12691 Floating point exception(core dumped)
>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>> *Run 3:*
>>>>> Note: The following floating-point exceptions are signalling:
>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>> *Run 4:*
>>>>> Note: The following floating-point exceptions are signalling:
>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>> *Run 5:*
>>>>> Note: The following floating-point exceptions are signalling:
>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>> *Run 6:*
>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>> arithmetic operation.
>>>>> Backtrace for this error:
>>>>> #0 0x7fa7cca442da in ???
>>>>> #1 0x7fa7cca43503 in ???
>>>>> #2 0x7fa7cbd37f1f in ???
>>>>> #3 0x559568d8cf76 in gpu_upload_frc_
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>>>>> #4 0x559568d04897 in pmemd
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>> #5 0x559568d05bb3 in main
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>> run_direct.sh: line 24: 12708 Floating point exception(core dumped)
>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>> *Run 7: *
>>>>> Note: The following floating-point exceptions are signalling:
>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>> *Run 8:*
>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>> arithmetic operation.
>>>>> Backtrace for this error:
>>>>> #0 0x7f32888232da in ???
>>>>> #1 0x7f3288822503 in ???
>>>>> #2 0x7f3287b16f1f in ???
>>>>> #3 0x55ae7dec2f13 in gpu_upload_frc_
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
>>>>> #4 0x55ae7de3a897 in pmemd
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>> #5 0x55ae7de3bbb3 in main
>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>> run_direct.sh: line 26: 12717 Floating point exception(core dumped)
>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>> *Run 9:*
>>>>> Note: The following floating-point exceptions are signalling:
>>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>> *Run 10:*
>>>>> Note: The following floating-point exceptions are signalling:
>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Respectfully,
>>>>>
>>>>> Kellon
>>>
>>
>> --
>> Kellon A. A. Belfon, Graduate Student
>> Carlos Simmerling Laboratory
>> The Laufer Center for Physical and Quantitative Biology
>> The Department of Chemistry, Stony Brook University
>> Stony Brook, New York 11794
>> Phone: (347) 546-4237  Email: kellon.belfon.stonybrook.edu


_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Thu Nov 08 2018 - 07:30:03 PST