Re: [AMBER-Developers] IEEE_OVERFLOW_FLAG

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 9 Nov 2018 20:22:53 -0500

Hi Kellon,

I agree, I don't see any issues with initializing the array during setup. There is a performance hit to initializing arrays unnecessarily, but if it only happens once at startup it should not be an issue; it would only be a bad idea if the initialization happened every step. The IEEE warning is innocuous, but if zeroing the array immediately after allocating it (i.e. doing it always, not just when running with MPI) gets rid of that warning, then it has my vote.
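
For illustration, that one-time zeroing right after allocation might look like
the minimal sketch below (array shape and size are illustrative; this is not
the actual pmemd source):

  program init_frc_sketch
    implicit none
    double precision, allocatable :: atm_frc(:,:)
    integer :: natom, ierr

    natom = 1000                            ! illustrative system size
    allocate(atm_frc(3, natom), stat=ierr)
    if (ierr /= 0) stop 'failed to allocate atm_frc'
    ! zero once, immediately after allocation, so the array never holds
    ! stale memory; a one-time startup cost, not a per-step one
    atm_frc(:,:) = 0.d0
  end program init_frc_sketch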

All the best
Ross

> On Nov 9, 2018, at 14:28, Kellon Belfon <kellonbelfon.gmail.com> wrote:
>
> Thank you Josh and Ross for your responses, advice, and suggestions. I
> definitely appreciate your time and wisdom.
>
> This is what I gathered:
> (1) atm_frc is allocated,
> (2) values from atm_frc are uploaded to the GPU to populate the pForce array
> during initial setup,
> (3) whenever gpu_download_frc is called (mostly for MPI), atm_frc is
> repopulated with the actual forces from the pForce array.
> *gpu_download_frc is not called in my runs
>
> But it appears that the garbage values in the atm_frc array get overwritten
> by the actual forces during the gpu_download_frc call, so not initializing
> atm_frc will not affect the calculations. Therefore it is not really a bug.
> I was still thinking it might be a good idea to initialize the array to get
> rid of the overflow note.
>
> I ran the amber tests after initializing the atm_frc array to zero and they
> all passed. Also, the garbage values in the atm_frc array are values that
> were previously stored at those memory locations, not values from the
> current calculation.
>
> Respectfully,
>
> Kellon
>
> On Thu, Nov 8, 2018 at 10:24 AM Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> I am not sure that's a bug. The atm_frc array needs to be initialized to
>> zero for MPI because the code does an mpi_reduce into that array: not every
>> element is touched by every rank, and the reduction adds to the existing
>> element rather than assigning it. When running in serial, values are
>> assigned to each atm_frc element rather than added/reduced, so it should
>> not be necessary to initialize the array.
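>>
>> For illustration only, a minimal sketch of that kind of distributed force
>> sum (made-up sizes, not the actual pmemd code): each rank zeroes its copy,
>> fills in only the atoms it handles, and the elementwise sum across ranks is
>> correct precisely because the untouched elements start at zero. Whether the
>> real code uses mpi_reduce or an in-place allreduce, the zeroing requirement
>> is the same.
>>
>>   program frc_reduce_sketch
>>     use mpi
>>     implicit none
>>     integer, parameter :: natom = 8          ! illustrative
>>     double precision   :: atm_frc(3, natom)
>>     integer :: ierr, rank, nproc, i
>>
>>     call mpi_init(ierr)
>>     call mpi_comm_rank(mpi_comm_world, rank, ierr)
>>     call mpi_comm_size(mpi_comm_world, nproc, ierr)
>>
>>     atm_frc(:,:) = 0.d0                  ! atoms owned by other ranks stay zero
>>     do i = rank + 1, natom, nproc        ! this rank's share of the atoms
>>       atm_frc(:, i) = 1.d0               ! stand-in for a computed force
>>     end do
>>
>>     ! sum the per-rank partial force arrays elementwise, in place
>>     call mpi_allreduce(MPI_IN_PLACE, atm_frc, 3 * natom, &
>>                        mpi_double_precision, mpi_sum, mpi_comm_world, ierr)
>>
>>     call mpi_finalize(ierr)
>>   end program frc_reduce_sketch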
>>
>> I would trace through the code and find out exactly where values are
>> stored into the atm_frc array in serial to determine if the initialization
>> is needed.
>>
>> All the best
>> Ross
>>
>>> On Nov 8, 2018, at 00:16, Josh Berryman <the.real.josh.berryman.gmail.com> wrote:
>>>
>>>>> I looked at the code and found that atm_frc is only initialized to zero
>>>>> when MPI is called (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
>>>>> parallel.F90:2146: atm_frc(:,:) = 0.d0).
>>> Well, that looks like a classic memory management bug that you should
>>> commit a fix for.
>>>
>>> If you are mailing the developer's list then I guess that means you have
>>> gitlab access?
>>>
>>> Before committing, maybe send an email to whoever else has been committing
>>> the most to pmemd.F90 recently, but basically it looks as if you have
>>> sorted this out yourself.
>>>
>>> Josh
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 7 Nov 2018 at 23:07, Kellon Belfon <kellonbelfon.gmail.com> wrote:
>>>
>>>> Follow-up to my previous email.
>>>>
>>>> I set the atm_frc array to zero to test whether the large values in the
>>>> atm_frc array are garbage/old values.
>>>>
>>>> To do this, at line 414 in pmemd.F90 I added the following line
>>>> atm_frc(:,:) = 0.d0
>>>>
>>>> Then I ran the calculations in triplicate of 10 runs each and did not get
>>>> the overflow error anymore. I think this suggests that the large values in
>>>> the atm_frc array might be garbage values, since atm_frc is not
>>>> initialized to zero.
>>>> I looked at the code and found that atm_frc is only initialized to zero
>>>> when MPI is called (inpcrd_dat.F90:481: atm_frc(:,:) = 0.d0 and
>>>> parallel.F90:2146: atm_frc(:,:) = 0.d0).
>>>>
>>>> Also, I compared the mdout files with and without the atm_frc(:,:) = 0.d0
>>>> at line 414 in pmemd.F90 and the results are the same for the system I
>>>> tested. I am not sure if this will break anything else. I can run the
>>>> amber standard test cases, if that would help.
>>>>
>>>> Respectfully,
>>>>
>>>> Kellon
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 7, 2018 at 1:40 PM Kellon Belfon <kellonbelfon.gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Josh,
>>>>>
>>>>> Thanks for the response. I think in this case it is not because of my
>>>>> system. We have tested other systems in our lab and all of them give the
>>>>> same overflow notes sometimes. This happens during production runs too,
>>>>> and the behavior is random.
>>>>>
>>>>> The overflow only occurs at initialization, when the force array should
>>>>> be zero. Displaying the values shows that the array is mostly zero except
>>>>> for a sporadic large number in between. This is not always on the same
>>>>> atom, and doing ten runs in triplicate showed that the value is sometimes
>>>>> zero and sometimes this large number. I am thinking these might just be
>>>>> garbage values, or old values from a previous use of that memory. Is this
>>>>> normal behavior of the code, or have you ever seen this issue before?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Kellon
>>>>>
>>>>>
>>>>> On Wed, Nov 7, 2018 at 3:13 AM Josh Berryman <the.real.josh.berryman.gmail.com> wrote:
>>>>>
>>>>>> Hi Kellon, if you are getting forces in the region of 1e29 then your
>>>>>> system has severe steric clashes in it: the main problem with
>>>>>> Lennard-Jones for non-bonded interactions is that it diverges quickly for
>>>>>> close approaches of atoms. The answer in that case is probably to run for
>>>>>> 1-10 ps with very high Langevin coupling and a small timestep on the CPU,
>>>>>> or to use the xmin option to pre-stabilise your system (again on the CPU,
>>>>>> where the 64-bit datatype for floats gives you more headroom against
>>>>>> overflows).
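>>>>>>
>>>>>> Something along these lines for the relaxation run, as a rough mdin
>>>>>> sketch (all parameter values are illustrative, tune them for your system,
>>>>>> and run it with the CPU code):
>>>>>>
>>>>>>   short CPU relaxation: small timestep, strong Langevin coupling
>>>>>>    &cntrl
>>>>>>      imin = 0, nstlim = 10000, dt = 0.0005,    ! ~5 ps at 0.5 fs steps
>>>>>>      ntt = 3, gamma_ln = 50.0, temp0 = 300.0,  ! heavy Langevin damping
>>>>>>      ntb = 1, cut = 8.0,                       ! constant volume, 8 A cutoff
>>>>>>      ntpr = 500, ntwx = 0, ig = -1,
>>>>>>    /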
>>>>>>
>>>>>> Google suggests that (as you say you have seen yourself) IEEE_DENORMAL
>>>>>> is not a problem; it describes cases where a number comes close enough
>>>>>> to zero that maybe it should just be rounded to zero anyway.
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 6 Nov 2018 at 20:17, Kellon Belfon <kellonbelfon.gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> We recently upgraded our compiler (gnu 4.8.4 to 7.3.0) on our cluster
>>>>>>> and started getting the following note for our GPU calculations in Amber:
>>>>>>> Note: The following floating-point exceptions are signalling:
>>>>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>>>
>>>>>>> From this previous post (http://archive.ambermd.org/201804/0130.html),
>>>>>>> the response was pretty much "do not worry about them." Does this apply
>>>>>>> to IEEE_OVERFLOW_FLAG as well?
>>>>>>> Also, running the calculation with pmemd.cuda_DPFP does not produce the
>>>>>>> underflow note. I was thinking maybe it comes from mixing floats and
>>>>>>> doubles?
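>>>>>>>
>>>>>>> (For what it is worth, a narrowing conversion by itself can signal that
>>>>>>> flag; a minimal, self-contained sketch, unrelated to the pmemd
>>>>>>> internals:)
>>>>>>>
>>>>>>>   program narrow_overflow_sketch
>>>>>>>     use, intrinsic :: ieee_exceptions
>>>>>>>     implicit none
>>>>>>>     real(kind=8) :: wide
>>>>>>>     real(kind=4) :: narrow
>>>>>>>     logical      :: flagged
>>>>>>>
>>>>>>>     wide = 1.0d300             ! representable in double precision
>>>>>>>     narrow = real(wide, 4)     ! too large for single precision -> overflow
>>>>>>>     call ieee_get_flag(ieee_overflow, flagged)
>>>>>>>     print *, 'IEEE_OVERFLOW signalling:', flagged, ' narrow =', narrow
>>>>>>>   end program narrow_overflow_sketch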
>>>>>>>
>>>>>>> We are getting the overflow note for some of our calculations, but it
>>>>>>> does not affect the results. I also used cuda-gdb with the -G -g
>>>>>>> -ffpe-trap=overflow flags to stop the code where the overflow occurs. I
>>>>>>> found that the overflow occurs in gpu_upload_frc(), during the first
>>>>>>> upload of the forces as the system is being initialized on the GPU.
>>>>>>> Further debugging showed the note occurs when the atm_frc array has
>>>>>>> values that are not zero but instead a large number (atm_frc[i][1] =
>>>>>>> -1.5739204096161189e+29). I think this large number causes the overflow
>>>>>>> note, since the calculation fails right after (the calculation fails
>>>>>>> because I set the ffpe-trap).
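>>>>>>>
>>>>>>> (For reference, the same kind of trap can also be set from inside the
>>>>>>> Fortran code instead of via a compiler flag; a minimal sketch using the
>>>>>>> standard ieee_exceptions module, not the actual pmemd setup:)
>>>>>>>
>>>>>>>   program trap_overflow_sketch
>>>>>>>     use, intrinsic :: ieee_exceptions
>>>>>>>     implicit none
>>>>>>>     real :: x
>>>>>>>     ! abort on overflow, much like compiling with -ffpe-trap=overflow
>>>>>>>     call ieee_set_halting_mode(ieee_overflow, .true.)
>>>>>>>     x = huge(x)
>>>>>>>     x = x * 2.0                ! overflows, so the program stops here
>>>>>>>     print *, x                 ! never reached
>>>>>>>   end program trap_overflow_sketch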
>>>>>>>
>>>>>>> I then ran the same calculation ten times (with the ffpe-trap, so if
>>>>>>> there is an overflow the calculation fails) and got the overflow 4 out
>>>>>>> of 10 times. Then I repeated another 10 runs in triplicate (2/10, 5/10,
>>>>>>> 1/10 overflows). It seems like an unpredictable note. For really small
>>>>>>> numbers, multiplying by the forcescale does the trick, but for these
>>>>>>> large numbers it causes the overflow note and the behavior is
>>>>>>> unpredictable. Does anyone have any advice on this? Should we just
>>>>>>> ignore it, since the results are okay?
>>>>>>>
>>>>>>> Below are the results for one of the trials:
>>>>>>> *Run 1:*
>>>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>>>> arithmetic operation.
>>>>>>> Backtrace for this error:
>>>>>>> #0 0x7fb6597912da in ???
>>>>>>> #1 0x7fb659790503 in ???
>>>>>>> #2 0x7fb658a84f1f in ???
>>>>>>> #3 0x556a6ac73f76 in gpu_upload_frc_
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>>>>>>> #4 0x556a6abeb897 in pmemd
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>>>> #5 0x556a6abecbb3 in main
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>>>> run_direct.sh: line 19: 12686 Floating point exception(core dumped)
>>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>>>> *Run 2: *
>>>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>>>> arithmetic operation.
>>>>>>> Backtrace for this error:
>>>>>>> #0 0x7f8780b802da in ???
>>>>>>> #1 0x7f8780b7f503 in ???
>>>>>>> #2 0x7f877fe73f1f in ???
>>>>>>> #3 0x557f1d950f76 in gpu_upload_frc_
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>>>>>>> #4 0x557f1d8c8897 in pmemd
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>>>> #5 0x557f1d8c9bb3 in main
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>>>> run_direct.sh: line 20: 12691 Floating point exception(core dumped)
>>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>>>> *Run 3:*
>>>>>>> Note: The following floating-point exceptions are signalling:
>>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>>> *Run 4:*
>>>>>>> Note: The following floating-point exceptions are signalling:
>>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>>> *Run 5:*
>>>>>>> Note: The following floating-point exceptions are signalling:
>>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>>> *Run 6:*
>>>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>>>> arithmetic operation.
>>>>>>> Backtrace for this error:
>>>>>>> #0 0x7fa7cca442da in ???
>>>>>>> #1 0x7fa7cca43503 in ???
>>>>>>> #2 0x7fa7cbd37f1f in ???
>>>>>>> #3 0x559568d8cf76 in gpu_upload_frc_
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1675
>>>>>>> #4 0x559568d04897 in pmemd
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>>>> #5 0x559568d05bb3 in main
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>>>> run_direct.sh: line 24: 12708 Floating point exception(core dumped)
>>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>>>> *Run 7: *
>>>>>>> Note: The following floating-point exceptions are signalling:
>>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>>> *Run 8:*
>>>>>>> Program received signal SIGFPE: Floating-point exception - erroneous
>>>>>>> arithmetic operation.
>>>>>>> Backtrace for this error:
>>>>>>> #0 0x7f32888232da in ???
>>>>>>> #1 0x7f3288822503 in ???
>>>>>>> #2 0x7f3287b16f1f in ???
>>>>>>> #3 0x55ae7dec2f13 in gpu_upload_frc_
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/cuda/gpu.cpp:1673
>>>>>>> #4 0x55ae7de3a897 in pmemd
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:445
>>>>>>> #5 0x55ae7de3bbb3 in main
>>>>>>> at /mnt/raidc2/kbelfon/amber18/src/pmemd/src/pmemd.F90:77
>>>>>>> run_direct.sh: line 26: 12717 Floating point exception(core dumped)
>>>>>>> ${pmemdGPU} -O -i ./test.in -p ./5awl.opc.parm7 -c ./8md.rst7 -ref
>>>>>>> ./8md.rst7 -o ./md1.out -x ./md1.x -inf ./md1.info -r ./md1.rst7
>>>>>>> *Run 9:*
>>>>>>> Note: The following floating-point exceptions are signalling:
>>>>>>> IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>>> *Run 10:*
>>>>>>> Note: The following floating-point exceptions are signalling:
>>>>>>> IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Respectfully,
>>>>>>>
>>>>>>> Kellon
>>>>>
>>>>
>>>> --
>>>> Kellon A. A. Belfon, Graduate Student
>>>> Carlos Simmerling Laboratory
>>>> The Laufer Center for Physical and Quantitative Biology
>>>> The Department of Chemistry, Stony Brook University
>>>> Stony Brook, New York 11794
>>>> Phone: (347) 546-4237  Email: kellon.belfon.stonybrook.edu
>>
>>
>>
>
>
> --
> Kellon A. A. Belfon, Graduate Student
> Carlos Simmerling Laboratory
> The Laufer Center for Physical and Quantitative Biology
> The Department of Chemistry, Stony Brook University
> Stony Brook, New York 11794
> Phone: (347) 546-4237  Email: kellon.belfon.stonybrook.edu


_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Nov 09 2018 - 17:30:02 PST