amber-developers: RE: AMBER: Amber9 serial installation -tests- severe (174): SIGSEGV, segmentation fault occurred

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 19 Oct 2006 18:46:35 -0700

Hi Kennie and others.
 
>Finally, most modern compilers will reorganize these kinds of do loops in a
more efficient manner,
>so there is always a tension between readable code and doing the compilers
job for it. Any optimizing
>compiler worth a damn can deal with these constructs
 
You would indeed think that the compiler should optimise out the if
statement but in my experience this is not the case and in fact most modern
compilers are pretty aweful when it comes to optimising things like this and
you have to do almost all the work for the compiler. E.g. for example take
the following test case that I have attached to this email. There are 3
files which differ in the following loop:
 
speed_test1
-----------------
  ncount=1
  do i = 1, iloop
    do j = 1, iloop
      array1(j) = ncount
      if (i_do_extra>0) array2(j) = ncount
      ncount=ncount+1
    end do
  end do
 
speed_test2
-----------------
  do i = 1, iloop
    if (i_do_extra>0) then
      do j = 1, iloop
        array1(j) = ncount
        array2(j) = ncount
        ncount=ncount+1
      end do
    else
      do j = 1, iloop
        array1(j) = ncount
        ncount=ncount+1
      end do
    end if
  end do
 
speed_test3
-----------------
  if (i_do_extra>0) then
    do i = 1, iloop
      do j = 1, iloop
        array1(j) = ncount
        array2(j) = ncount
        ncount=ncount+1
      end do
    end do
  else
    do i = 1, iloop
      do j = 1, iloop
        array1(j) = ncount
        ncount=ncount+1
      end do
    end do
  end if
 
So in speed_test 1 I have the if statement inside the loop in the same
fashion that the divcon code had it. In case 2 I have factored the if
statement out of the inner loop. And in case 3 I have factored it out of
both loops.
 
Now you would think that any compiler worth it's salt would spot this and be
able to optimise out the if statement. And in fact it probably does.
However, most modern cpus have vector single instruction multiple data
registers. MMX for integer, SSE for single precision FP and SSE2 for double
precision FP. The filling of array1 and array2 can be vectorised in the case
of speed_Test3 and speed_test2 but can't be in speed_test1. Although
potentially the compiler should factor the if statement out of the loop and
re-write it internally as two loops that can be vectorised. However, I have
never found a compiler that does this. Also ultimately a human could also
realize that the do i loop is almost entirely redundant and the array1 and
array2 arrays only actually need to be filled on the last j loop. We can let
the compiler off on this point though as it has to assume that some other
thread might be reading that info while it is doing other things. And that
type of modification of code might be considered one step beyond simply
optimising. Most compilers (all compilers?) can't even cope with the
simplest level of re-writing though.
 
E.g. Here are the timings for 2 modes of operation running on my 3.2GHz
Pentium-D.
 
1) iloop = 50000 and i_do_extra=1
2) iloop = 50000 and i_do_extra=0
 
If you want to run the code for yourself you just compile it with the
compiler options you want and run it with ./a.out 50000 1
where arg1 = iloop and arg2 = i_do_extra.
 
I compiled this code in 3 ways. Firstly with "g95 -O3", then with "ifort
-O3" and finally with "ifort -O3 -xP". The -xP adds SSE2 and MMX
vectorisation and indeed the compiler (as I expected) vectorized both loops
in test case 2 and 3 but couldn't do any vectorization in example 1. Thus
you get very different peformance numbers as follows:
 
i_do_extra = 1 iloop = 50,000
Time in seconds
speed test g95_-O3 ifort_-O3 ifort_-O3_-axNP
1 8.4216 5.0816 5.0944
2 6.1729 5.0688 4.2752
3 6.1373 5.1072 4.2624
 
i_do_extra = 0
Time in seconds
speed test g95_-O3 ifort_-O3 ifort_-O3_-axNP
1 4.8720 2.4448 2.3680
2 4.8564 2.1888 1.9840
3 4.8556 2.2016 1.9840
 
So as you see in all cases the optimum code is to have the if statement
manually factored out of both loops. In fact with the Intel compiler at full
optimization this is
over 16% faster in both cases whether array2 is filled or not. Hence having
the if statement inside the inner loop really hurts you and compilers do not
fix this. Note I see the same behaviour with gfortran and with IBM's xlf90.
 
I see this far too often. People make assumptions of what compilers will do
based on heresay (and common sense ;-) ) but never actually stop and check
if there thinking is true.
 
The moral of this story... "Don't trust anything you hear about compilers
and subject everything to the scientific method." The best thing you can do
is code as if the compiler is completely stupid and then you are less likely
to go wrong.
 
The buggyness of the Intel compiler is another story altogether.
 
All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk <http://www.rosswalker.co.uk/> | PGP Key
available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

 


  _____

From: Ken Merz [mailto:merz.qtp.ufl.edu]
Sent: Thursday, October 19, 2006 14:44
To: David A. Case; Ross Walker
Cc: Bing Wang; Kenneth Ayers; Seth Hayik
Subject: Re: AMBER: Amber9 serial installation -tests- severe (174):
SIGSEGV, segmentation fault occurred


Hi,
First, sorry that the earlier patch was not dealt with. The ball was dropped
on our end. Seth is the point person from now on, but please cc me on all
correspondence, so I am in the loop. We will run the patch here as well, so
we hopefully will avoid future issues along these lines. In terms of Intel
compiler issues we do not use this compiler here, so these problems will
likely continue to crop up. Sorry. Finally, most modern compilers will
reorganize these kinds of do loops in a more efficient manner, so there is
always a tension between readable code and doing the compilers job for it.
Any optimizing compiler worth a damn can deal with these constructs - one
has to wonder what is going on with the intel compiler. Take care. Kennie



On Oct 19, 2006, at 3:24 PM, Ross Walker wrote:


Hi Mina,


I use the intel compilers 9.1038 (for icc and icpc) and ifort
(9.10.32). My mkl library is version 8.1.


tets.sander.DIVCON
The Run.wnmr in the water2_qmmmnmr/ dir passes
but the Run.cnmr produces the following
forrtl: severe (174): SIGSEGV, segmentation fault occurred


test.antechamber produces segementation faults as well. Specifically
inthe tp/ dir the script Run.tp produces:
Running: $AMBERHOME/exe/divcon
forrtl: severe (174): SIGSEGV, segmentation fault occurred


This has been noted before and we are aware of the problem. I believe that
it is triggered by a bug in the Intel compiler's vectorization unit that is
generating incorrect machine code. These sorts of bugs (compiler based) are
totally demoralizing if you try to fix them as they often make no logical
sense in the 'fortran space'. Often you find where the code is dying, change
something and it just dies elsewhere for no reason. Or you add a print
statement where you think it crashes and the code suddenly magically works.
That said THE 'DIVCON' PEOPLE IN FLORIDA should really be looking to find a
work around for this but I see no messages from them so as usual thus I will
have to do THEIR WORK FOR THEM. :-(.

At present I believe the following is the status of the Intel compilers with
Divcon in Amber 9:

8.1.031 Works
8.1.034 Works
9.0.033 Works
9.1.036 Segfault
9.1.037 Segfault
9.1.039 Segfault

And I guess we can now also add 9.1.038 to that list. So it doesn't look
like the problem is going to go away anytime soon. So one option would be to
see if you can get hold of one of the above compiler versions that works.
Alternatively, I have tracked down that the problem is related to
vectorization in the Intel compiler. This is enabled with -axNP. Ideally you
don't want to turn this off for all routines as it will really slow things
down. I have tracked this down to the following code:

do j=ibeg+1,inum
nnmratm = nnmratm + 1
if ( .not. calcMemRequirements) then
inmratm(nnmratm) = j
endif
enddo

Changing this to:

if (calcMemRequirements) then
do j=ibeg+1,inum
nnmratm = nnmratm + 1
enddo
else
do j=ibeg+1,inum
nnmratm = nnmratm + 1
inmratm(nnmratm) = j
enddo
end if

Fixes the problem with the crambin_qmmmnmr test case. Now leaving aside the
fact that the first example of the code is a really stupid way of coding
with respect to performance it is still unacceptable that the Intel compiler
can't cope with it. I really wish the Intel people would get their act
together.

Anyway, changing the above only fixes the first problem. The crambin_divcon
test case still sefaults. This one is due to the following piece of code:

DO 10 II=1,IIMAX
FDIAG(II) = 0.0D0
if(donmr) FIDIAG(II) = 0.0d0
10 ENDDO
DO 20 IJ=1,IJMAX
FDIAT(IJ) = 0.0D0
if(donmr) FIDIAT(IJ) = 0.0d0
20 ENDDO

Again, nothing legally wrong with it, although it is also stupid performance
wize. Note though how similar the logic is to the first problem.

Changing this to the more efficient but logically identical:

if (donmr) then
DO II=1,IIMAX
FDIAG(II) = 0.0D0
FIDIAG(II) = 0.0d0
ENDDO
DO IJ=1,IJMAX
FDIAT(IJ) = 0.0D0
FIDIAT(IJ) = 0.0d0
ENDDO
else
DO II=1,IIMAX
FDIAG(II) = 0.0D0
ENDDO
DO IJ=1,IJMAX
FDIAT(IJ) = 0.0D0
ENDDO
end if

fixes the problem. This is thus definately a bug in the Intel compiler. If I
get a chance I shall try to make a simple example and submit it to intel as
a bug report but I wouldn't hold your breath.

In the meantime I have replaced all of these loop constructs that I can find
in the divcon code and have prepared a bugfix patch file that is attached to
this email. This passes all the tests on my machine. Can you try it on your
machine and see if this works.

Copy the file to $AMBERHOME/
cd $AMBERHOME
patch -p0 <divcon_intel.bugfix
cd src
make clean
make
cd ../test
make

See if this fixes the problems. If it does please let me know and I will put
together a formal bug fix on the amber website.


PS: I build the config.h by running ./configure -nopar
-bintraj -p4 ifort_ia32 and then replace manually gcc -m32
with icc and g++ with icpc.


It is not related to your problem but you don't need to manually replace gcc
with icc and g++ with icpc when using the Intel compilers. gcc and g++
should work fine and none of the performance critical code is written in c
so why it does no harm to use the intel compilers it will make little
performance difference. I typically would only change it if I encountered
problems using gcc.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
<divcon_intel.bugfix>



Professor Kenneth M. Merz, Jr.
Department of Chemistry
Quantum Theory Project
2328 New Physics Building
PO Box 118435
University of Florida
Gainesville, Florida 32611-8435

e-mail: merz.qtp.ufl.edu
http://www.qtp.ufl.edu/~merz

Phone: 352-392-6973
FAX: 352-392-8722
Cell: 814-360-0376






Received on Sun Oct 22 2006 - 06:07:20 PDT
Custom Search