Re: amber-developers: RE: AMBER: Amber9 serial installation -tests- severe (174): SIGSEGV, segmentation fault occurred

From: Scott Brozell <sbrozell.scripps.edu>
Date: Fri, 20 Oct 2006 03:18:59 -0700

Hi,

I have several comments. But I offer the summary first under
the category of free advice: act professionally, write correct and
readable code, do not write micro-optimized code, profile and if
necessary optimize.

1. Probably we can all agree that correctness and readability are
important goals for software construction.

2. There may be many other goals (robustness, efficiency, maintainability,
extensibility, security,...). Usually tradeoffs will exist, and in this
regard computer programming is an art.

3. The best advice regarding efficiency is to initially ignore it. Then
slowly crank up the compiler optimization level while verifying correctness.
If performance is still inadequate, then profile the program with typical
inputs and examine the hot spots. The biggest gains usually come from
redesigning with better algorithms. Finally, as a last resort, consider
special libraries, specific compiler optimizations, and code tuning (i.e.,
rewriting the code in the manner of speed_test1, speed_test2, etc.).
I also suggest seeking the advice of an optimization expert.
The main point is to determine what, if anything, needs to be optimized
before beginning optimization, and in this regard computer programming
can be a science. See the references for more info.
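As a sketch of that measure-first workflow (in C rather than Fortran; work() and the
constants are illustrative stand-ins, not from any Amber code), one can time a suspected
hot spot before deciding whether it deserves tuning:

```c
/* Before tuning, measure.  A minimal CPU-timing sketch (not a substitute
 * for a real profiler such as gprof or SpeedShop): work() is a
 * hypothetical stand-in for the routine under suspicion. */
#include <assert.h>
#include <time.h>

static double work(int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (double)i * 0.5;   /* arbitrary arithmetic to time */
    return s;
}

/* Time one call of work(n); store its result, return elapsed CPU seconds. */
static double time_work(int n, double *result) {
    clock_t t0 = clock();
    *result = work(n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

If the timing shows the routine is not a hot spot, tuning it is wasted effort no
matter how clever the rewrite.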

4. The code snippets speed_test1, speed_test2, and speed_test3 are
classic examples of micro-optimized code. Their sole purpose should be to
demonstrate one type of code tuning. Any one of those three versions
may yield faster code on some platform and for some application.
But the most likely situation is that all three will have no effect on
performance because they are not in the hot spot.

5. I would like to see the profile data for the original application
(divcon) with each of these 3 speed_test snippets.

6. Modern compilers are one of the great successes of computer science and
engineering. They do a fantastic job of optimization. Consider that
most compilers can now perform most of the optimizations
in Bentley's classic treatise "Writing Efficient Programs".
See the SGI MIPS man pages for ipa and lno.

7. Compilers are not perfect, and there are significant differences across
computer architectures and vendors. Learn how to use compilers effectively;
-O3 is just the tip of the iceberg. Also heed compiler caveats with
respect to aggressive optimizations.

8. The situation with respect to if-statements can be especially complicated
because of architectures with inherent parallelism and branch prediction.
In some cases removing an if-statement from a hot spot will have no
effect on performance. This is most probable inside loops with substantial
operation counts, which is not the case for the speed_tests.

9. I wouldn't be surprised to find that some compilers have specific
optimizations to transform speed_test1 into speed_test3. However,
I have considerable experience with the SGI MIPS compilers and cannot
remember such an option. The problem is that while the if-conditional
is a loop invariant, the guarded statement is not. Thus, the optimization
must duplicate the loop(s). Here the code expansion is small, but it
could be large. To do it right, a user-controllable max_expansion_length
should exist; it may be that over many applications this
optimization is infrequent and thus not worth the complexity.
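The transformation in question, usually called loop unswitching, can be sketched
as follows (in C rather than Fortran; all names here are illustrative, not taken
from divcon):

```c
/* Loop unswitching: a loop-invariant test is hoisted out of the loop at
 * the cost of duplicating the loop body. */
#include <assert.h>

#define N 1000

/* speed_test1 style: invariant test evaluated on every iteration. */
static void fill_branch_inside(int a1[], int a2[], int do_extra) {
    int ncount = 1;
    for (int j = 0; j < N; j++) {
        a1[j] = ncount;
        if (do_extra > 0)
            a2[j] = ncount;
        ncount++;
    }
}

/* speed_test3 style: test evaluated once, loop duplicated. */
static void fill_unswitched(int a1[], int a2[], int do_extra) {
    int ncount = 1;
    if (do_extra > 0) {
        for (int j = 0; j < N; j++) {
            a1[j] = ncount;
            a2[j] = ncount;
            ncount++;
        }
    } else {
        for (int j = 0; j < N; j++) {
            a1[j] = ncount;
            ncount++;
        }
    }
}
```

The two versions are observably identical; only the code size and the
per-iteration branch count differ, which is exactly why a compiler doing this
automatically would want a cap on the expansion.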

10. For me one obvious tuning of speed_test1 is to remove the if, period:

speed_test0
-----------------
  ncount=1
  do i = 1, iloop
    do j = 1, iloop
      array1(j) = ncount
      ! the untuned original:
      ! if (i_do_extra>0) array2(j) = ncount
      ! tuned by removing the if
      array2(j) = ncount
      ncount=ncount+1
    end do
  end do

My guess is that this will give good performance on many platforms,
and it avoids code expansion. (It could be incorrect depending on the
context of a real program, and it could be much slower on architectures
without inherent parallelism, i.e., vectorization, pipelining, etc., but
the trend is towards more inherent parallelism.)

The other obvious tune is to try array syntax:

speed_test90
-----------------
  ncount=1
  do i = 1, iloop
    do j = 1, iloop
      array1(j) = ncount
      ncount=ncount+1
    end do
  end do
  if (i_do_extra>0) array2 = array1

Dave and I started using array syntax in 2003. I don't know if it has
caught on, but the result here sure feels right.
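The same bulk-copy idea can be sketched in C (illustrative names; memcpy plays
the role of the Fortran array assignment, and the double loop is collapsed to
one pass for brevity):

```c
/* C analogue of speed_test90: fill array1 element by element, then
 * perform the conditional copy once, in bulk, after the loop. */
#include <assert.h>
#include <string.h>

#define N 1000

static void fill_bulk_copy(long a1[], long a2[], int do_extra) {
    long ncount = 1;
    for (int j = 0; j < N; j++)
        a1[j] = ncount++;
    if (do_extra > 0)              /* tested once, outside the loop */
        memcpy(a2, a1, sizeof(long) * N);
}
```

Moving the conditional work out of the loop entirely keeps the hot loop
branch-free without duplicating it.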

11. For entertainment, on an sgi
CPU: MIPS R10000 Processor Chip Revision: 3.4
FPU: MIPS R10010 Floating Point Chip Revision: 0.0
MIPSpro Compilers: Version 7.3.1.2m

iloop = 50,000  i_do_extra = 0
Time in seconds
speed test     -O0     -O2     -O3
     0       290.7    56.6    0.000
     1       191.9    43.3   24.1
     2       195.9    30.4   31.9
     3       187.4    30.1    0.000
    90           -    30.2    0.000

Yes, those 0.000 are correct and are due to loop pipelining:
iloop = 50,000,000
speed test     -O0     -O2     -O3
     0           -       -    44.0
     3           -       -    45.5   i_do_extra=1
     3           -       -     2.2   i_do_extra=0
    90           -       -    45.6   i_do_extra=1
    90           -       -     2.2   i_do_extra=0

It is satisfying that the array version is correct, readable,
efficient, and unbloated.

12. Let me second Mike Crowley's request: tone it down, please,
especially when it is on the very public amber.scripps.edu.


Happy computing,
Scott

On Thu, 19 Oct 2006, Ross Walker wrote:

> Hi Kennie and others.
>
> >Finally, most modern compilers will reorganize these kinds of do loops in a
> more efficient manner,
> >so there is always a tension between readable code and doing the compilers
> job for it. Any optimizing
> >compiler worth a damn can deal with these constructs
>
> You would indeed think that the compiler should optimise out the if
> statement, but in my experience this is not the case; in fact most modern
> compilers are pretty awful when it comes to optimising things like this, and
> you have to do almost all the work for the compiler. For example, take
> the following test case that I have attached to this email. There are 3
> files which differ in the following loop:
>
> speed_test1
> -----------------
>   ncount=1
>   do i = 1, iloop
>     do j = 1, iloop
>       array1(j) = ncount
>       if (i_do_extra>0) array2(j) = ncount
>       ncount=ncount+1
>     end do
>   end do
>
> speed_test2
> -----------------
>   ncount=1
>   do i = 1, iloop
>     if (i_do_extra>0) then
>       do j = 1, iloop
>         array1(j) = ncount
>         array2(j) = ncount
>         ncount=ncount+1
>       end do
>     else
>       do j = 1, iloop
>         array1(j) = ncount
>         ncount=ncount+1
>       end do
>     end if
>   end do
>
> speed_test3
> -----------------
>   ncount=1
>   if (i_do_extra>0) then
>     do i = 1, iloop
>       do j = 1, iloop
>         array1(j) = ncount
>         array2(j) = ncount
>         ncount=ncount+1
>       end do
>     end do
>   else
>     do i = 1, iloop
>       do j = 1, iloop
>         array1(j) = ncount
>         ncount=ncount+1
>       end do
>     end do
>   end if
>
> So in speed_test1 I have the if statement inside the loop in the same
> fashion that the divcon code had it. In case 2 I have factored the if
> statement out of the inner loop. And in case 3 I have factored it out of
> both loops.
>
> Now you would think that any compiler worth its salt would spot this and be
> able to optimise out the if statement. And in fact it probably does.
> However, most modern cpus have vector single-instruction-multiple-data
> registers: MMX for integer, SSE for single precision FP and SSE2 for double
> precision FP. The filling of array1 and array2 can be vectorised in the case
> of speed_test3 and speed_test2 but can't be in speed_test1, although
> potentially the compiler could factor the if statement out of the loop and
> rewrite it internally as two loops that can be vectorised. However, I have
> never found a compiler that does this. Ultimately a human could also
> realize that the do i loop is almost entirely redundant: array1 and array2
> only actually need to be filled on the final pass of the outer i loop. We
> can let the compiler off on this point, though, as it has to assume that
> some other thread might be reading that info while it is doing other
> things, and that type of modification might be considered one step beyond
> simply optimising. Most compilers (all compilers?) can't even cope with
> the simplest level of rewriting, though.
>
> E.g. Here are the timings for 2 modes of operation running on my 3.2GHz
> Pentium-D.
>
> 1) iloop = 50000 and i_do_extra=1
> 2) iloop = 50000 and i_do_extra=0
>
> If you want to run the code for yourself you just compile it with the
> compiler options you want and run it with ./a.out 50000 1
> where arg1 = iloop and arg2 = i_do_extra.
>
> I compiled this code in 3 ways: firstly with "g95 -O3", then with "ifort
> -O3" and finally with "ifort -O3 -xP". The -xP adds SSE2 and MMX
> vectorisation, and indeed the compiler (as I expected) vectorized both loops
> in test cases 2 and 3 but couldn't do any vectorization in example 1. Thus
> you get very different performance numbers, as follows:
>
> i_do_extra = 1  iloop = 50,000
> Time in seconds
> speed test   g95_-O3   ifort_-O3   ifort_-O3_-axNP
>     1         8.4216      5.0816        5.0944
>     2         6.1729      5.0688        4.2752
>     3         6.1373      5.1072        4.2624
>
> i_do_extra = 0
> Time in seconds
> speed test   g95_-O3   ifort_-O3   ifort_-O3_-axNP
>     1         4.8720      2.4448        2.3680
>     2         4.8564      2.1888        1.9840
>     3         4.8556      2.2016        1.9840
>
> So, as you see, in all cases the optimum code is to have the if statement
> manually factored out of both loops. In fact, with the Intel compiler at full
> optimization this is over 16% faster in both cases, whether array2 is filled
> or not. Hence having the if statement inside the inner loop really hurts you,
> and compilers do not fix this. Note I see the same behaviour with gfortran
> and with IBM's xlf90.
>
> I see this far too often. People make assumptions about what compilers will
> do based on hearsay (and common sense ;-) ) but never actually stop and
> check if their thinking is true.
>
> The moral of this story... "Don't trust anything you hear about compilers
> and subject everything to the scientific method." The best thing you can do
> is code as if the compiler is completely stupid, and then you are less likely
> to go wrong.
>
> The bugginess of the Intel compiler is another story altogether.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk <http://www.rosswalker.co.uk/> | PGP Key
> available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>
>
>
> _____
>
> From: Ken Merz [mailto:merz.qtp.ufl.edu]
> Sent: Thursday, October 19, 2006 14:44
> To: David A. Case; Ross Walker
> Cc: Bing Wang; Kenneth Ayers; Seth Hayik
> Subject: Re: AMBER: Amber9 serial installation -tests- severe (174):
> SIGSEGV, segmentation fault occurred
>
>
> Hi,
> First, sorry that the earlier patch was not dealt with. The ball was dropped
> on our end. Seth is the point person from now on, but please cc me on all
> correspondence, so I am in the loop. We will run the patch here as well, so
> we hopefully will avoid future issues along these lines. In terms of Intel
> compiler issues we do not use this compiler here, so these problems will
> likely continue to crop up. Sorry. Finally, most modern compilers will
> reorganize these kinds of do loops in a more efficient manner, so there is
> always a tension between readable code and doing the compiler's job for it.
> Any optimizing compiler worth a damn can deal with these constructs - one
> has to wonder what is going on with the Intel compiler. Take care. Kennie
>
>
>
> On Oct 19, 2006, at 3:24 PM, Ross Walker wrote:
>
>
> Hi Mina,
>
>
> I use the intel compilers 9.1038 (for icc and icpc) and ifort
> (9.10.32). My mkl library is version 8.1.
>
>
> test.sander.DIVCON
> The Run.wnmr in the water2_qmmmnmr/ dir passes
> but the Run.cnmr produces the following
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>
>
> test.antechamber produces segmentation faults as well. Specifically
> in the tp/ dir the script Run.tp produces:
> Running: $AMBERHOME/exe/divcon
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>
>
> This has been noted before and we are aware of the problem. I believe that
> it is triggered by a bug in the Intel compiler's vectorization unit that is
> generating incorrect machine code. These sorts of bugs (compiler based) are
> totally demoralizing if you try to fix them, as they often make no logical
> sense in the 'fortran space'. Often you find where the code is dying, change
> something, and it just dies elsewhere for no reason. Or you add a print
> statement where you think it crashes and the code suddenly magically works.
> That said, THE 'DIVCON' PEOPLE IN FLORIDA should really be looking to find a
> workaround for this, but I see no messages from them, so as usual I will
> have to do THEIR WORK FOR THEM. :-(.
>
> At present I believe the following is the status of the Intel compilers with
> Divcon in Amber 9:
>
> 8.1.031 Works
> 8.1.034 Works
> 9.0.033 Works
> 9.1.036 Segfault
> 9.1.037 Segfault
> 9.1.039 Segfault
>
> And I guess we can now also add 9.1.038 to that list. So it doesn't look
> like the problem is going to go away anytime soon. So one option would be to
> see if you can get hold of one of the above compiler versions that works.
> Alternatively, I have tracked down that the problem is related to
> vectorization in the Intel compiler. This is enabled with -axNP. Ideally you
> don't want to turn this off for all routines as it will really slow things
> down. I have tracked this down to the following code:
>
> do j=ibeg+1,inum
>   nnmratm = nnmratm + 1
>   if ( .not. calcMemRequirements) then
>     inmratm(nnmratm) = j
>   endif
> enddo
>
> Changing this to:
>
> if (calcMemRequirements) then
>   do j=ibeg+1,inum
>     nnmratm = nnmratm + 1
>   enddo
> else
>   do j=ibeg+1,inum
>     nnmratm = nnmratm + 1
>     inmratm(nnmratm) = j
>   enddo
> end if
>
> fixes the problem with the crambin_qmmmnmr test case. Now, leaving aside the
> fact that the first example of the code is a really stupid way of coding
> with respect to performance, it is still unacceptable that the Intel compiler
> can't cope with it. I really wish the Intel people would get their act
> together.
>
> Anyway, changing the above only fixes the first problem. The crambin_divcon
> test case still segfaults. This one is due to the following piece of code:
>
> DO 10 II=1,IIMAX
>   FDIAG(II) = 0.0D0
>   if(donmr) FIDIAG(II) = 0.0d0
> 10 ENDDO
> DO 20 IJ=1,IJMAX
>   FDIAT(IJ) = 0.0D0
>   if(donmr) FIDIAT(IJ) = 0.0d0
> 20 ENDDO
>
> Again, nothing legally wrong with it, although it is also stupid
> performance-wise. Note, though, how similar the logic is to the first problem.
>
> Changing this to the more efficient but logically identical:
>
> if (donmr) then
>   DO II=1,IIMAX
>     FDIAG(II) = 0.0D0
>     FIDIAG(II) = 0.0d0
>   ENDDO
>   DO IJ=1,IJMAX
>     FDIAT(IJ) = 0.0D0
>     FIDIAT(IJ) = 0.0d0
>   ENDDO
> else
>   DO II=1,IIMAX
>     FDIAG(II) = 0.0D0
>   ENDDO
>   DO IJ=1,IJMAX
>     FDIAT(IJ) = 0.0D0
>   ENDDO
> end if
>
> fixes the problem. This is thus definitely a bug in the Intel compiler. If I
> get a chance I shall try to make a simple example and submit it to Intel as
> a bug report, but don't hold your breath.
>
> In the meantime I have replaced all of these loop constructs that I can find
> in the divcon code and have prepared a bugfix patch file that is attached to
> this email. This passes all the tests on my machine. Can you try it on your
> machine and see if it works?
>
> Copy the file to $AMBERHOME/
> cd $AMBERHOME
> patch -p0 <divcon_intel.bugfix
> cd src
> make clean
> make
> cd ../test
> make
>
> See if this fixes the problems. If it does please let me know and I will put
> together a formal bug fix on the amber website.
>
>
> PS: I build the config.h by running ./configure -nopar
> -bintraj -p4 ifort_ia32 and then replace manually gcc -m32
> with icc and g++ with icpc.
>
>
> It is not related to your problem, but you don't need to manually replace gcc
> with icc and g++ with icpc when using the Intel compilers. gcc and g++
> should work fine, and none of the performance-critical code is written in C,
> so while it does no harm to use the Intel compilers it will make little
> performance difference. I typically would only change it if I encountered
> problems using gcc.
>
> All the best
> Ross
>
> <divcon_intel.bugfix>
>
>
>
> Professor Kenneth M. Merz, Jr.
> Department of Chemistry
> Quantum Theory Project
> 2328 New Physics Building
> PO Box 118435
> University of Florida
> Gainesville, Florida 32611-8435
>
> e-mail: merz.qtp.ufl.edu
> http://www.qtp.ufl.edu/~merz
>
> Phone: 352-392-6973
> FAX: 352-392-8722
> Cell: 814-360-0376
>
>
>
>
>
>
Received on Sun Oct 22 2006 - 06:07:23 PDT