amber-developers: Current CVS Tests Status

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 12 Oct 2006 12:22:49 -0700

Hi All,

Can I suggest that we pick a date in the near future and stop ALL CVS
checkins except bugfixes, and then work to get to a state where all the
code works and all of the tests pass - both in serial AND in parallel. I
think we should then set up a machine that checks out the CVS tree every
night, runs the full test suite both in serial and parallel, and reports
any problems. I have a machine that we can use for this.

Then, if the next morning we find that somebody's changes broke things in
either serial or parallel, we can unwind those changes.
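
For concreteness, a minimal sketch of what such a machine could run every
night is below. The crontab line, CVS module name, working directory and
report address are all placeholders of mine, not an actual setup; the
build-and-test step would be the same serial-plus-parallel sequence every
developer should be running before a commit (sketched further down in this
mail).

#!/bin/sh
# nightly_test.sh -- sketch only.  The crontab line, CVS module name,
# paths and report address are assumptions, not the real setup.
# Hypothetical crontab entry:   0 1 * * * /home/amber/nightly_test.sh

WORK=/scratch/amber-nightly
REPORT=someone@example.org                  # placeholder address

rm -rf "$WORK" && mkdir -p "$WORK" && cd "$WORK" || exit 1
cvs checkout amber10 > nightly.log 2>&1     # fresh tree; uses $CVSROOT

AMBERHOME="$WORK/amber10"
export AMBERHOME

# Run the full serial + parallel build and test sequence (see the
# pre-checkin sketch further down in this mail).
sh "$HOME/run_tests.sh" >> nightly.log 2>&1

# Mail the log plus any .dif files the test suite left behind.
{ cat nightly.log; find "$AMBERHOME/test" -name '*.dif' -exec cat {} \; ; } \
    | mail -s "Nightly AMBER CVS test report `date`" "$REPORT"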

Comments?

Can I also 'politely' ask that before ANYBODY checks anything into the CVS
tree they test everything thoroughly? This means doing a FULL build and
running all test cases both in SERIAL AND IN PARALLEL... somebody's recent
changes have completely hosed sander in parallel and I am now wasting my
morning trying to find out how it was broken so that I can fix it and get
on with adding some more parallel code. A simple check before changes are
committed would save me from having to do this. Be warned: if I work out
how it was broken, fix it, and then it gets broken again by some new
changes, I might just unilaterally remove those changes. Followed by
castration with a rusty spoon, without anaesthetic ;-)

Alternatively if you know you were the one that broke it you are welcome to
send me a grovelling apology... ;-)
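
To spell out what I mean by a full check, something along the lines of the
run_tests.sh sketched below. The configure lines are the ones I used for
the status report further down; the make and test targets, and the choice
of a 2-process run to flush out parallel breakage, are my assumptions and
would need adjusting to your own setup.

#!/bin/sh
# run_tests.sh -- sketch of a full pre-checkin check.  The make and
# test targets are assumptions; adjust them to the actual tree.
cd "$AMBERHOME/src" || exit 1

# --- serial build and full serial test suite ---
make clean
./configure -static ifort_x86_64
make
( cd ../test && make test )

# --- parallel build and full parallel test suite (2 MPI processes) ---
make clean
./configure -static -mpich2 ifort_x86_64
make
DO_PARALLEL="mpirun -np 2"
export DO_PARALLEL
( cd ../test && make test.parallel )

# Any leftover .dif files mean a test did not match its saved output.
find ../test -name '*.dif' -print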

For those who are interested, the following is the current status of the
amber 10 CVS tree as of 10am on 12th Oct 2006:

Pentium EM64T - Intel Fce v9.1.039 (latest version) + MKL 8.0.2 (latest
version)

Serial
------
./configure -static ifort_x86_64
Compilation: Everything builds correctly with only F95 compatibility
warnings.
Test failures (non-roundoff differences):
umbrella - "chi_vs_t" file has Infinity in second column
LES - output_addles.dif - All the addles tests fail because addles now
requires a new-style prmtop file. Also, for some strange reason, the output
files from the test are checked into the CVS tree as well as the saved ones.
pimd_spcfw - spcfw_pimd.out Large differences in Ewald error estimate
pimd_spcfw - spcfw_nscm.out Large differences in Ewald error estimate
pimd_pme - pimd_qmewald2.out Large differences in Ewald error estimate &
Trajectory diverges from test case
crambin_qmmmnmr - sander.DIVCON segfaults (no surprises here)
crambin - sander.DIVCON segfaults
lysine_PM3 - minimization sander.DIVCON segfaults
lysine_PM3 - MD sander.DIVCON segfaults
lysine_AM1 - MD sander.DIVCON segfaults
2pk4 - sander.DIVCON segfaults
antechamber - all tests producing a mol2 file fail because the charge column
precision has been changed.
antechamber - all tests producing prepi files fail because the format of the
last column has been changed.

Parallel (mpich2 v1.0.3)
--------
./configure -static -mpich2 ifort_x86_64
Compilation:
Test failures (mpirun -np 1) [1 processor]:
umbrella - "chi_vs_t" file has Infinity in second column
jar_multi - Quits because the run script is not smart enough to figure out
that there are not enough processors.
ti_eth2meth_gas - All tests fail because the run scripts are not smart
enough to figure out whether the correct number of processors is available.
ti_ggcc - All tests fail for the same reason.
evb tests - All tests fail for the same reason.
REM tests - All tests fail for the same reason.
TI tests - All tests fail for the same reason.
addles - Fails with 'command not found' because the parallel build does not
build addles. Should we remove this test for parallel?
spcfw_pimd - Test fails because it cannot open spcfw_pimd.top, which was
never generated since addles was missing. Does it make sense to have a
second test depend on the successful results of a first test??? (A possible
guard is sketched just after this list.)
pimd_pme - Large differences in Ewald error estimate and trajectory
diverges.
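
On the spcfw_pimd point above: if we keep the dependency, the Run script
could at least bail out cleanly when the prerequisite is missing. A minimal
sketch is below; the executable location and the script it would live in
are my assumptions.

#!/bin/sh
# Hypothetical guard for the top of a Run.spcfw_pimd-style script:
# skip cleanly when the parallel build has not produced addles,
# instead of failing later on the missing spcfw_pimd.top.
ADDLES="$AMBERHOME/exe/addles"       # assumed location of the executable

if [ ! -x "$ADDLES" ]; then
    echo "addles not built (the parallel build skips it) -- skipping test"
    exit 0
fi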


Note: We need to come up with a better way for the test scripts to work out
how many processors the run uses, or maybe change the exit codes so the rest
of the test suite just carries on running. At the moment some scripts use
awk to pull field 3 out of DO_PARALLEL, but this is extremely dangerous
since it is totally machine and MPI implementation dependent: not all
parallel run lines have the number of MPI threads in the third column.
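
One less fragile option (still imperfect, since launch syntax varies
between MPI implementations) would be to look for the token following -np
or -n rather than blindly taking field 3, and to skip rather than die when
the count is wrong. A sketch:

#!/bin/sh
# Sketch: extract the MPI process count from $DO_PARALLEL by looking
# for the argument after -np (or -n) instead of assuming it is field 3.
# Falls back to 1 if nothing is found; still MPI-implementation
# dependent, just less so than a hard-coded column number.
numprocs=`echo "$DO_PARALLEL" | awk '{
    for (i = 1; i < NF; i++)
        if ($i == "-np" || $i == "-n") { print $(i+1); exit }
}'`
numprocs=${numprocs:-1}

if [ "$numprocs" -lt 2 ]; then
    echo "This test needs at least 2 MPI processes -- skipping"
    exit 0          # carry on with the rest of the suite
fi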


Tests (mpirun -np 2) [2 processors]:
Pretty much all tests are completely busted. Several give errors along the
lines of:

cd nonper; ./Run.nonper

[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
[cli_0]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(696)........................: MPI_Allreduce(sbuf=0x114d580,
rbuf=0x114d5a0, count=3, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD)
failed
MPIR_Allreduce(285).......................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(321)............................:
MPIDI_CH3_Progress_wait(198)..............: an error occurred while handling
an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(422):
MPIDU_Socki_handle_read(649)..............: connection failure
(set=0,sock=2,errno=104:(strerror() not found))
rank 1 in job 199 caffeine.sdsc.edu_39729 caused collective abort of all
ranks
  exit status of rank 1: return code 1
     Coordinate resetting (SHAKE) cannot be accomplished,
     deviation is too large
     NITER, NIT, LL, I and J are : 0 0 293 575 576

     Note: This is usually a symptom of some deeper
     problem with the energetics of the system.
rank 0 in job 199 caffeine.sdsc.edu_39729 caused collective abort of all
ranks
  exit status of rank 0: return code 131

Others just give completely the wrong answer.

I am currently looking into how the parallel implementation has been busted.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
Received on Thu Oct 12 2006 - 20:36:16 PDT