Re: amber-developers: Current CVS Tests Status

From: Michael Crowley <crowley.scripps.edu>
Date: Fri, 13 Oct 2006 08:26:11 -0700

Dear Ross,
I am the culprit for the parallel failure, as you can see.
I do not always check in junk, though clearly here I did not check the
parallel compile. I also do not check the email often enough so I missed
this problem that I could have fixed if I had checked my email from the
reflector. Apology offered for those two things and for wasting
everyone's precious time.

I have a polite request also. Tone it down, please. Everyone makes
mistakes, even you. Please keep your comments constructive and
to the point, at least on this mail list, thanks.

My 2 cents:
Regarding your suggestion, this is a simple mistake and happens all the
time. We do not need a major stoppage of checkins. Anyone who wants to
make a tag for a working version is welcome to do so. There was a
mechanism in place when Scott was here to check the tree on all
platforms, and for parallel and serial on as many platforms as possible
every night. It still needed work to take out more human intervention,
so it was more work than anyone volunteered to supply. If you want to
start that up again, where all tests on multiple platforms for serial
and parallel are run, that would be great. Then we find the problems
immediately without stopping any checkins.
Comments on that mechanism from Scott or Dave welcome.

all the best,
Mike


Ross Walker wrote:
> Hi All,
>
> Can I suggest that we pick a date in the near future and stop ALL CVS
> checkins except bugfixes and then we work to get to a state where all the
> code works and all of the tests pass - both in serial AND in parallel. I
> think we should then set up a machine that checks out the CVS tree every night
> and runs the full test suite both in serial and parallel and reports any
> problems. I have a machine that we can use for this.
>
> Then if the next morning we find that somebody's changes broke things in
> either serial or parallel we can unwind those changes.
>
> Comments?
>
> Can I also 'politely' ask that before ANYBODY checks anything into the cvs
> tree they test everything thoroughly. This means you do a FULL build and run
> all test cases both in SERIAL AND IN PARALLEL... somebody's recent changes
> have completely hosed sander in parallel and I am now wasting my morning
> trying to find out how it was broken so that I can fix it and get on with
> adding some more parallel code. A simple check before changes are committed
> would save me from having to do this. Be warned, if I work out how it was
> broken, fix it and then it gets broken again by some new changes I might
> just unilaterally remove these changes. Followed by castration with a rusty
> spoon without anaesthetic ;-)
>
> Alternatively if you know you were the one that broke it you are welcome to
> send me a grovelling apology... ;-)
>
> For those that are interested the following is the current status of the
> amber 10 cvs tree as of 10am on 12th Oct 2006:
>
> Pentium EM64T - Intel Fce v9.1.039 (latest version) + MKL 8.0.2 (latest
> version)
>
> Serial
> ------
> ./configure -static ifort_x86_64
> Compilation: Everything builds correctly with only F95 compatibility
> warnings.
> Test failures (non-roundoff differences):
> umbrella - "chi_vs_t" file has Infinity in second column
> LES - output_addles.dif - All the addles tests fail because addles now
> requires a new style prmtop file. Also for some strange reason the output
> files from the test are checked into the cvs tree as well as the saved ones.
> pimd_spcfw - spcfw_pimd.out Large differences in Ewald error estimate
> pimd_spcfw - spcfw_nscm.out Large differences in Ewald error estimate
> pimd_pme - pimd_qmewald2.out Large differences in Ewald error estimate &
> Trajectory diverges from test case
> crambin_qmmmnmr - sander.DIVCON segfaults (no surprises here)
> crambin - sander.DIVCON segfaults
> lysine_PM3 - minimization sander.DIVCON segfaults
> lysine_PM3 - MD sander.DIVCON segfaults
> lysine_AM1 - MD sander.DIVCON segfaults
> 2pk4 - sander.DIVCON segfaults
> antechamber - all tests producing a mol2 file fail because the charge column
> precision has been changed.
> antechamber - all tests producing prepi files fail because the format of the
> last column has been changed.
>
> Parallel (mpich2 v1.0.3)
> --------
> ./configure -static -mpich2 ifort_x86_64
> Compilation:
> Test failures (mpirun -np 1) [1 processor]:
> umbrella - "chi_vs_t" file has Infinity in second column
> jar_multi - Quits as the run script is not smart enough to figure out that
> there are not enough processors.
> ti_eth2meth_gas - All tests fail as the run scripts are not smart enough to
> figure out if there are the correct number of processors.
> ti_ggcc - All tests fail as the run scripts are not smart enough to figure
> out if there are the correct number of processors.
> evb tests - All tests fail as the run scripts are not smart enough to figure
> out if there are the correct number of processors.
> REM tests - All tests fail as the run scripts are not smart enough to figure
> out if there are the correct number of processors.
> TI tests - All tests fail as the run scripts are not smart enough to figure
> out if there are the correct number of processors.
> addles - fails due to command not found - parallel build does not build
> addles. So should we remove this test for parallel?
> spcfw_pimd - Test fails due to failure to open spcfw_pimd.top since addles
> was missing and this file was not found. Does it make sense to have a second
> test depend on the successful results of a first test???
> pimd_pme - Large differences in Ewald error estimate and trajectory
> diverges.
>
>
> Note: We need to come up with a better way for certain scripts to check how
> many processors the run is using. Or maybe change the exit codes so the test
> cases just carry on running. Some scripts at the moment check the value of
> awk $3 of DO_PARALLEL but this is extremely dangerous since it is totally
> machine and MPI implementation dependent. Not all parallel run lines have
> the number of MPI processes in the 3rd column.
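
One possible alternative to parsing the third word of DO_PARALLEL is to run a trivial command under the launcher and count how many copies start. The following is only a sketch: the function names count_ranks and require_ranks are hypothetical and not part of the existing test scripts. It also assumes the launcher starts one copy of a non-MPI command per process, which should be checked on each MPI implementation.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the existing test suite.

# count_ranks: print how many processes the launcher in $DO_PARALLEL starts.
# Assumption: the launcher runs one copy of a plain command per process.
count_ranks() {
    # DO_PARALLEL unset or empty means a serial run: one process.
    if [ -z "${DO_PARALLEL:-}" ]; then
        echo 1
        return 0
    fi
    # Left unquoted on purpose so the launcher line splits into words.
    # grep -c counts one matching line per process that actually ran.
    $DO_PARALLEL echo rank | grep -c '^rank$'
}

# require_ranks N: return 0 if at least N processes are available,
# otherwise print a skip message and return 1 so the caller can move on.
require_ranks() {
    needed=$1
    have=$(count_ranks)
    if [ "$have" -lt "$needed" ]; then
        echo "SKIP: test needs $needed processes, launcher provides $have"
        return 1
    fi
    return 0
}
```

A run script could then begin with something like `require_ranks 2 || exit 0`, so that a test needing more processes than are available is skipped rather than reported as a failure.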
>
>
> Tests (mpirun -np 2) [2 processors]:
> Pretty much all tests are completely busted. Several give errors along the
> lines of:
>
> cd nonper; ./Run.nonper
>
> [cli_1]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> [cli_0]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(696)........................: MPI_Allreduce(sbuf=0x114d580,
> rbuf=0x114d5a0, count=3, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD)
> failed
> MPIR_Allreduce(285).......................:
> MPIC_Sendrecv(161)........................:
> MPIC_Wait(321)............................:
> MPIDI_CH3_Progress_wait(198)..............: an error occurred while handling
> an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(422):
> MPIDU_Socki_handle_read(649)..............: connection failure
> (set=0,sock=2,errno=104:(strerror() not found))
> rank 1 in job 199 caffeine.sdsc.edu_39729 caused collective abort of all
> ranks
> exit status of rank 1: return code 1
> Coordinate resetting (SHAKE) cannot be accomplished,
> deviation is too large
> NITER, NIT, LL, I and J are : 0 0 293 575 576
>
> Note: This is usually a symptom of some deeper
> problem with the energetics of the system.
> rank 0 in job 199 caffeine.sdsc.edu_39729 caused collective abort of all
> ranks
> exit status of rank 0: return code 131
>
> Others just give completely the wrong answer.
>
> I am currently looking into how the parallel implementation has been busted.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>
>
>

-- 
-----------------------------------------------------------------
Physical mail:   Dr. Michael F. Crowley
                  Department of Molecular Biology, TPC6
                  The Scripps Research Institute
                  10550 North Torrey Pines Road
                  La Jolla, California 92037
Electronic mail: crowley.scripps.edu
Telephone:         858/784-9290
Fax:               858/784-8688
-----------------------------------------------------------------
Received on Sun Oct 15 2006 - 06:07:10 PDT