Re: amber-developers: Problems with amber / MPI

From: Kim F. Wong <kfwong.hec.utah.edu>
Date: Thu, 12 Oct 2006 15:24:44 -0600

Ross,

Sorry, I've already updated my tree this morning, so I no longer have the
Oct. 6 build. Your test bombs with my current build:
 
slickrock:test/ROSS>%/uufs/slickrock.moab/sys/pkg/mpich-intel64/bin/mpirun -np 2 $AMBERHOME/exe/sander.MPI -O
p0_31198: (0.113281) Trying to receive a message when there are no connections; Bailing out
p0_31198: p4_error: interrupt SIGSEGV: 11
rm_l_1_31217: (0.113281) net_send: could not write to fd=5, errno = 32
p0_31198: (2.121094) net_send: could not write to fd=4, errno = 9
p0_31198: (2.339844) net_recv failed for fd = 5
p0_31198: p4_error: net_recv read, errno = : 104
forrtl: error (69): process interrupted (SIGINT)
p0_31198: (4.339844) net_send: could not write to fd=4, errno = 32
slickrock:test/ROSS>%
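
For reading the errno values in the output above, here is a minimal Python
sketch that pulls them out of the p4 messages and names them. It assumes the
cluster is running Linux, where 9, 32 and 104 are EBADF, EPIPE and ECONNRESET;
the table and the function name decode_errnos are illustrative only and are
not part of AMBER or MPICH.

```python
import re

# Linux errno values seen in the log above (assumption: the nodes run Linux).
LINUX_ERRNO = {
    9: ("EBADF", "Bad file descriptor"),
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Matches both "errno = 32" and the p4_error form "errno = : 104".
ERRNO_RE = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def decode_errnos(log_text):
    """Return (code, name, description) for each errno found, in log order."""
    found = []
    for match in ERRNO_RE.finditer(log_text):
        code = int(match.group(1))
        name, desc = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
        found.append((code, name, desc))
    return found


if __name__ == "__main__":
    sample = """\
rm_l_1_31217: (0.113281) net_send: could not write to fd=5, errno = 32
p0_31198: (2.121094) net_send: could not write to fd=4, errno = 9
p0_31198: p4_error: net_recv read, errno = : 104
"""
    for code, name, desc in decode_errnos(sample):
        print(code, name, desc)
```

Read this way, the net_send and net_recv lines look like secondary symptoms:
a broken pipe or a reset connection is what one would expect once the other
process has already gone down with the SIGSEGV.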

The stdout from the Oct. 6 "make test.parallel" is attached. I figured
these errors were from the serial runs under MPI, so I tested the
individual entries for the parallel tests, e.g. "make test.sander.EVB",
and most of the parallel tests passed.

If I do "make test.sander.EVB" on the current tree, the tests run fine
(different from what you documented earlier), except for a couple of
failures whose origins I am aware of. The difference could be due to my
environment (ifort 9.0 and mpich), though.

-Kim


Ross Walker wrote:
> Hi Kim,
>
> As far as I can make out, things were broken between the 3rd and the
> 4th of October. Try the following test case with your October 6th tree:
>
> mpirun -np 2 $AMBERHOME/exe/sander.MPI -O
>
> And see if the mdout file matches the mdout.save file.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>


export TESTsander=/uufs/hec.utah.edu/common/vothfs1b/kfwong/VOTH/ONR/amber10/exe/sander.MPI; make test.sander.BASIC
make[1]: Entering directory `/uufs/hec.utah.edu/common/vothfs1b/kfwong/VOTH/ONR/amber10/test'
cd dmp; ./Run.dmp
This test not set up for parallel
 cannot run in parallel with #residues < #pes
cd adenine; ./Run.adenine
This test not set up for parallel
 cannot run in parallel with #residues < #pes
==============================================================
cd cytosine; ./Run.cytosine
p0_20781: p4_error: interrupt SIGSEGV: 11
rm_l_1_20800: (0.109375) net_send: could not write to fd=5, errno = 32
diffing cytosine.out.save with cytosine.out
possible FAILURE: check cytosine.out.dif
==============================================================
cd nonper; ./Run.nonper
p0_20921: p4_error: interrupt SIGSEGV: 11
rm_l_1_20940: (0.117188) net_send: could not write to fd=5, errno = 32
diffing mdout.nonper.save with mdout.nonper
possible FAILURE: check mdout.nonper.dif
==============================================================
cd nonper; ./Run.nonper.belly
p0_21061: p4_error: interrupt SIGSEGV: 11
rm_l_1_21080: (0.121094) net_send: could not write to fd=5, errno = 32
diffing mdout.belly.save with mdout.belly
possible FAILURE: check mdout.belly.dif
==============================================================
cd nonper; ./Run.nonper.belly.mask
p0_21201: p4_error: interrupt SIGSEGV: 11
rm_l_1_21220: (0.121094) net_send: could not write to fd=5, errno = 32
diffing mdout.belly.mask.save with mdout.belly.mask
possible FAILURE: check mdout.belly.mask.dif
==============================================================
cd nonper; ./Run.nonper.min
p0_21341: p4_error: interrupt SIGSEGV: 11
rm_l_1_21360: (0.218750) net_send: could not write to fd=5, errno = 32
diffing mdout.min.save with mdout.min
possible FAILURE: check mdout.min.dif
==============================================================
cd nonper; ./Run.cap
p0_21480: p4_error: interrupt SIGSEGV: 11
rm_l_1_21499: (0.140625) net_send: could not write to fd=5, errno = 32
p0_21480: (2.343750) net_recv failed for fd = 6
p0_21480: p4_error: net_recv read, errno = : 104
forrtl: error (69): process interrupted (SIGINT)
p0_21480: (4.347656) net_send: could not write to fd=5, errno = 32
  ./Run.cap: Program error
make[1]: *** [test.sander.BASIC] Error 1
make[1]: Leaving directory `/uufs/hec.utah.edu/common/vothfs1b/kfwong/VOTH/ONR/amber10/test'
make: *** [test.sander.BASIC.MPI] Error 2
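
For sifting a log like the one above, here is a rough Python sketch that
assigns each "cd dir; ./Run.x" entry a status based on the lines that follow
it. The marker strings are taken from this output only; the function name
summarize and the status labels are made up for illustration, and other AMBER
tests may print different messages.

```python
import re

# A test entry in the make output looks like "cd nonper; ./Run.nonper.min".
RUN_RE = re.compile(r"^cd\s+(\S+);\s+(\./\S+)")


def summarize(log_text):
    """Return (test, status) pairs; status is ok, skipped, failed or error."""
    results = []
    current = None
    for line in log_text.splitlines():
        m = RUN_RE.match(line.strip())
        if m:
            # Start a new entry; assume "ok" until a marker says otherwise.
            current = [m.group(1) + "/" + m.group(2)[2:], "ok"]
            results.append(current)
        elif current is None:
            continue
        elif "not set up for parallel" in line:
            current[1] = "skipped"
        elif "possible FAILURE" in line:
            current[1] = "failed"
        elif "Program error" in line:
            current[1] = "error"
    return [tuple(r) for r in results]


if __name__ == "__main__":
    sample = """\
cd dmp; ./Run.dmp
This test not set up for parallel
cd cytosine; ./Run.cytosine
p0_20781: p4_error: interrupt SIGSEGV: 11
possible FAILURE: check cytosine.out.dif
cd nonper; ./Run.cap
  ./Run.cap: Program error
"""
    for test, status in summarize(sample):
        print(status, test)
```

On the log above this would mark dmp and adenine as skipped, the cytosine and
nonper runs as failed, and Run.cap as the error that stopped make.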
Received on Sun Oct 15 2006 - 06:07:03 PDT