With MPI implementations like MPICH running on clusters, there definitely
can be problems with hung processes left over from previous jobs that failed;
I keep telling myself I should look into this seriously sometime, but I have
a nagging suspicion that we are really not doing anything all that wrong. I
presume others see this too; I observe it when testing over gigabit
ethernet with MPICH, with either sander or pmemd. I also saw scads of
problems when we had Myrinet running on a cluster at UNC. The bloody
hardware failed and had to be reset repeatedly as it got older. I never got
into the sys admin side of any of this, so I don't know what the underlying
Myrinet issues were. With MPICH on gigabit ethernet, I do know that it is
possible to tie up system resources and degrade the performance of, or hang,
other processes. If I remember correctly, this comes down to deadlocking on
the memory available for message buffers; the sketch below illustrates the
general idea. I tend to blame the system configuration/hardware because I
run on so many large installations where there are no problems at all. All
of which probably doesn't help much...
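
To make the buffer point concrete, here is a minimal sketch of the kind of
deadlock I mean (plain MPI C, nothing AMBER-specific; the message size and
the rank pairing are made up for illustration). Both partners post a
blocking MPI_Send before the matching MPI_Recv: with small messages the
library buffers the send eagerly and the run completes, but once the message
exceeds the eager threshold both sends sit waiting for buffer space or a
matching receive, and the job hangs.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        /* message length in doubles; whether this hangs depends on the
           implementation's eager/rendezvous threshold */
        int n = (argc > 1) ? atoi(argv[1]) : 1 << 20;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *sendbuf = malloc(n * sizeof(double));
        double *recvbuf = malloc(n * sizeof(double));

        int partner = rank ^ 1;          /* pair ranks 0<->1, 2<->3, ... */
        if (partner < nprocs) {
            /* Both partners send first: completion depends entirely on
               the MPI library having buffer space for the unmatched
               send.  This is the buffer-memory deadlock described in
               the text above.                                          */
            MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        if (rank == 0)
            printf("exchange of %d doubles completed\n", n);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

This is obviously not what sander or pmemd actually do; it is just the
textbook way a parallel run ends up hung on buffer memory even though it
runs fine on most installations. Pairing the calls with MPI_Sendrecv, or
using nonblocking MPI_Isend/MPI_Irecv, removes the dependence on buffering.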
Regards - Bob
----- Original Message -----
From: "David A. Case" <case.scripps.edu>
To: <amber-developers.scripps.edu>
Sent: Monday, March 27, 2006 7:42 PM
Subject: Re: amber-developers: GB/LES GB1 diffcoords FP exception on Sun
> On Mon, Mar 27, 2006, carlos wrote:
>
>> that looks fine to me.
>
> OK..I'll check it in.
>
>> is the REMD code still failing or only the hybrid one?
>
> Everything with REMD is now working for me. I had a problem with parallel
> targeting MD, but then that went away; Mike now has a problem with a QM/MM
> test case, but I can't reproduce it... etc.
>
> I think we are seeing intermittent MPI failures; maybe this occurs when
> running lots of short test cases -- could it be that somehow the MPI daemon
> doesn't really get cleaned up after one test before the next test has
> begun?
> Have you ever seen any behavior like this?
>
> Anyway, at least lots of people are testing things. What you might do
> is try the latest (CVS) code on something other than a test case -- some
> job like you are running for real production... especially a replica
> exchange job.
>
> ...thanks!...dave
>
>