Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate

From: Lachele Foley <lfoley.ccrc.uga.edu>
Date: Wed, 17 Mar 2010 12:37:18 -0400

Bob, thanks for the info. It helped.

There might be several issues. There's no reason it can't be both a filesystem issue -and- something mismatched about the MPI buffers. So far as I know, HP set up the MPI, not me. Will check. I can't mess with this much more for the next week or so, but after that I'll do more.
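In case it helps anyone picture the buffer-size business Bob describes below, here's a tiny sketch in plain MPI C -- purely hypothetical, not taken from sander/ptraj or from our actual setup. Two ranks that both do a blocking send before either posts a receive will appear to work while the message still fits in the MPI/system buffers (eager protocol), and will hang once it doesn't:

/* deadlock.c -- hypothetical illustration only (mpicc deadlock.c; run with 2 ranks).
 * Both ranks call MPI_Send before MPI_Recv.  Small messages get buffered
 * (eager protocol) and the exchange "works"; once NBIG exceeds what the MPI
 * layer / OS will buffer, both sends block waiting for a matching receive
 * and the job hangs -- a buffer-size-dependent hang, not a bug you would
 * ever see at small message sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBIG (1 << 22)   /* ~32 MB of doubles: enough to defeat eager buffering */

int main(int argc, char **argv)
{
    int rank, size, peer;
    double *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    peer = 1 - rank;

    sendbuf = malloc(NBIG * sizeof(double));
    recvbuf = malloc(NBIG * sizeof(double));

    /* Both ranks send first: safe only while MPI can buffer the whole
     * message; otherwise deadlock.  MPI_Sendrecv (or an MPI_Isend/MPI_Irecv
     * pair) sidesteps the problem regardless of buffer settings. */
    MPI_Send(sendbuf, NBIG, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, NBIG, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Shrink NBIG and the same binary runs fine, which is why this sort of thing only shows up on some systems or configurations.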

The trouble is that Gromacs is run frequently on this machine and doesn't show any of these issues (I haven't run it myself -- I'm accepting third-party info). So I'm having to work extra hard to show that it's most likely the system, and not bad coding in AMBER that's just now surfacing. I'm keeping you in the loop since it probably matters to you what happens on fancy new machines.

The filesystem is Lustre and all of it was put together by HP for the purpose of doing exactly this sort of thing, so my take is that it should be working like a charm. I think the interconnect is all infiniband except maybe a few hardware management ports. Can't recall exactly.

I got a new ifort (10.0.23) and tried it, but didn't have a matching icc at the moment. I compiled sander anyway, and still get data corruption in serial and infant mortality in sander.MPI. pmemd wouldn't build, complaining that the ifort and icc versions didn't match. So, tentatively, it's not exactly the compiler.

Maybe there's a mismatch between the GNU OS libraries and the Intel libraries the code uses. Presumably the SFS firmware/software uses Intel, too. Would that make any sense?

Maybe what the grad students are up to at the moment will help some, too.

Must get back to work now... Thanks for all the help, to all of you. I do appreciate it.

:-) Lachele
--
B. Lachele Foley, PhD '92,'02
Assistant Research Scientist
Complex Carbohydrate Research Center, UGA
706-542-0263
lfoley.ccrc.uga.edu
----- Original Message -----
From: Robert Duke
[mailto:rduke.email.unc.edu]
To: AMBER Developers Mailing List
[mailto:amber-developers.ambermd.org]
Sent: Wed, 17 Mar 2010 09:35:55 -0400
Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate
> Hi Lachele,
> Sorry, I have not been following all this in detail, but there really should
> be no problem with the code itself at the level of pmemd or sander.  The
> ptraj stuff I don't have a clue about, as I have not looked at it.  I still
> strongly suspect some system component, be it the system/filesystem itself,
> mpi, or the compiler.  If you get problems on uniprocessor runs (I don't
> remember), well that rules out mpi.  If disk access is via nfs, then all
> bets are off at various levels.  You either need local disk or a serious
> parallel file system that you can access via a good interface (so you have
> lustre I believe, but I don't know if there is something funky about how it
> is connected); I have seen nfs cause serious problems, especially when it
> either is on a slow ethernet interface or if it shares the interconnect used
> by mpi.  It is also possible to configure nfs in ways where the sync between
> the program and the disk is looser, and I could envision that causing
> problems (I am rusty on nfs, but I have seen funky things occur before
> without a lot of effort).  On ptraj hanging, well, there are multiple ways
> to make this sort of thing happen, but two common ones have to do with 1)
> how you configure mpi buffers on your system (buffers too small on the
> system and too big in the code can cause really interesting buffer
> allocation race conditions which result in hangs; simple mpi has to be
> configured for buffer size, and the operating system itself has to also be
> configured), and 2) how you synchronize mpi activity in the code.  I think
> there is the potential for a lot of work to solve the problems you are
> having, but I could be overestimating it.  I WOULD NOT add netcdf to the
> mix at this point in time, or mkl for that matter.  The more complex layers
> you pile onto this mess, the harder it is going to be to fix.  Well, that
> was all probably no help at all...
> Regards - Bob
> ----- Original Message ----- 
> From: "Lachele Foley" <lfoley.ccrc.uga.edu>
> To: "AMBER Developers Mailing List" <amber-developers.ambermd.org>
> Sent: Wednesday, March 17, 2010 9:10 AM
> Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools 
> release candidate
> 
> 
> > ??? Neither sander nor pmemd do parallel i/o, as far as I can see.
> 
> Exactly...  That was one of my first questions to Bob D: do the file writes
> do anything fancy like leave pointers hanging mid-file, etc.?  He said "just
> dumb-bunny appends," which is what makes sense to do.
> 
> > Does your problem exist with netcdf trajectories?
> 
> Haven't tried netcdf.  I guess I have a job now for the grad student who 
> just offered to help...  I doubt it will matter a lot, because I've also 
> seen issues, for example, in my min.o file.  But, all information helps.
> 
> Regarding Ross's comment: since the filesystem I had the problem on is
> Lustre, "a real parallel filesystem," ptraj.MPI should be fine there
> (better, even?), not hung forever.  Right?
> 
> 
> :-) Lachele
> --
> B. Lachele Foley, PhD '92,'02
> Assistant Research Scientist
> Complex Carbohydrate Research Center, UGA
> 706-542-0263
> lfoley.ccrc.uga.edu
> 
> 
> ----- Original Message -----
> From: case
> [mailto:case.biomaps.rutgers.edu]
> To: AMBER Developers Mailing List
> [mailto:amber-developers.ambermd.org]
> Sent: Wed, 17 Mar 2010 07:43:57 -0400
> Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate
> 
> 
> > On Tue, Mar 16, 2010, Lachele Foley wrote:
> > >
> > > > I wouldn't be surprised if it did turn out to be a FS issue - even simple
> > > > NFS mount points can get really wacky sometimes (same file has different
> > > > contents/attributes on different computers etc). For now I would say avoid
> > > > using ptraj over any network filesystem in parallel.
> > >
> > > ...or sander, pmemd...
> >
> > ??? Neither sander nor pmemd do parallel i/o, as far as I can see.
> >
> > Does your problem exist with netcdf trajectories?
> >
> > ...dac
> >
> >
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Mar 17 2010 - 10:00:04 PDT