Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate

From: Robert Duke <rduke.email.unc.edu>
Date: Wed, 17 Mar 2010 13:01:23 -0400

Hi Lachele,
I would think you can build pmemd with ifort and gcc combined without
problems, especially if you are not dragging mkl into the mix (leaving it
out just simplifies the linkage). I used to do this all the time because it
was simpler, and the C in use in pmemd is really incredibly simple - just a
wrapped system call or two, and some code to determine how much memory is
being used by some records, I believe (roughly the kind of thing sketched
below). I would get the most solid ifort I could and go from there without
mkl (I have done okay with 10.1.21 myself; I have not tried 23, but it may
be a small enough increment to be fine). I recollect Scott posting a list of
okay ifort releases, buried somewhere in my mail. I have no clue what the
story is with gromacs working; it could be they are not looking that hard at
what is coming out, or it could be that the fortran is the real grief, or
some incorrect library mix from using fortran. That doesn't help explain
ptraj, but that is young code without a long history of working in parallel.
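For what it's worth, the C side really is about this trivial. This is just a
sketch from memory, not the actual pmemd source, and the name get_rss_kb_ is
made up:

/* Sketch only: a C wrapper the Fortran side can call to ask the OS how
   much memory the process is using. */
#include <sys/resource.h>

void get_rss_kb_(long *rss_kb)   /* trailing underscore: typical Fortran name mangling */
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    *rss_kb = ru.ru_maxrss;      /* peak resident set size, reported in kB on Linux */
}

From Fortran you would just declare it external and call get_rss_kb(mem_kb);
nothing in there cares whether the Fortran compiler is ifort or gfortran,
which is why the mixed build is not a big deal.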
Regards - Bob
----- Original Message -----
From: "Lachele Foley" <lfoley.ccrc.uga.edu>
To: "AMBER Developers Mailing List" <amber-developers.ambermd.org>
Sent: Wednesday, March 17, 2010 12:37 PM
Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate


Bob, thanks for the info. It helped.

There might be several issues. There's no reason it couldn't be both a
filesystem issue -and- something mismatched about the MPI buffers (the sort
of hang pattern sketched below). As far as I know, HP set up the MPI, not
me, but I'll check. I can't mess with this much more for the next week or
so, but after that I'll do more.
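For reference, the buffer hang Bob described earlier is basically the classic
"unsafe send" pattern below: two ranks each post a blocking send before any
receive, which only completes as long as the messages still fit in the eager
buffers. This is just a minimal sketch to remind myself what to look for,
nothing taken from ptraj, and the message size is made up:

#include <mpi.h>
#include <stdlib.h>

#define NVALS (1 << 22)   /* ~32 MB of doubles: enough to blow past most eager limits */

int main(int argc, char **argv)
{
    int rank, peer;
    double *sendbuf = calloc(NVALS, sizeof(double));
    double *recvbuf = calloc(NVALS, sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;              /* run with exactly two ranks */

    /* Small messages get buffered eagerly and this "works"; big ones switch
       to the rendezvous protocol, both sends block waiting for a matching
       receive, and the job hangs. */
    MPI_Send(sendbuf, NVALS, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, NVALS, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

The usual cure is to pair the traffic (MPI_Sendrecv or nonblocking calls)
rather than to keep inflating the system buffers, which is why Bob says both
the code and the system configuration matter.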

The trouble is that Gromacs is being run frequently and doesn't show any of
these issues (I haven't run it myself -- I'm accepting third-party info). So
I'm having to work extra hard to show that it's most likely the system, and
not bad coding in AMBER that's only now surfacing. I'm keeping you in the
loop since it probably matters to you what happens on fancy new machines.

The filesystem is Lustre and all of it was put together by HP for the
purpose of doing exactly this sort of thing, so my take is that it should be
working like a charm. I think the interconnect is all infiniband except
maybe a few hardware management ports. Can't recall exactly.

I got a new ifort (10.0.23) and tried it, but didn't have a matching icc at
the moment. I compiled sander anyway, and I still get data corruption in
serial and infant mortality in sander.MPI. Pmemd wouldn't build, complaining
that ifort and icc didn't match. So, tentatively, it's not exactly the
compiler.

Maybe there is a mismatch somewhere between the GNU OS libraries and the
Intel libraries in the code. Presumably the SFS firmware/software uses Intel
too. Would that make any sense?

Maybe what the grad students are up to at the moment will help some, too.

Must get back to work now... Thanks for all the help, to all of you. I do
appreciate it.

:-) Lachele
--
B. Lachele Foley, PhD '92,'02
Assistant Research Scientist
Complex Carbohydrate Research Center, UGA
706-542-0263
lfoley.ccrc.uga.edu
----- Original Message -----
From: Robert Duke [mailto:rduke.email.unc.edu]
To: AMBER Developers Mailing List [mailto:amber-developers.ambermd.org]
Sent: Wed, 17 Mar 2010 09:35:55 -0400
Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate
> Hi Lachele,
> Sorry, I have not been following all this in detail, but there really
> should be no problem with the code itself at the level of pmemd or sander.
> The ptraj stuff I don't have a clue about, as I have not looked at it. I
> still strongly suspect some system component, be it the system/filesystem
> itself, mpi, or the compiler. If you get problems on uniprocessor runs (I
> don't remember), well that rules out mpi. If disk access is via nfs, then
> all bets are off at various levels. You either need local disk or a
> serious parallel file system that you can access via a good interface (so
> you have lustre I believe, but I don't know if there is something funky
> about how it is connected); I have seen nfs cause serious problems,
> especially when it either is on a slow ethernet interface or if it shares
> the interconnect used by mpi. It is also possible to configure nfs in ways
> where the sync between the program and the disk is looser, and I could
> envision that causing problems (I am rusty on nfs, but I have seen funky
> things occur before without a lot of effort). On ptraj hanging, well,
> there are multiple ways to make this sort of thing happen, but two common
> ones have to do with 1) how you configure mpi buffers on your system
> (buffers too small on the system and too big in the code can cause really
> interesting buffer allocation race conditions which result in hangs;
> simple mpi has to be configured for buffer size, and the operating system
> itself has to also be configured), and 2) how you synchronize mpi activity
> in the code. I think there is the potential for a lot of work to solve the
> problems you are having, but I could be overestimating it. I WOULD NOT add
> netcdf to the mix at this point in time, or mkl for that matter. The more
> complex layers you pile onto this mess, the harder it is going to be to
> fix. Well, that was all probably no help at all...
> Regards - Bob
> ----- Original Message ----- 
> From: "Lachele Foley" <lfoley.ccrc.uga.edu>
> To: "AMBER Developers Mailing List" <amber-developers.ambermd.org>
> Sent: Wednesday, March 17, 2010 9:10 AM
> Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools
> release candidate
>
>
> > ??? Neither sander nor pmemd do parallel i/o, as far as I can see.
>
> Exactly... That was one of my first questions to Bob D: do the file
> writes do anything fancy, like leave pointers hanging mid-file, etc.? He
> said "just dumb-bunny appends," which is what makes sense to do.
>
> > Does your problem exist with netcdf trajectories?
>
> Haven't tried netcdf.  I guess I have a job now for the grad student who
> just offered to help...  I doubt it will matter a lot, because I've also
> seen issues, for example, in my min.o file.  But, all information helps.
>
> Regarding Ross's comment: since the filesystem I had the problem on is
> Lustre, "a real parallel filesystem," ptraj.MPI should be fine there
> (better, even?), not hung forever. Right?
>
>
> :-) Lachele
> --
> B. Lachele Foley, PhD '92,'02
> Assistant Research Scientist
> Complex Carbohydrate Research Center, UGA
> 706-542-0263
> lfoley.ccrc.uga.edu
>
>
> ----- Original Message -----
> From: case [mailto:case.biomaps.rutgers.edu]
> To: AMBER Developers Mailing List [mailto:amber-developers.ambermd.org]
> Sent: Wed, 17 Mar 2010 07:43:57 -0400
> Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate
>
>
> > On Tue, Mar 16, 2010, Lachele Foley wrote:
> > >
> > > > I wouldn't be surprised if it did turn out to be a FS issue - even
> > > > simple NFS mount points can get really wacky sometimes (same file has
> > > > different contents/attributes on different computers etc). For now I
> > > > would say avoid using ptraj over any network filesystem in parallel.
> > >
> > > ...or sander, pmemd...
> >
> > ??? Neither sander nor pmemd do parallel i/o, as far as I can see.
> >
> > Does your problem exist with netcdf trajectories?
> >
> > ...dac
> >
> >
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Mar 17 2010 - 10:30:02 PDT