Thanks for testing. That helps.
> I wouldn't be surprised if it did turn out to be a FS issue - even simple
> NFS mount points can get really wacky sometimes (same file has different
> contents/attributes on different computers etc). For now I would say avoid
> using ptraj over any network filesystem in parallel.
...or sander, pmemd...
The only thing that doesn't give us corrupted output, so far, is a whacked-together GNU compile of serial sander. (I haven't had the time/patience/sanity to compile a recent, reliable gcc from scratch over a link to Ireland.) Switching to a different ifort doesn't fix it, based on preliminary results.
We should know in the next day or three whether writing to the root (ext3) filesystem on a compute node works. We have to collect a lot of data to test it well. I really want to figure this out... it is maddening to have data just disappear.
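For anyone who wants to run a similar check: the idea is just to write a file with a predictable byte pattern to the mount point under test, flush it, and re-read it looking for dropped or altered bytes. Here is a minimal sketch in Python, assuming nothing about our actual harness (the 50 MB size, the digit pattern, and the default path are made up for illustration):

#!/usr/bin/env python
# Hypothetical corruption check (illustration only, not our real harness):
# write a known repeating pattern to the filesystem under test, then verify
# every byte on read-back.  Dropped characters show up as a short file or
# as a mismatch at some offset.
import os
import sys

PATTERN = b"0123456789"
CHUNK = 65530  # a multiple of len(PATTERN), so chunks stay pattern-aligned

def write_pattern(path, n_bytes=50 * 1000 * 1000):
    block = PATTERN * (CHUNK // len(PATTERN))
    with open(path, "wb") as f:
        written = 0
        while written < n_bytes:
            piece = block[:n_bytes - written]
            f.write(piece)
            written += len(piece)
        f.flush()
        os.fsync(f.fileno())  # force the data out to the mount point
    return written

def verify_pattern(path, expected_bytes):
    size = os.path.getsize(path)
    if size != expected_bytes:
        print("size mismatch: expected %d, got %d (%d bytes dropped)"
              % (expected_bytes, size, expected_bytes - size))
    good = PATTERN * (CHUNK // len(PATTERN))
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            if block != good[:len(block)]:
                for i in range(len(block)):  # locate the first bad byte
                    if block[i] != good[i]:
                        print("first mismatch at byte offset %d" % (offset + i))
                        break
                return False
            offset += len(block)
    return size == expected_bytes

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "corruption_test.dat"
    n = write_pattern(target)
    print("clean" if verify_pattern(target, n) else "corrupted")

Running that once against the NFS mount and once against a local disk (e.g. the node's root ext3) should at least show whether the dropped characters follow the filesystem or the binary.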
:-) Lachele
--
B. Lachele Foley, PhD '92,'02
Assistant Research Scientist
Complex Carbohydrate Research Center, UGA
706-542-0263
lfoley.ccrc.uga.edu
----- Original Message -----
From: Daniel Roe [mailto:daniel.r.roe.gmail.com]
To: AMBER Developers Mailing List [mailto:amber-developers.ambermd.org]
Sent: Tue, 16 Mar 2010 22:34:55 -0400
Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate
> On Tue, Mar 16, 2010 at 9:54 PM, Lachele Foley <lfoley.ccrc.uga.edu> wrote:
>
> > Are you getting slowdown or hangs forever? For me, it never completes --
> > or, at least, doesn't complete after 45 minutes on four processors.
> > Compared to two seconds, that's close enough to forever for me.
> >
>
> Slowdowns only - my tests complete. Here are some timing results over an NFS
> filesystem for the first part of the ptraj_comprehensive test case (the part
> that uses the ptraj.in input file):
> Single processor:
> -rwxr-xr-x 1 droe case 2001557 Mar 16 16:12
> /home/droe/Amber/CVS/amber11/bin/ptraj
> 4 seconds.
> Timings...
>
> -------------------------------
> | Check Input Time | 0.000 |
> | Input Time | 0.000 |
> | Output Time | 0.010 |
> | Action Time | 4.040 |
> |------------------|----------|
> | Total Time | 4.050 |
> -------------------------------
>
> Pretty consistent - run takes 4 seconds, which agrees with the internal
> timings.
> 2 processor:
> -rwxr-xr-x 1 droe case 3098145 Mar 16 15:21
> /home/droe/Amber/CVS/amber11/bin/ptraj.MPI
> case1
> time for 1 loops = 0.000419139862061 seconds
> 13 seconds.
> Timings...
>
> ------------------------------------------
> | Rank | 0 | 1 |
> |------------------|----------|----------|
> | Check Input Time | 0.010 | 0.009 |
> | Input Time | 1.346 | 1.131 |
> | Output Time | 1.013 | 1.176 |
> | Action Time | 2.801 | 2.853 |
> |------------------|----------|----------|
> | Total Time | 5.170 | 5.169 |
> ------------------------------------------
>
> -----------------------------------------------------
> | | Average | Longest | Total |
> |------------------|----------|----------|----------|
> | Check Input Time | 0.009 | 0.010 | 0.019 |
> | Input Time | 1.239 | 1.346 | 2.477 |
> | Output Time | 1.094 | 1.176 | 2.188 |
> | Action Time | 2.827 | 2.853 | 5.655 |
> |------------------|----------|----------|----------|
> | Total Time | 5.169 | 5.385 | 10.339 |
> -----------------------------------------------------
>
> Note how even though the internal timings for the multiprocessor run are
> only a little slower, the actual runtime (13 s, first line) is over twice
> that, which implies communication issues. Now take a look at a run on a
> local disk (I'm only showing 2 processors - the timings for 1 processor are
> essentially the same):
> 2 processors:
> -rwxr-xr-x 1 droe case 3098145 Mar 16 15:21
> /home/droe/Amber/CVS/amber11/bin/ptraj.MPI
> case1
> time for 1 loops = 0.00016713142395 seconds
> 2 seconds.
> Timings...
>
> ------------------------------------------
> | Rank | 0 | 1 |
> |------------------|----------|----------|
> | Check Input Time | 0.000 | 0.001 |
> | Input Time | 0.158 | 0.152 |
> | Output Time | 0.007 | 0.128 |
> | Action Time | 2.183 | 2.069 |
> |------------------|----------|----------|
> | Total Time | 2.349 | 2.350 |
> ------------------------------------------
>
> -----------------------------------------------------
> | | Average | Longest | Total |
> |------------------|----------|----------|----------|
> | Check Input Time | 0.001 | 0.001 | 0.001 |
> | Input Time | 0.155 | 0.158 | 0.310 |
> | Output Time | 0.068 | 0.128 | 0.136 |
> | Action Time | 2.126 | 2.183 | 4.252 |
> |------------------|----------|----------|----------|
> | Total Time | 2.350 | 2.470 | 4.699 |
> -----------------------------------------------------
>
> Now there is a speedup compared to 1 processor.
>
>
> > I'm not sure if you've seen bug 126, but... We've been getting corrupted
> > output files. For example, every two million characters or so, the file
> > will be missing a couple. Matt is setting up tests to use the different
> > mount points (file systems), and will run them tomorrow.
> >
>
> I wouldn't be surprised if it did turn out to be a FS issue - even simple
> NFS mount points can get really wacky sometimes (same file has different
> contents/attributes on different computers etc). For now I would say avoid
> using ptraj over any network filesystem in parallel.
>
> -Dan
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers