Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate

From: Lachele Foley <>
Date: Tue, 16 Mar 2010 23:30:07 -0400

Thanks for testing. That helps.

> I wouldn't be surprised if it did turn out to be a FS issue - even simple
> NFS mount points can get really wacky sometimes (same file has different
> contents/attributes on different computers etc). For now I would say avoid
> using ptraj over any network filesystem in parallel.

...or sander, pmemd...

The only thing that doesn't give us corrupted output, so far, is a whacked-together GNU compile of serial sander. (I haven't had the time/patience/sanity to compile a recent/reliable gcc from scratch over a link to Ireland.) In preliminary results, switching to a different ifort doesn't fix it.

We should know in the next day or three if writing to the root (ext3) filesystem on a compute node works. We have to collect a lot of data to test it well. I really want to figure this out... it is maddening to have data just disappear.
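
One way to collect that kind of data could be a simple write/read-back check along the lines below. This is only a minimal sketch, not what we actually ran: the function names (make_payload, check_roundtrip) and the 4 MB default size are made up for illustration. It writes a known, deterministic payload into a directory on the filesystem under test, reads it back, and reports the byte count and the offset of the first difference if the contents don't match.

```python
import hashlib
import os
import tempfile


def make_payload(n_bytes, seed=b"ptraj-fs-check"):
    """Build a deterministic byte string of length n_bytes from SHA-256 blocks."""
    chunks = []
    counter = 0
    total = 0
    while total < n_bytes:
        block = hashlib.sha256(seed + str(counter).encode()).digest()
        chunks.append(block)
        total += len(block)
        counter += 1
    return b"".join(chunks)[:n_bytes]


def check_roundtrip(directory, n_bytes=4_000_000):
    """Write a known payload into `directory`, read it back, and compare.

    Returns (ok, bytes_read, first_diff_offset).
    first_diff_offset is None when the contents match.
    """
    expected = make_payload(n_bytes)
    # Hypothetical scratch file; it is removed again after the check.
    fd, path = tempfile.mkstemp(dir=directory, suffix=".fscheck")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(expected)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data out
        with open(path, "rb") as handle:
            actual = handle.read()
    finally:
        os.remove(path)

    if actual == expected:
        return True, len(actual), None

    # Missing characters show up as a shorter file plus an early mismatch.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(actual, expected)) if a != b),
        min(len(actual), len(expected)),
    )
    return False, len(actual), first_diff
```

Calling check_roundtrip("/path/to/mount") once per mount point, many times over, would give a pass/fail count per filesystem. Note that reading the file back on the same node may be served from the client's cache, so a stricter test would write on one node and read on another.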

:-) Lachele
B. Lachele Foley, PhD '92,'02
Assistant Research Scientist
Complex Carbohydrate Research Center, UGA
----- Original Message -----
From: Daniel Roe
To: AMBER Developers Mailing List
Sent: Tue, 16 Mar 2010 22:34:55
Subject: Re: [AMBER-Developers] infinite ptraj.MPI, was: First AmberTools release candidate
> On Tue, Mar 16, 2010 at 9:54 PM, Lachele Foley <> wrote:
> > Are you getting slowdown or hangs forever?  For me, it never completes --
> > or, at least, doesn't complete after 45 minutes on four processors.
> >  Compared to two seconds, that's close enough to forever for me.
> >
> Slowdowns only - my tests complete. Here are some timing results over an NFS
> filesystem for the first part of the ptraj_comprehensive test case (the part
> that uses the input file):
> Single processor:
> -rwxr-xr-x 1 droe case 2001557 Mar 16 16:12
> /home/droe/Amber/CVS/amber11/bin/ptraj
> 4 seconds.
> Timings...
> -------------------------------
> | Check Input Time |    0.000 |
> | Input Time       |    0.000 |
> | Output Time      |    0.010 |
> | Action Time      |    4.040 |
> |------------------|----------|
> | Total Time       |    4.050 |
> -------------------------------
> Pretty consistent - run takes 4 seconds, which agrees with the internal
> timings.
> 2 processor:
> -rwxr-xr-x 1 droe case 3098145 Mar 16 15:21
> /home/droe/Amber/CVS/amber11/bin/ptraj.MPI
> case1
> time for 1 loops = 0.000419139862061 seconds
> 13 seconds.
> Timings...
> ------------------------------------------
> | Rank             |     0    |     1    |
> |------------------|----------|----------|
> | Check Input Time |    0.010 |    0.009 |
> | Input Time       |    1.346 |    1.131 |
> | Output Time      |    1.013 |    1.176 |
> | Action Time      |    2.801 |    2.853 |
> |------------------|----------|----------|
> | Total Time       |    5.170 |    5.169 |
> ------------------------------------------
> -----------------------------------------------------
> |                  | Average  | Longest  |  Total   |
> |------------------|----------|----------|----------|
> | Check Input Time |    0.009 |    0.010 |    0.019 |
> | Input Time       |    1.239 |    1.346 |    2.477 |
> | Output Time      |    1.094 |    1.176 |    2.188 |
> | Action Time      |    2.827 |    2.853 |    5.655 |
> |------------------|----------|----------|----------|
> | Total Time       |    5.169 |    5.385 |   10.339 |
> -----------------------------------------------------
> Note how even though the internal timings for the multiprocessor run are
> only a little slower, the actual runtime (13 s, first line) is over twice
> that, which implies communication issues. Now take a look at a run on a
> local disk (I'm only showing 2 processors - the timings for 1 processor are
> essentially the same):
> 2 processors:
> -rwxr-xr-x 1 droe case 3098145 Mar 16 15:21
> /home/droe/Amber/CVS/amber11/bin/ptraj.MPI
> case1
> time for 1 loops = 0.00016713142395 seconds
> 2 seconds.
> Timings...
> ------------------------------------------
> | Rank             |     0    |     1    |
> |------------------|----------|----------|
> | Check Input Time |    0.000 |    0.001 |
> | Input Time       |    0.158 |    0.152 |
> | Output Time      |    0.007 |    0.128 |
> | Action Time      |    2.183 |    2.069 |
> |------------------|----------|----------|
> | Total Time       |    2.349 |    2.350 |
> ------------------------------------------
> -----------------------------------------------------
> |                  | Average  | Longest  |  Total   |
> |------------------|----------|----------|----------|
> | Check Input Time |    0.001 |    0.001 |    0.001 |
> | Input Time       |    0.155 |    0.158 |    0.310 |
> | Output Time      |    0.068 |    0.128 |    0.136 |
> | Action Time      |    2.126 |    2.183 |    4.252 |
> |------------------|----------|----------|----------|
> | Total Time       |    2.350 |    2.470 |    4.699 |
> -----------------------------------------------------
> Now there is a speedup compared to 1 processor.
> > I'm not sure if you've seen bug 126, but...  We've been getting corrupted
> > output files.  For example, every two million characters or so, the file
> > will be missing a couple.  Matt is setting up tests to use the different
> > mount points (file systems), and will run tomorrow.
> >
> I wouldn't be surprised if it did turn out to be a FS issue - even simple
> NFS mount points can get really wacky sometimes (same file has different
> contents/attributes on different computers etc). For now I would say avoid
> using ptraj over any network filesystem in parallel.
> -Dan
> _______________________________________________
> AMBER-Developers mailing list
Received on Tue Mar 16 2010 - 21:00:03 PDT