On Tue, Mar 16, 2010 at 9:54 PM, Lachele Foley <lfoley.ccrc.uga.edu> wrote:
> Are you getting slowdown or hangs forever? For me, it never completes --
> or, at least, doesn't complete after 45 minutes on four processors.
> Compared to two seconds, that's close enough to forever for me.
>
Slowdowns only - my tests complete. Here are some timing results over an NFS
filesystem for the first part of the ptraj_comprehensive test case (the part
that uses the ptraj.in input file):
Single processor:
-rwxr-xr-x 1 droe case 2001557 Mar 16 16:12
/home/droe/Amber/CVS/amber11/bin/ptraj
4 seconds.
Timings...
-------------------------------
| Check Input Time |    0.000 |
| Input Time       |    0.000 |
| Output Time      |    0.010 |
| Action Time      |    4.040 |
|------------------|----------|
| Total Time       |    4.050 |
-------------------------------
Pretty consistent - run takes 4 seconds, which agrees with the internal
timings.
2 processor:
-rwxr-xr-x 1 droe case 3098145 Mar 16 15:21
/home/droe/Amber/CVS/amber11/bin/ptraj.MPI
case1
time for 1 loops = 0.000419139862061 seconds
13 seconds.
Timings...
------------------------------------------
| Rank             |        0 |        1 |
|------------------|----------|----------|
| Check Input Time |    0.010 |    0.009 |
| Input Time       |    1.346 |    1.131 |
| Output Time      |    1.013 |    1.176 |
| Action Time      |    2.801 |    2.853 |
|------------------|----------|----------|
| Total Time       |    5.170 |    5.169 |
------------------------------------------
-----------------------------------------------------
|                  |  Average |  Longest |    Total |
|------------------|----------|----------|----------|
| Check Input Time |    0.009 |    0.010 |    0.019 |
| Input Time       |    1.239 |    1.346 |    2.477 |
| Output Time      |    1.094 |    1.176 |    2.188 |
| Action Time      |    2.827 |    2.853 |    5.655 |
|------------------|----------|----------|----------|
| Total Time       |    5.169 |    5.385 |   10.339 |
-----------------------------------------------------
Note how even though the internal timings for the multiprocessor run are
only a little slower, the actual runtime (13 s, first line) is over twice
that, which implies communication issues. Now take a look at a run on a
local disk (I'm only showing 2 processors - the timings for 1 processor are
essentially the same):
2 processors:
-rwxr-xr-x 1 droe case 3098145 Mar 16 15:21
/home/droe/Amber/CVS/amber11/bin/ptraj.MPI
case1
time for 1 loops = 0.00016713142395 seconds
2 seconds.
Timings...
------------------------------------------
| Rank             |        0 |        1 |
|------------------|----------|----------|
| Check Input Time |    0.000 |    0.001 |
| Input Time       |    0.158 |    0.152 |
| Output Time      |    0.007 |    0.128 |
| Action Time      |    2.183 |    2.069 |
|------------------|----------|----------|
| Total Time       |    2.349 |    2.350 |
------------------------------------------
-----------------------------------------------------
|                  |  Average |  Longest |    Total |
|------------------|----------|----------|----------|
| Check Input Time |    0.001 |    0.001 |    0.001 |
| Input Time       |    0.155 |    0.158 |    0.310 |
| Output Time      |    0.068 |    0.128 |    0.136 |
| Action Time      |    2.126 |    2.183 |    4.252 |
|------------------|----------|----------|----------|
| Total Time       |    2.350 |    2.470 |    4.699 |
-----------------------------------------------------
Now there is a speedup compared to 1 processor (about 2.35 s of internal
time versus roughly 4 s).
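To make the comparison explicit, here is a minimal Python sketch of the arithmetic behind the numbers above. The wall-clock values are the whole-second run times reported before each table, the internal values are the per-rank "Total Time" entries, and the serial local-disk figure is assumed to equal the NFS one (4.050 s), since those timings were essentially the same. The function and variable names are made up for illustration and are not part of ptraj.

```python
# Figures copied from the runs above: wall-clock seconds (whole-second
# resolution) and the largest per-rank internal "Total Time".
RUNS = {
    "1 proc, NFS": {"wall": 4.0, "internal": 4.050},
    "2 procs, NFS": {"wall": 13.0, "internal": 5.170},
    "2 procs, local disk": {"wall": 2.0, "internal": 2.350},
}

# Assumption: the serial run on local disk takes about the same as on NFS.
SERIAL_INTERNAL = 4.050


def unaccounted(wall, internal):
    """Wall-clock seconds not covered by the internal timers (never negative).

    The wall time is only known to the nearest second, so small negative
    differences are treated as zero.
    """
    return max(0.0, wall - internal)


def speedup(serial, parallel):
    """Ratio of serial to parallel time; values above 1 mean a speedup."""
    return serial / parallel


if __name__ == "__main__":
    for name, t in RUNS.items():
        print(
            f"{name}: unaccounted {unaccounted(t['wall'], t['internal']):.2f} s, "
            f"speedup vs serial {speedup(SERIAL_INTERNAL, t['internal']):.2f}"
        )
```

For the 2-processor NFS run this leaves roughly 7.8 s of the 13 s wall time outside the internal timers, while on local disk essentially nothing is unaccounted for.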
> I'm not sure if you've seen bug 126, but... We've been getting corrupted
> output files. For example, every two million characters or so, the file
> will be missing a couple. Matt is setting up tests to use the different
> mount points (file systems), and will run tomorrow.
>
I wouldn't be surprised if it did turn out to be a FS issue - even simple
NFS mount points can get really wacky sometimes (same file has different
contents/attributes on different computers etc). For now I would say avoid
running ptraj in parallel over any network filesystem.
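
If it is not obvious whether a working directory sits on a network filesystem, a small check along these lines may help. This is only a sketch: it assumes a Linux-style mount table (the whitespace-separated "device mountpoint fstype ..." layout of /proc/mounts), the set of network filesystem names is illustrative rather than exhaustive, the sample table is made up, and none of the names below come from ptraj.

```python
import os

# Illustrative, not exhaustive: filesystem types treated as "network".
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "lustre", "gpfs"}

# Hypothetical mount table in /proc/mounts layout, used only as an example.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
server:/export/home /home nfs rw,vers=3 0 0
"""


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    Mount points with escaped characters (e.g. spaces written as \\040) are
    not handled, and symlinks in path are not resolved.
    """
    path = os.path.abspath(path)
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /home does not match /homework.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best[0]):
            best = (mount_point, fs_type)
    return best


def on_network_fs(path, mounts_text):
    """True if path lives on one of the filesystem types in NETWORK_FS."""
    return fs_type_for(path, mounts_text)[1] in NETWORK_FS
```

On a Linux machine the real table could be passed in as open("/proc/mounts").read(); with the sample table above, on_network_fs("/home/droe/run", SAMPLE_MOUNTS) reports the NFS mount.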
-Dan
Received on Tue Mar 16 2010 - 20:00:04 PDT