amber-developers: Troubles at PSC

From: Robert Duke <rduke.email.unc.edu>
Date: Thu, 4 May 2006 14:31:52 -0400

Guys -
This is a "heads-up" if you are attempting to do high scaling jobs at PSC on
bigben these days. I recently went out there to confirm some benchmarking
values/help their applications support people (pmemd of course). Now the
standard factor ix and jac benchmarks in the amber 9 benchmark suite
are not "real" benchmarks in my view in that they do not write trajectories,
as would be done for most md runs. So I have always benchmarked with a more
realistic setup, whereby I dump trajectory every 250 steps (probably should
be even more frequent for factor ix since it uses a 1.5 fsec timestep).
Anyway, the difference in performance between the amber factor ix
benchmark and my nvt factor ix benchmark + trajectory is typically
negligible. Well, all the disk systems at psc on bigben are currently hosed
and if you try to write anything to scratch volumes, depending on the phase
of the moon you may stall very badly in the master. This is not a pmemd
problem; it is a poor system management problem. You also can't retreat to
your home volumes; they are stalling also. If you run the standard "what
trajectory? who uses a trajectory?" benchmarks in our tree, everything will
look fine except for perhaps a slow startup time (2 minutes instead of 2
seconds) associated with the master reading the prmtop/inpcrd. This is
really really bad for real runs. I had a 64 processor benchmark yesterday
that took 3 times longer to run than usual, after spending something like 3
minutes in setup. I have complained loudly. My friend John Urbanic at psc
(associated with the xt3 from the start - I was in friendly-user mode with
John as support shortly after they got the xt3 out of the crate at psc) is
trying to help, but in my experience psc has a really bad history in regard
to user services once a machine goes public. In the pmemd 8 timeframe they
took over a year to get system software fixed on lemieux that would allow
the use of two rails (communications paths for mpi - significantly better
performance). Also, the whole time that I have been testing on lemieux, I
have used the home volumes because the scratch volumes have demonstrated the
same sort of sporadic stalling I am now seeing on bigben. Very frustrating.
I think I will be moving the majority of my efforts to other computer
centers unless I see some action at psc fast, for a change. Oh, and another
aside, I also had them wipe out a bunch of prmtop/inpcrds that were copied
out onto scratch in disk scavenging after they had been there < 12 hrs.
Apparently they look at "last modified" times, not "last accessed" times in
scavenging on scratch, so you can't be sure your files will survive through
a run unless you do a nonarchival copy or touch them after moving them. A
real feature (saved me a few hours of cpu time last night because they blew
away 3/4 of my runs).
Best Regards - Bob
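
A minimal sketch of the "nonarchival copy or touch" workaround mentioned above, assuming the scavenger really does key on "last modified" time as described. The function name stage_to_scratch and any paths passed to it are hypothetical; nothing here is part of PSC or Amber tooling.

```python
import os
import shutil
from pathlib import Path


def stage_to_scratch(src, scratch_dir):
    """Copy src into scratch_dir and stamp the copy with the current time.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    dest = Path(scratch_dir) / Path(src).name
    # shutil.copyfile copies file contents only. Unlike shutil.copy2 (an
    # "archival" style copy), it does not carry over the source timestamps.
    shutil.copyfile(src, dest)
    # Equivalent of `touch`: set access and modification times to now, so
    # the copy looks fresh even if the copy step had preserved old times.
    os.utime(dest, None)
    return dest
```

Calling stage_to_scratch on each prmtop/inpcrd before submitting a job would leave the scratch copies with a current modification time. Whether that is enough to survive scavenging depends on the center's actual policy, which is not documented here.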
Received on Fri May 05 2006 - 08:13:56 PDT