RE: amber-developers: Fw: How many atoms?

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 4 Dec 2007 21:51:16 -0800

Hi Bob,
 
The key thing to remember here is that Blue Gene/L is old technology and
will largely be defunct within the Amber 10 lifespan. Of all the large-scale
machines that exist, Blue Gene should be the very last one that we target.
The main advantage of Blue Gene right now is that it provides easy access to
a large number of processors for testing and debugging. However, I would not
envisage anyone asking for time on Blue Gene systems to do serious MD
simulations with AMBER.
 
Instead, the two most relevant large-scale machines for US academics in the
2008 to 2010 timeframe will be Ranger at TACC and the Cray machine at ORNL.
Since ORNL has not announced what their architecture will actually consist
of, the only known metric is Ranger. It will have 62,976 cores, and you can
expect a large proportion of them to be idle at any one time, at least in
the first year of operation. Hence the landscape is changing rapidly.
Ranger, I believe, will provide more SUs than the sum of all previously
allocated SUs in the history of NSF supercomputing, so it should be the
metric by which we measure things. This, coupled with the ORNL machine, will
provide so much computing time that almost every US academic who wishes to
apply for time will be able to get more SUs than they could hope to obtain
by building their own in-house cluster.
 
This machine will have 2 GB of memory per core and 16-way nodes, for 32 GB
of memory per node. So the memory limit will be 2 GB per MPI task in the
worst case and 32 GB per task in the best case (if you run one MPI task per
node, or simply do one asynchronous I/O operation per node instead of one
per MPI task). Given that, what are the limitations? Note that this is 64
times more memory per node than Blue Gene. Without any special modifications
to the code, arrays, etc., what is the maximum number of atoms on this
architecture? I suspect it is significantly more than what the paltry 256 MB
offered by Blue Gene allows.
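
To make the arithmetic concrete, here is a back-of-the-envelope sketch in
Python (purely illustrative, not pmemd code; the bytes-per-atom figure is a
made-up placeholder rather than a measured number):

    GB = 1024**3
    MB = 1024**2

    ranger_node_mem = 32 * GB                  # 16 cores * 2 GB/core (from above)
    bluegene_node_mem = ranger_node_mem // 64  # "64 times more memory per node"

    def max_atoms(node_mem_bytes, mpi_tasks_per_node, bytes_per_atom=1000):
        # Crude upper bound on the atom count one MPI task's share of the node
        # could hold, for a hypothetical per-atom memory cost.
        per_task = node_mem_bytes / mpi_tasks_per_node
        return int(per_task / bytes_per_atom)

    for tasks in (16, 1):  # worst case: one task per core; best: one per node
        print(f"Ranger, {tasks:2d} MPI task(s)/node: "
              f"~{max_atoms(ranger_node_mem, tasks):,} atoms per task")

    print(f"Blue Gene/L node total: {bluegene_node_mem // MB} MB")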
 
Bear in mind that these nodes will have swap as well, so they will fail
significantly more gracefully than Blue Gene does.
 
This is the architecture we need to be aiming at in order to have the
maximum impact on the maximum number of users at large scale.
 
On the longer time scale - for Amber 11 we should be aiming at the IBM
POWER7 PERCS system that will be built at NCSA - but this will ultimately
need a much greater effort involving overhauling the entire MD workflow.
Let's hope we get the PetaApps grant so we can make a real impact here.
 
All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

 


  _____

From: owner-amber-developers.scripps.edu
[mailto:owner-amber-developers.scripps.edu] On Behalf Of Robert Duke
Sent: Tuesday, December 04, 2007 20:07
To: amber-developers.scripps.edu
Subject: Re: amber-developers: Fw: How many atoms?


Hi Ross et al :-)
Thanks to all who made comments. Ross pretty much understands where I am
coming from here, I think (Ross, thanks for the current rundown on NSF
machine futures too; I probably have more indigestion over BG/L than
multicore, but I am indeed moderately ill that all these unbalanced
architectures are being foisted on us). Anyway, my 'expectations' regarding
memory problems have been set by a couple of recent events: 1) getting
whacked by memory limitations on BG/L for cellulose out around 2048
processors (if my memory serves...), and 2) the nature of the work I have
recently been doing with i/o and really large scaling. All along, I have
been bothered by the potential for all sorts of data structures dimensioned
by natom to push us over the edge on memory, and the more sophisticated the
code gets, the more combinations of maps and lists I use to make things fast
(so that is another 2 * natom every time I do that, or 1.x * natom if I get
a bit more clever for some things). The map structures tend not to scale down
with increasing processor count, so that has been a potential issue. The
thing that really had me pulling my hair out was expanding async i/o buffer
space requirements though. The larger the count of async i/o's you "post"
for later completion (so you can go do other things), the more buffer space
you need, and in some instances the amount of buffer space per communication
event does not scale down as well as you might like as the processor count
goes up. So at 2048 procs on BG/L running cellulose, this is what actually
bites you. I think I may have gotten around the worst memory problem in the
new scaling architecture today with minimal performance hit; I'll see over
the next week or so. But running something big on BG/L would definitely
require some careful work that I may not have time to complete.
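
As a toy illustration of that last point, here is a hedged sketch (the
counts and buffer sizes are invented, not pmemd's):

    # Total buffer space one task must hold for its outstanding ("posted")
    # async I/O operations. Purely illustrative; numbers are placeholders.

    def posted_io_buffer_bytes(n_posted_ops, bytes_per_buffer):
        # Each posted-but-not-yet-completed operation pins its own buffer.
        return n_posted_ops * bytes_per_buffer

    posted_per_task = 8            # hypothetical outstanding ops per task
    bytes_per_buffer = 256 * 1024  # hypothetical 256 KB per buffered record

    # Note the per-task total is independent of the processor count: adding
    # processors does not shrink it unless the buffers are made adaptive or
    # the number of outstanding operations is reduced.
    for nprocs in (512, 1024, 2048):
        kb = posted_io_buffer_bytes(posted_per_task, bytes_per_buffer) // 1024
        print(f"{nprocs:5d} procs: {kb} KB of posted-I/O buffers per task")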
 
Okay, so it sounds like people would like 1M+ atoms, nuts on BG/L
implications, so we should head in that direction. The nasty downside is
that for any memory-limited architecture, we may be setting ourselves up for
some runtime failures where folks won't understand the failure (the code
actually does produce a nice error msg for any allocation failure, but that
will show up in the system stderr rather than mdout, and could get missed,
and could happen mid-run as load balancing causes changes in memory
allocation). So we should discuss how we want to specify the new-format
inpcrd. Does leap already handle Darden's amoeba inpcrd format? Do folks
want something simpler? The advantage to the amoeba format is that both
pmemd and sander can already read it; they both just need to know to try for
both amoeba and non-amoeba runs. Then they also need to be able to
recognize that they are running >999,999 atoms and write the restrt in the
new format. What is the status of xleap/gleap in terms of Darden's inpcrd
format? Would it be easy to add the capability to output the new format
inpcrd for all systems generated by xleap/gleap? I don't want to divert to
work on this stuff in pmemd immediately, but if folks want to reach a
consensus on sander and xleap/gleap, then I can wedge the capability into
pmemd in a little while. Realistically speaking, I think if we expand to
a 100M - 1 atom capability, we should be covered for the foreseeable future, and that
is what we have with the current 'new' prmtop; of course the new prmtop and
new inpcrd actually allow going even higher by specifying a different format
than i8. The current hard architectural limit is around 134M, caused by the
size of the image identifier (27 bits; the high bits are reserved for other
info in the pairlist - also fixable). Of course, you had better have a
64-bit machine and a bit more than 4 GB/core to handle this sort of stuff...
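
For reference, a quick sketch of where those ceilings come from; the 27-bit
image identifier figure is from above, while the assumption that the 999,999
cap is simply a six-digit fixed-width field (and 100M - 1 an i8 field) is
mine:

    six_digit_field_max = 10**6 - 1  # 999,999
    i8_field_max = 10**8 - 1         # 99,999,999, i.e. "100M - 1"
    image_id_bits = 27
    image_id_max = 2**image_id_bits  # 134,217,728, i.e. "around 134M"

    print(six_digit_field_max, i8_field_max, image_id_max)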
 
Regards - Bob

----- Original Message -----
From: Ross Walker <ross.rosswalker.co.uk>
To: amber-developers.scripps.edu
Sent: Tuesday, December 04, 2007 10:14 PM
Subject: RE: amber-developers: Fw: How many atoms?

My understanding from Bob's email, and Bob can correct me if I am wrong
here, is that it is a memory consideration, i.e., large systems could use
significant amounts of memory, and it is the work involved in keeping the
memory footprint small that is complicated and time-consuming.
 
However, from what I can glean, Bob may have expectations for memory that
are somewhat lower than what will actually be deployed, based on his
experience with Blue Gene. My assertion would be that we try to
support > 999,999 atoms but in the short term not worry about the memory
requirements of such calculations. In this way the limiting factor becomes
the available memory per node and not the underlying file formats. Since
Blue Gene is the exception rather than the rule among HPC systems, I think
the problem will be much less severe than Bob is anticipating. It seems
crazy to focus effort on optimizing for the lowest common denominator,
especially when 99% of available SUs on NSF-allocated resources will
shortly be on non-Blue Gene architectures.
 
I am of course neglecting the myriad complexities involved in terms of
performance as a function of memory usage, etc., but at least for Amber 10
it would seem to make sense to aim at the types of machines that will be
generally available to NSF researchers over the next two years, and all of
these will have between 1 and 2 GB per core (4 GB+ per core if you leave
cores idle on various nodes) and enough processors to make even Bob run
away screaming that the apocalypse is coming.
 
Just my 2c.
 
All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

 


  _____

From: owner-amber-developers.scripps.edu
[mailto:owner-amber-developers.scripps.edu] On Behalf Of Carlos Simmerling
Sent: Tuesday, December 04, 2007 18:10
To: amber-developers.scripps.edu
Subject: Re: amber-developers: Fw: How many atoms?


It sounded like Bob thinks there IS a cost to doing this.
My feeling is that if there were no cost, go for it, but if it takes
away Bob's precious time that he could be using to get this
stuff up and working for smaller systems, then we should let
him focus on the sizes that people actually run rather than having
delays or overall slower code just to support things that none of us
actually simulate. Sure, it could be great PR, and yes, maybe
focusing on smaller systems isn't visionary enough, but I think
there is a lot to be gained by getting better code for more modest
systems that still have biological relevance, rather than wasting
Bob's time on code that none of us need (yet).
carlos



On Dec 4, 2007 8:46 PM, Ken Merz <merz.qtp.ufl.edu> wrote:


Hi,
 If it costs us nothing, then why not scale PMEMD beyond 999,999 atoms?
Someone out there might want to do a 1M+ atom simulation with the AMBER
program suite! Kennie

On 4 Dec 2007, at 2:14 PM, Robert Duke wrote:


Hello folks!
I am working hard on high-scaling pmemd code, and in the course of the work
it became clear to me, due to large async i/o buffer and other issues, that
going to very high atom counts may require a bunch of extra work, especially
on certain platforms (BG/L in particular...). I posed the question below
to Dave Case; he suggested I bounce it off the list, so here it is. The
crux of the matter is how people feel about having an MD capability in pmemd
for systems bigger than 999,999 atoms in the next release. Please respond
to the dev list if you have strong feelings in either direction.
Thanks much! - Bob

----- Original Message ----- From: "Robert Duke" <rduke.email.unc.edu>
To: "David A. Case" < <mailto:case.scripps.edu> case.scripps.edu>
Sent: Tuesday, December 04, 2007 8:45 AM
Subject: How many atoms?



Hi Dave,
Just thought I would pulse you about how strong the desire is to go above
1,000,000 atom systems in the next release. I personally see this as more
an advertising issue than real science; it's hard to get good
statistics/good science on 100,000 atoms let alone 10,000,000 atoms.
However, we do have competition. So the prmtop is not an issue, but the
inpcrd format is, and one thing that could be done is to move to supporting
the same type of flexible format in the inpcrd as we do in the new-style
prmtop. Tom D. has an inpcrd format in amoeba that would probably do the
trick; I can easily read this in pmemd but not yet write it (I actually have
pulled the code out - left it in the amoeba version of course, but can put
it back in as needed). I ask the question now because I am hitting size
issues already on BG/L on something like cellulose. Some of this I can fix;
some of it really is more appropriately fixed by running on 64 bit memory
systems where there actually is a multi-GB physical memory. The problem is
particularly bad with some new code I am developing, due to extensive async
i/o and requirements for buffers that at least theoretically could be pretty
big (up to natom possible; by spending a couple of days writing really
complicated code I can actually handle this in small amounts of space with
effectively no performance impact - but it is the sort of thing that will be
touchy and require additional testing). Anyway, I do want to gauge the
desire to move up past 999,999 atoms, and make the point that on something
like BG/L, it would actually require a lot more work to be able to run
multi-million atom problems (basically got to go back and look at all the
allocations, make them dense rather than sparse by doing all indexing
through lists, allow for adaptive minimal i/o buffers, etc., etc. - messy
stuff, some of it stemming from having to allocate lots of arrays
dimensioned by natom).
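
To picture that dense-versus-sparse point, here is a toy sketch in Python
with invented sizes (not pmemd code, just the shape of the trade-off):

    natom = 1_000_000
    owned_atoms = list(range(0, natom, 2048))  # pretend this task owns ~489 atoms

    # Sparse: trivially simple global indexing, but every task pays O(natom).
    sparse_charge = [0.0] * natom

    # Dense: O(atoms owned) memory, at the cost of an extra index map.
    local_index = {gbl: loc for loc, gbl in enumerate(owned_atoms)}
    dense_charge = [0.0] * len(owned_atoms)

    # All access to the dense array goes through the map:
    gbl_atom = owned_atoms[10]
    dense_charge[local_index[gbl_atom]] = 1.0
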
Best Regards - Bob




Professor Kenneth M. Merz, Jr.
Department of Chemistry
Quantum Theory Project
2328 New Physics Building
PO Box 118435
University of Florida
Gainesville, Florida 32611-8435

e-mail: merz.qtp.ufl.edu
http://www.qtp.ufl.edu/~merz

Phone: 352-392-6973
FAX: 352-392-8722
Cell: 814-360-0376








-- 
===================================================================
Carlos L. Simmerling, Ph.D.
Associate Professor                 Phone: (631) 632-1336 
Center for Structural Biology       Fax:   (631) 632-1555
CMM Bldg, Room G80
Stony Brook University              E-mail: carlos.simmerling.gmail.com
Stony Brook, NY 11794-5115          Web: http://comp.chem.sunysb.edu
=================================================================== 
Received on Wed Dec 05 2007 - 06:07:39 PST