Re: amber-developers: Fw: How many atoms? from Robert Duke on 2007-12-05 (Amber Developers Archive Dec 2007)

From: Robert Duke <rduke.email.unc.edu>
Date: Wed, 5 Dec 2007 08:39:56 -0500

Hi Ross,
Well, that is a lot of processors; all the eggs in two baskets, eh? Okay, we'll plan on the minimum to be able to run 1-10M (maybe more), may the user beware. Would whoever is responsible for xleap/gleap (sorry, but I have been bad at keeping track of that end of the amber wilderness) please let me know what is currently supported or what you will support. Ross, do you have bandwidth to hack the capability into sander? While on the one hand I think the amoeba inpcrd format is overkill, it does have the virtue of solving all future potential problems, and as I said before, we can currently read this stuff in an amoeba context. Thanks to all for input again; thanks to Ross for actually having a clue about what the funding agencies and supercomputer centers are doing (on the one hand I like tracking the technology, but on the other hand I am not fond of the politics). Carlos, Adrian, all you guys with that really big BG/L out in NY state somewhere, just bear in mind that multi-million atom simulations on that machine may not be real smooth ;-) (but hopefully the work I am doing now will have you in a good position to utilize the beast for reasonable-sized systems).
Best Regards - Bob
  ----- Original Message -----
  From: Ross Walker
  To: amber-developers.scripps.edu
  Sent: Wednesday, December 05, 2007 12:51 AM
  Subject: RE: amber-developers: Fw: How many atoms?

  Hi Bob,

  The key thing to remember here is that Blue Gene/L is old technology and will largely be defunct in the timeframe of the Amber10 lifespan. Of all the large scale machines that exist Blue Gene should be the very last one that we target. The main advantage of Blue Gene right now is it provides you easy access to a large number of processors to allow for testing / debugging. However, I would not envisage anyone asking for time on Blue Gene systems to do serious MD simulations with AMBER.

  Instead, the two most relevant large scale machines for US academics in the 2008 to 2010 timeframe will be Ranger at TACC and the Cray machine at ORNL. Since ORNL has not announced what their architecture is actually going to consist of, the only known metric is Ranger. This will have 62,976 cores and you can expect a large proportion of them to be idle at any one time, at least in the first year of operation. Hence the landscape is changing rapidly. This machine, I believe, will provide more SUs than the sum of all previously allocated SUs in the history of NSF supercomputing. Hence this should be the metric by which we measure things. This, coupled with the ORNL machine, will provide so much computing time that almost every US academic who wishes to apply for time will be able to get more SUs than they could hope to obtain by building their own in-house cluster.

  Ranger will have 2GB of memory per core and 16-way nodes, for 32GB of memory per node. So given that the memory limit will be 2GB per MPI task in the worst case and 32GB in the best case (if you run 1 MPI task per node, or just do 1 asynchronous I/O operation per node instead of per MPI task), what are the limitations based on this? Note that this is 64 times more memory per node than Blue Gene. Without any special modifications to code, arrays, etc., what is the maximum number of atoms within this architecture? I suspect it is significantly more than what fits in the paltry 256MB offered by Blue Gene.
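
  The memory arithmetic above can be sketched as follows. This is only an illustration: it assumes a node's memory is shared evenly between the MPI tasks placed on it and ignores what the operating system and MPI library take for themselves. The function name is made up for the example.

```python
MB_PER_GB = 1024

def memory_per_task_mb(node_memory_gb, tasks_per_node):
    """Memory available to each MPI task if a node's memory is split evenly.

    Assumption for illustration: no OS or MPI library overhead is subtracted.
    """
    return node_memory_gb * MB_PER_GB / tasks_per_node

# 16 cores x 2GB per core = 32GB per node, the figures quoted in the message.
node_gb = 16 * 2

for tasks in (16, 8, 4, 1):
    per_task = memory_per_task_mb(node_gb, tasks)
    print(f"{tasks:2d} MPI tasks per node: {per_task:8.0f} MB per task")
```

  Running fewer tasks per node trades idle cores for memory per task, which is the "best case" end of the range described above.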

  Bear in mind these nodes will have swap as well, so they will fail significantly more gracefully than Blue Gene does.

  This is the architecture we need to be aiming at in order to have the maximum impact on the maximum number of users at large scale.

  On the longer time scale, for Amber 11, we should be aiming at the IBM Power 7 Percs system that will be built at NCSA, but this will ultimately need a much greater effort involving overhauling the entire MD workflow. Let's hope we get the Peta Apps grant so we can make a real impact here.

  All the best
  Ross
  /\
  \/
  |\oss Walker

  | HPC Consultant and Staff Scientist |
  | San Diego Supercomputer Center |
  | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
  | http://www.rosswalker.co.uk | PGP Key available on request |

  Note: Electronic Mail is not secure, has no guarantee of delivery, may not be read every day, and should not be used for urgent or sensitive issues.

----------------------------------------------------------------------------
    From: owner-amber-developers.scripps.edu [mailto:owner-amber-developers.scripps.edu] On Behalf Of Robert Duke
    Sent: Tuesday, December 04, 2007 20:07
    To: amber-developers.scripps.edu
    Subject: Re: amber-developers: Fw: How many atoms?

    Hi Ross et al :-)
    Thanks to all who made comments. Ross pretty much understands where I am coming from here I think (Ross, thanks for the current rundown on nsf machine futures too; I probably have more indigestion over BG/L than multicore, but I am indeed moderately ill that all these unbalanced architectures are being foisted on us). Anyway, my 'expectations' regarding memory problems have been set by a couple of recent events: 1) getting whacked by memory limitations on BG/L for cellulose out around 2048 processors (if my memory serves...), and 2) the nature of the work I have recently been doing with i/o and really large scaling. All along, I have been bothered by the potential of all sorts of data structures dimensioned by natom to push us over the edge on memory, and the more sophisticated the code gets, the more combinations of maps and lists I use to make things fast (so that is 2 * natom every time I do that, or 1.[0-9] * natom if I get a bit more clever for some things). The map structures tend to not scale down with increasing processor count, so that has been a potential issue. The thing that really had me pulling my hair out was expanding async i/o buffer space requirements though. The larger the count of async i/o's you "post" for later completion (so you can go do other things), the more buffer space you need, and in some instances the amount of buffer space per communication event does not scale down as well as you might like as the processor count goes up. So at 2048 procs on BG/L running cellulose, this is what actually bites you. I think I may have gotten around the worst memory problem in the new scaling architecture today with minimal performance hit; I'll see over the next week or so. But running something big on BG/L would definitely require some careful work that I may not have time to complete.
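
    A back-of-the-envelope sketch of why natom-dimensioned structures matter: if a map plus a list costs 2 * natom elements on every task, and that cost does not shrink as processors are added, the per-task footprint grows linearly with system size. The 4-byte element size and the function name are assumptions for illustration, not figures from pmemd.

```python
def replicated_bytes(natom, arrays_per_atom, bytes_per_element=4):
    """Bytes held by EVERY task for arrays dimensioned by natom.

    Assumptions for illustration: each array has one element per atom and
    each element is a 4-byte integer unless stated otherwise.
    """
    return natom * arrays_per_atom * bytes_per_element

# One map + one list per use, i.e. the "2 * natom" case from the message.
for natom in (1_000_000, 10_000_000, 100_000_000):
    mb = replicated_bytes(natom, arrays_per_atom=2) / 1024**2
    print(f"natom = {natom:>11,}: {mb:8.1f} MB per task for one map/list pair")
```

    Each additional map/list pair adds the same amount again, independent of the processor count, which is why a 256MB budget is exhausted quickly at multi-million atom sizes.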

    Okay, so it sounds like people would like 1M+ atoms, nuts on BG/L implications, so we should head in that direction. The nasty downside is that for any memory-limited architecture, we may be setting ourselves up for some runtime failures where folks won't understand the failure (the code actually does produce a nice error msg for any allocation failure, but that will show up in the system stderr rather than mdout, and could get missed, and could happen in mid-run as loadbalancing causes changes in memory allocation). So we should discuss how we want to specify the new format inpcrd. Does leap already handle Darden's amoeba inpcrd format? Do folks want something simpler? The advantage to the amoeba format is that both pmemd and sander can already read it; they both just need to know to try for both amoeba and non-amoeba runs. Then they also need to be able to recognize that they are running >999,999 atoms and write the restrt in the new format. What is the status of xleap/gleap in terms of Darden's inpcrd format? Would it be easy to add the capability to output the new format inpcrd for all systems generated by xleap/gleap? I don't want to divert to work on this stuff in pmemd immediately, but if folks want to reach a consensus on sander and xleap/gleap, then I can wedge the capability into pmemd in a little while. Realistically speaking, I think if we expand to 100M - 1 capability, we should be covered for the foreseeable future, and that is what we have with the current 'new' prmtop; of course the new prmtop and new inpcrd actually allow going even higher by specifying a different format than i8. The current hard architectural limit is around 134M, caused by the size of the image identifier (27 bits; the high bits are reserved for other info in the pairlist - also fixable). Of course you had better really have a 64 bit machine and a bit more than 4 GB/core to handle this sort of stuff...
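
    The limits quoted above follow from simple field-width arithmetic, sketched here. It is assumed (the thread implies it but does not state it) that the 999,999 ceiling comes from a six-digit atom count field; the i8 width and the 27-bit image identifier are taken from the message. The helper names are made up for the example.

```python
def max_decimal_field(width):
    """Largest non-negative integer that fits in a fixed-width decimal field."""
    return 10**width - 1

def distinct_ids(bits):
    """Number of distinct values an unsigned bit field of this size can hold."""
    return 2**bits

six_digit_limit = max_decimal_field(6)   # assumed origin of the 999,999 ceiling
i8_limit = max_decimal_field(8)          # "100M - 1" with an i8 format
image_id_limit = distinct_ids(27)        # "around 134M" from a 27-bit identifier

print(f"six-digit field : {six_digit_limit:>11,}")
print(f"i8 field        : {i8_limit:>11,}")
print(f"27-bit image id : {image_id_limit:>11,}")
```

    The i8 file format limit sits below the 27-bit pairlist limit, so with these numbers the file format would remain the first ceiling reached.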

    Regards - Bob
      ----- Original Message -----
      From: Ross Walker
      To: amber-developers.scripps.edu
      Sent: Tuesday, December 04, 2007 10:14 PM
      Subject: RE: amber-developers: Fw: How many atoms?

      My understanding from Bob's email, and Bob can correct me if I am wrong here, is that it is a memory consideration. I.e. large systems could use significant amounts of memory and it is the work in keeping the memory footprint small that is complicated and time consuming.

      However, from what I can glean Bob may have expectations for memory that are somewhat lower than what will actually be deployed, based on experience with Blue Gene. My assertion would be that we try to support > 999,999 atoms but in the short term not worry about the memory requirements of such calculations. In this way the limiting factor becomes the available memory per node and not the underlying file formats. Since Blue Gene is the exception rather than the rule in HPC systems I think the problem will be much less than Bob is anticipating. It seems crazy to focus effort on optimizing for the lowest common denominator especially when 99% of available SUs on NSF allocated resources will shortly be non-blue gene type architectures.

      I am of course neglecting the myriad of complexities involved in terms of performance as a function of memory usage etc., but at least for Amber 10 it would seem to make sense to aim at the types of machines that will be generally available to NSF researchers over the next two years, and all of these will have between 1 and 2GB per core (4GB+ per core if you leave cores idle on various nodes) and enough processors to make even Bob run away screaming that the apocalypse is coming.

      Just my 2c.

      All the best
      Ross

------------------------------------------------------------------------
        From: owner-amber-developers.scripps.edu [mailto:owner-amber-developers.scripps.edu] On Behalf Of Carlos Simmerling
        Sent: Tuesday, December 04, 2007 18:10
        To: amber-developers.scripps.edu
        Subject: Re: amber-developers: Fw: How many atoms?

        it sounded like Bob thinks that there IS a cost to doing this.
        My feeling is that if there was no cost, go for it, but if it takes
        away Bob's precious time that he could be using to get this
        stuff up and working for smaller systems, then we should let
        him focus on the sizes that people actually run rather than having
        delays or overall slower code just to support things that none of us
        actually simulate. Sure, it could be great PR, and yes, maybe
        focusing on smaller systems isn't visionary enough, but I think
        there is a lot to be gained by getting better code for more modest
        systems that still have biological relevance, rather than us wasting
        Bob's time on code that none of us need (yet).
        carlos

        On Dec 4, 2007 8:46 PM, Ken Merz <merz.qtp.ufl.edu> wrote:

          Hi,
           If it costs us nothing then why not scale PMEMD beyond 999,999 atoms? Someone out there might want to do a 1MM+ atom simulation with the AMBER program suite! Kennie

          On 4 Dec 2007, at 2:14 PM, Robert Duke wrote:

            Hello folks!
            I am working hard on high-scaling pmemd code, and in the course of the work it became clear to me, due to large async i/o buffer and other issues, that going to very high atom counts may require a bunch of extra work, especially on certain platforms (BG/L in particular...). I posed the question below to Dave Case; he suggested I bounce it off the list, so here it is. The crux of the matter is how people feel about having an MD capability in pmemd for systems bigger than 999,999 atoms in the next release. Please respond to the dev list if you have strong feelings in either direction.
            Thanks much! - Bob

            ----- Original Message ----- From: "Robert Duke" <rduke.email.unc.edu>
            To: "David A. Case" <case.scripps.edu>
            Sent: Tuesday, December 04, 2007 8:45 AM
            Subject: How many atoms?

              Hi Dave,
              Just thought I would pulse you about how strong the desire is to go above 1,000,000 atom systems in the next release. I personally see this as more an advertising issue than real science; it's hard to get good statistics/good science on 100,000 atoms let alone 10,000,000 atoms. However, we do have competition. So the prmtop is not an issue, but the inpcrd format is, and one thing that could be done is to move to supporting the same type of flexible format in the inpcrd as we do in the new-style prmtop. Tom D. has an inpcrd format in amoeba that would probably do the trick; I can easily read this in pmemd but not yet write it (I actually have pulled the code out - left it in the amoeba version of course, but can put it back in as needed). I ask the question now because I am hitting size issues already on BG/L on something like cellulose. Some of this I can fix; some of it really is more appropriately fixed by running on 64 bit memory systems where there actually is a multi-GB physical memory. The problem is particularly bad with some new code I am developing, due to extensive async i/o and requirements for buffers that at least theoretically could be pretty big (up to natom possible; by spending a couple of days writing really complicated code I can actually handle this in small amounts of space with effectively no performance impact - but it is the sort of thing that will be touchy and require additional testing). Anyway, I do want to gauge the desire to move up past 999,999 atoms, and make the point that on something like BG/L, it would actually require a lot more work to be able to run multi-million atom problems (basically got to go back and look at all the allocations, make them dense rather than sparse by doing all indexing through lists, allow for adaptive minimal i/o buffers, etc. etc. - messy stuff, some of it sourcing from having to allocate lots of arrays dimensioned by natom).
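
              The "dense rather than sparse" idea can be illustrated generically. The sketch below is not pmemd code; it assumes a task owns only a small subset of atoms, and compares an array with one slot per atom in the whole system against packed storage reached through an index list. All names are hypothetical.

```python
from bisect import bisect_left

def make_dense(owned_ids, values):
    """Pack per-atom values for the atoms this task owns.

    Returns (index, data), where index[i] is the global atom id whose
    value is stored in data[i]; index is kept sorted for binary search.
    """
    pairs = sorted(zip(owned_ids, values))
    index = [atom_id for atom_id, _ in pairs]
    data = [value for _, value in pairs]
    return index, data

def dense_get(index, data, global_id):
    """Look up a global atom id through the index list."""
    i = bisect_left(index, global_id)
    if i == len(index) or index[i] != global_id:
        raise KeyError(global_id)
    return data[i]

natom = 1_000_000
owned = [7, 500_000, 999_999]      # atoms this task happens to own
charges = [0.1, -0.2, 0.3]

# Sparse layout: one slot per atom in the system, almost all unused.
sparse = [0.0] * natom
for atom_id, q in zip(owned, charges):
    sparse[atom_id] = q

# Dense layout: storage proportional to the number of owned atoms.
index, data = make_dense(owned, charges)
print(len(sparse), len(index) + len(data))
```

              The dense layout pays for an extra lookup on every access, which is part of why this kind of rework is described above as messy and in need of extra testing.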
              Best Regards - Bob

          Professor Kenneth M. Merz, Jr.
          Department of Chemistry
          Quantum Theory Project
          2328 New Physics Building
          PO Box 118435
          University of Florida
          Gainesville, Florida 32611-8435

          e-mail: merz.qtp.ufl.edu
          http://www.qtp.ufl.edu/~merz

          Phone: 352-392-6973
          FAX: 352-392-8722
          Cell: 814-360-0376

        --
        ===================================================================
        Carlos L. Simmerling, Ph.D.
        Associate Professor Phone: (631) 632-1336
        Center for Structural Biology Fax: (631) 632-1555
        CMM Bldg, Room G80
        Stony Brook University E-mail: carlos.simmerling.gmail.com
        Stony Brook, NY 11794-5115 Web: http://comp.chem.sunysb.edu
        ===================================================================
Received on Sun Dec 09 2007 - 06:07:07 PST