Re: amber-developers: FW: benchmark of Amber 9 on shiraz

From: Robert Duke <rduke.email.unc.edu>
Date: Wed, 3 May 2006 14:15:31 -0400

Yong -
GigE == gigabit ethernet, right? If this scales well past 8 nodes, you are
doing well; the jump from 32 to 64 should be disastrous, and frankly I am
surprised you say it is "scaling well" at 32 CPUs. Okay, one caveat, which
is that you have dual-core, dual-CPU Opterons, so you have 4 processors per
node that don't have to go out on the gigabit stuff to intercommunicate (so
maybe GigE would work acceptably, by my standards, out to 16 CPUs under this
scenario). So here's the deal. As the system size increases, you can
better decompose the system, but you also have lots more data to fling
around on a slow interconnect. If you have a small system, it does not
decompose into independent chunks as well, but it is also significantly less
data to push over the GigE net. Personally, I am trying to get people to
move to InfiniBand where they can. I don't have nice neat numbers I can
quote, but I have done bunches of back-of-the-envelope calcs on
communications requirements vs. bandwidth for gigabit ethernet, to convince
myself that the scaling I see (or don't see) is reasonable. PME has a big
interconnect I/O requirement.
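
As a rough illustration of the kind of back-of-the-envelope calc described
above, here is a minimal Python sketch; the bandwidth, latency, per-atom
byte count, and FFT message count are all illustrative assumptions, not
measured PMEMD characteristics:

    # Toy model of per-step communication cost for a PME run over GigE.
    # Every constant here is an illustrative assumption, not a measurement.

    GIGE_BYTES_PER_SEC = 125e6   # ~1 Gbit/s, ignoring protocol overhead
    MPI_LATENCY = 50e-6          # assumed per-message latency, seconds

    def step_comm_time(n_atoms, n_procs, bytes_per_atom=48, fft_msgs=2):
        # Bandwidth term: each process exchanges coordinates and forces
        # (3 doubles each, 48 bytes/atom) for roughly its share of atoms.
        volume = (n_atoms / n_procs) * bytes_per_atom
        # Latency term: the 3-D FFT transposes in PME are all-to-all, so
        # the message count per process grows with the process count.
        n_msgs = fft_msgs * (n_procs - 1)
        return volume / GIGE_BYTES_PER_SEC + n_msgs * MPI_LATENCY

    for n_atoms in (27404, 238985):
        for n_procs in (4, 8, 16, 32, 64):
            t_ms = step_comm_time(n_atoms, n_procs) * 1e3
            print(f"{n_atoms:6d} atoms, {n_procs:2d} procs: "
                  f"~{t_ms:5.2f} ms/step in comm")

The absolute numbers are not to be trusted, but the shape is the point: the
bandwidth term shrinks as you add processes while the latency term grows, so
past some process count the interconnect, not the CPUs, sets the pace.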

Okay, I scrolled down and looked at the actual numbers. The small system
falls apart as you go past 4 CPUs. It is GB (generalized Born), not PME, and
not comparable to PME at all. For the ~27K vs. ~239K systems (both PME), you
start your comparisons at 4 CPUs, so you really don't know what baseline you
have for scaling (I typically baseline against 2 CPUs, not 1, because the MPI
and uniprocessor code is significantly different, and I have not designed MPI
pmemd to run on 1 processor; I did not think it worthwhile). So to me the
most surprising thing is that these two systems look sort of okay out to 32
procs, but once again, with gigabit ethernet, baselining against 4 CPUs is
misleading.

The other surprising thing is the jump in performance between 16 and 32
procs for the largest system, 239K atoms. Given the size of the problem and
the fact that you are using dual-core, dual-CPU nodes, probably with shared
memory cache at some level, this may well be a caching effect, where the
amount of memory needed by a 4-CPU node finally drops below a level where it
is possible to get a pretty good cache hit ratio. This memory stuff does
matter, and you see this sort of effect on large systems when you have
shared-component systems. Follow the bouncing bottleneck.

And as far as "hitting a wall" between 32 and 64 procs, well, as I say, I am
overjoyed it is not between 16 and 32, and ALL these systems hit a wall and
degrade dramatically at some point, when the bandwidth of some component is
exhausted and the delays in one location cascade everywhere else (a primary
tenet of system design is to try to get graceful degradation, but in general
any system will hit a point, under extreme load or when asked to do the
impossible, where things go completely to pieces; that is one reason we
benchmark - to show users what is practical, rather than having them
routinely try running gigabit ethernet with 128 procs and then carping
because the performance stinks).
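
For concreteness, a minimal Python sketch that turns the Master Total CPU
times quoted below into speedups and parallel efficiencies against the 4-CPU
baseline (with the caveat just made that such a baseline can flatter GigE
scaling):

    # Speedup and parallel efficiency from the PMEMD timings quoted below.
    times = {
        27404:  {4: 740.43, 8: 373.62, 16: 207.80, 32: 109.58, 64: 127.88},
        238985: {4: 5966.00, 8: 3029.44, 16: 1569.34, 32: 546.66, 64: 728.11},
    }

    for atoms, runs in times.items():
        base = min(runs)                       # 4-CPU baseline
        for procs in sorted(runs):
            speedup = runs[base] / runs[procs]
            efficiency = speedup / (procs / base)
            print(f"{atoms:6d} atoms, {procs:2d} CPUs: "
                  f"speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")

Both systems drop sharply at 64 CPUs, where the absolute runtimes actually
go up; the 239K-atom system shows better-than-linear efficiency at 32 CPUs,
consistent with the caching effect described above.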
Regards - Bob

----- Original Message -----
From: "Yong Duan" <duan.ucdavis.edu>
To: <amber-developers.scripps.edu>
Sent: Wednesday, May 03, 2006 1:43 PM
Subject: amber-developers: FW: benchmark of Amber 9 on shiraz


>
> Hi Guys,
>
> We are benchmarking a cluster of dual-core, dual-CPU Opterons (1.8 GHz)
> with GigE and noticed funny behavior in the scaling. PMEMD scales very
> well below the 32-CPU level, which is great. But as soon as we tried the
> 64-CPU level, the scaling became notably poor, regardless of system size.
> We initially thought this must be related to system size; we then tried
> 23,000-atom and 230,000-atom systems and noticed they behaved the same
> way. Any hint?
>
> yong
>
> -----Original Message-----
> From: choo woo [mailto:koolben3.yahoo.com]
> Sent: Wednesday, May 03, 2006 10:37 AM
> To: Yong Duan
> Subject: RE: benchmark of Amber 9 on shiraz
>
>
> I have no idea. When I get some time later, I may look into the details.
> Chun
>
> --- Yong Duan <duan.ucdavis.edu> wrote:
>
>>
>> Chun,
>>
>> Why is there a "barrier" at the 32/64-CPU level, regardless of system
>> size? The scaling looks pretty good at the 32-CPU level but drops
>> significantly at the 64-CPU level, regardless of the system size. In
>> other words, why do 8 nodes work better than 16 nodes?
>>
>> yong
>>
>> > -----Original Message-----
>> > From: choo woo [mailto:koolben3.yahoo.com]
>> > Sent: Wednesday, May 03, 2006 10:25 AM
>> > To: Lin, Dawei; Yong Duan; Lewis, Mike; benwu.ucdavis.edu
>> > Cc: duan_group.albert.genomecenter.ucdavis.edu
>> > Subject: benchmark of Amber 9 on shiraz
>> >
>> >
>> > Shiraz performs well!
>> >
>> > As for the small system, the simulation scales up to only 4 CPUs
>> > (16.8 ns per day for the ~800-atom system). As for the large and
>> > very large systems, it scales up to 32 CPUs (8 ns per day for the
>> > ~30,000-atom system; 1.6 ns per day for the ~240,000-atom system).
>> >
>> > Chun
>> >
>> >
>> > Amber 9
>> >
>> > 1.) small system:
>> > protein G
>> > 855 atoms, 56 residues, 10 ps
>> >
>> > GBSA simulation
>> > ifort+MKL
>> >
>> > ./2GB1.00/2GB1.00_0001.out  1 CPU
>> > | Runmd Time  175.90 (100.0% of Total)
>> >
>> > ./2GB1.01/2GB1.01_0001.out  4 CPUs
>> > | Runmd Time   51.41 (99.75% of Total)
>> >
>> > ./2GB1.02/2GB1.02_0001.out  8 CPUs
>> > | Runmd Time   46.22 (99.38% of Total)
>> >
>> > ./2GB1.03/2GB1.03_0001.out  16 CPUs
>> > | Runmd Time   58.57 (98.32% of Total)
>> >
>> >
>> > 2.) Large system
>> >
>> > 27404 atoms, ~120 residues + waters, 10 ps
>> >
>> > PMEMD, PathScale
>> >
>> > ./sh2c.01/sh2c.01_0001.out  4 CPUs
>> > | Master Total CPU time:  740.43 seconds  0.21 hours
>> >
>> > ./sh2c.02/sh2c.02_0001.out  8 CPUs
>> > | Master Total CPU time:  373.62 seconds  0.10 hours
>> >
>> > ./sh2c.03/sh2c.03_0001.out  16 CPUs
>> > | Master Total CPU time:  207.80 seconds  0.06 hours
>> >
>> > ./sh2c.04/sh2c.04_0001.out  32 CPUs
>> > | Master Total CPU time:  109.58 seconds  0.03 hours
>> >
>> > ./sh2c.05/sh2c.05_0001.out  64 CPUs
>> > | Master Total CPU time:  127.88 seconds  0.04 hours
>> >
>> > 3.) very large system
>> >
>> > 238985 atoms, 10 ps
>> >
>> > PMEMD, pathf90
>> >
>> > ./hist1.01/hist1.01_0001.out  4 CPUs
>> > | Master Total CPU time:  5966.00 seconds  1.66 hours
>> >
>> > ./hist1.02/hist1.02_0001.out  8 CPUs
>> > | Master Total CPU time:  3029.44 seconds  0.84 hours
>> >
>> > ./hist1.03/hist1.03_0001.out  16 CPUs
>> > | Master Total CPU time:  1569.34 seconds  0.44 hours
>> >
>> > ./hist1.04/hist1.04_0001.out  32 CPUs
>> > | Master Total CPU time:  546.66 seconds  0.15 hours
>> >
>> > ./hist1.05/hist1.05_0001.out  64 CPUs
>> > | Master Total CPU time:  728.11 seconds  0.20 hours
>
>
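
As a sanity check on the ns/day figures in Chun's summary, here is a minimal
sketch converting the quoted 10 ps runtimes into throughput (assuming the
quoted times cover the full 10 ps of simulation):

    # ns/day from the 10 ps benchmark runtimes quoted above.
    SIM_NS = 0.010             # each benchmark simulated 10 ps = 0.01 ns
    SECONDS_PER_DAY = 86400.0

    def ns_per_day(runtime_seconds):
        return SIM_NS * SECONDS_PER_DAY / runtime_seconds

    print(f"   855 atoms,  4 CPUs: {ns_per_day(51.41):5.2f} ns/day")   # ~16.8
    print(f" 27404 atoms, 32 CPUs: {ns_per_day(109.58):5.2f} ns/day")  # ~7.9
    print(f"238985 atoms, 32 CPUs: {ns_per_day(546.66):5.2f} ns/day")  # ~1.6

These reproduce the 16.8, ~8, and 1.6 ns/day figures quoted in the summary.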
Received on Thu May 04 2006 - 17:10:38 PDT