amber-developers: Some testing results on pmemd scaling and parallel computation

From: Mengjuei Hsieh <mengjueh.uci.edu>
Date: Mon, 29 Sep 2008 18:27:37 -0700

Here is a recap of the JAC benchmark performance with the different
parallel options I tested this weekend.

We were exploring network connection options using jumbo-frame
(also known as large MTU; mtu=9000 in Linux) gigabit ethernet on the
local network, to see if we could replace our previous parallel
computing solution of connecting two machines directly with an
ethernet cable (we called it sub-pairs to reflect the fact that, by
doing so, the machines are grouped in pairs). The reason is obvious:
grouping computing nodes in pairs is not an efficient way to work
with the nodes, nor to manage them.
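
As an aside, here is a minimal sketch of how one could verify that an
interface really came up with jumbo frames; "eth0" is only an assumed
interface name, not necessarily what these nodes actually use.

    # check_mtu.py - confirm the NIC is running with MTU 9000
    def mtu(interface="eth0"):
        # Linux exposes the current MTU under /sys/class/net/<if>/mtu
        with open("/sys/class/net/%s/mtu" % interface) as f:
            return int(f.read())

    if __name__ == "__main__":
        # expect 9000 for jumbo frames, 1500 otherwise
        print("eth0 MTU:", mtu())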

We tested with the NetPipe benchmark to measure the performance of
gigabit ethernet with and without jumbo frames; the results are
consistent with general wisdom and with references on the internet
and in the literature. I thought we could utilize more bandwidth with
jumbo-frame ethernet.
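
For illustration, here is a minimal ping-pong bandwidth sketch in
Python using mpi4py (an assumption on my part; the actual
measurements were done with NetPipe, which sweeps many message sizes
and also reports latency):

    # pingpong.py - rough point-to-point bandwidth between two MPI ranks
    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    MSG_BYTES = 1 << 20      # 1 MiB messages
    REPS = 100               # number of round trips
    buf = bytearray(MSG_BYTES)

    comm.Barrier()
    t0 = time.time()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = time.time() - t0

    if rank == 0:
        # each round trip moves MSG_BYTES in both directions
        mbits = 2.0 * REPS * MSG_BYTES * 8 / elapsed / 1e6
        print("approx. bandwidth: %.1f Mbit/s" % mbits)

Launching it with two ranks across the two hosts would give a rough
number, but again, this is only to illustrate what is being measured.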

First, I tested the scaling of AMBER 9 pmemd with LAM/MPI or MPICH on
jumbo-frame ethernet. The configuration of the testing environment
looks like this:

Two identical Dell PowerEdge 1950 machines, each with two Intel Xeon
5140 Woodcrest dual-core processors, 4 MB cache, and 2 GB RAM.
Shared-memory interconnect / MPICH-1.2.6 / LAM-MPI 7.1.4 / Intel
Fortran 90 compiler / Intel MKL

The results of the parallel performance are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs nsec/day scaling, %

  1 0.329 --
  2 0.628 95 (SMP)
  4 1.094 83 (SMP)
  4 0.965 73 (TCP, 1+1+1+1)
  4 0.819 62 (SMP/TCP, 2+2)
  8 0.987 37 (SMP/TCP, 4+4)
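
For reference, the "scaling, %" column here and in the tables below
is the parallel efficiency: nsec/day on N processes divided by N
times the single-process nsec/day. A quick Python sketch of the
arithmetic, using the numbers from the table above:

    def scaling_percent(nsd_n, nprocs, nsd_1):
        # throughput on N procs relative to a perfect N-fold speedup
        return 100.0 * nsd_n / (nprocs * nsd_1)

    print("%.1f" % scaling_percent(0.628, 2, 0.329))  # ~95.4 -> reported as 95
    print("%.1f" % scaling_percent(1.094, 4, 0.329))  # ~83.1 -> reported as 83
    print("%.1f" % scaling_percent(0.987, 8, 0.329))  # ~37.5 -> reported as 37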

This does not meet the definition of "scaling", so the network
traffic was also measured, and I found that during the network
communication only about 30% of the bandwidth was used. As a side
note, these runs were done with at least P4_SOCKBUFSIZE=131072
(MPICH) and net.core.rmem_max=131072, net.core.wmem_max=131072;
similar results were observed under LAM-MPI with
rpi_tcp_short=131072.
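
In case anyone wants to reproduce this, here is a small sketch
(again, just a convenience check, assuming a Linux /proc filesystem)
to verify that the kernel socket-buffer limits and the MPICH
environment variable are at least at those values:

    # check_sockbuf.py - sanity-check the tuning parameters above
    import os

    WANT = 131072

    for knob in ("rmem_max", "wmem_max"):
        value = int(open("/proc/sys/net/core/" + knob).read())
        status = "ok" if value >= WANT else "too small"
        print("net.core.%s = %d (%s)" % (knob, value, status))

    sockbuf = os.environ.get("P4_SOCKBUFSIZE")
    print("P4_SOCKBUFSIZE =", sockbuf if sockbuf else "(not set)")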

Further tests on directly connected pairs show similar measurements.

Therefore I fell back to benchmarking AMBER 8 pmemd, which is the
original program we had been running in the sub-pair configuration.

The results of the parallel performance with AMBER 8 pmemd are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs nsec/day scaling, %

  1 0.203 --
  2 0.391 96 (SMP)
  4 0.465 57 (SMP)
  4 0.457 56 (SMP/TCP, 2+2)
  8 0.680 42 (SMP/TCP, 4+4)

The less efficient AMBER 8 pmemd makes the scaling factor of the
4+4-CPU parallel computation look better, but the absolute
performance is definitely not better. Similar results were observed
on directly connected pairs.

The interest of this exploration then turned to the scaling of
AMBER 10 pmemd, and the results are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs nsec/day scaling, %

  1 0.411 --
  4 1.329 80 (SMP)
  8 1.137 35 (SMP/TCP, 4+4)

At this point, all I can say is: don't expect anything too
interesting from gigabit ethernet performance. This conclusion is
consistent with observations from Dr. Duke and Dr. Walker.

A further benchmark was done for AMBER 10 pmemd on a dual quad-core
Intel Xeon E5410 machine (Dell PE1950, 2.3 GHz, 6 MB cache, 2 GB
RAM):
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms (on the same machine, SMP mode)

#procs nsec/day scaling, %

  1 0.434 --
  2 0.815 94
  4 1.464 84
  6 1.964 75
  8 2.274 65

That's all. AMBER 10 pmemd rocks.

Bests,
--
Mengjuei