amber-developers: Paper on improving Gromacs scaling on ethernet. from Ross Walker on 2007-04-26 (Amber Developers Archive Apr 2007)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 26 Apr 2007 13:54:32 -0700

Hi All,

You might be interested in the following paper on improving the scaling of
Gromacs on ethernet beyond 2 nodes, which is the limit if you use the
defaults:

http://www3.interscience.wiley.com/cgi-bin/abstract/114205207/ABSTRACT

We are seeing the same behaviour with Amber these days - i.e. as soon as you
try to go beyond 2x2cpu nodes with gigabit ethernet the performance just
dies. This paper has a number of suggestions, and highlights in particular
how the default settings of modern switches are not appropriate for this
kind of traffic. With hindsight :-), a lot of this is obvious on reading it.
Most switches these days come with QoS and flow control settings optimized
for a bunch of people in an office browsing the web, listening to streaming
content and doing Windows-based file sharing. This plays havoc with MPI
messages, where all-to-all communications get blasted by packet loss.
Essentially the main tips are:

1) Turn on IEEE 802.3x flow control on the switch and network cards -
assuming you bought a decent ethernet switch that supports this.

2) Set the switch to use QOS_PASSTHROUGH_MODE - essentially turning off QoS
so you can recoup the memory it uses as general buffer space.

3) On 48-port switches only use 36 ports, in the form of 9 per 12-port
block. It seems that most modern switches are constructed out of blocks of
12-port subswitches and that the links between subswitches are only
10Gbit/s - this limits you to 9 ports per 12-port block.

4) Use Open MPI or MPICH2 - or alternatively implement ordered all-to-all
communication approaches - this would apply to anything involving all-to-all
communication like mpi_allreduce etc... Mostly I don't think we use the
all-to-all collectives specifically, at least not for large data sizes, but
from their conclusions it would appear that you only benefit from ordered
all-to-alls or the MPICH2/Open MPI ordered schemes if you implement option 3
to ensure there is no packet loss within switches.
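
To make "ordered all-to-all" in tip 4 a bit more concrete, here is a minimal
sketch of one common ordering, a ring-shift schedule. This is an illustration
only: it is not taken from the paper, and it is not claimed to be what MPICH2
or Open MPI do internally. The function name ordered_alltoall_schedule is
hypothetical. The idea is that in each round every rank sends to exactly one
partner and receives from exactly one partner, so no switch port is hit by
several senders at once.

```python
def ordered_alltoall_schedule(num_ranks):
    """Return a list of rounds; each round is a list of (sender, receiver).

    Hypothetical helper for illustration. In round `shift`
    (1 <= shift < num_ranks), rank r sends to (r + shift) % num_ranks, so
    every rank sends exactly one message and receives exactly one message
    per round.
    """
    rounds = []
    for shift in range(1, num_ranks):
        rounds.append([(r, (r + shift) % num_ranks) for r in range(num_ranks)])
    return rounds


if __name__ == "__main__":
    # Print the schedule for a 4-rank job.
    for i, pairs in enumerate(ordered_alltoall_schedule(4), start=1):
        print(f"round {i}: {pairs}")
```

With N ranks this takes N-1 rounds and covers every ordered pair of distinct
ranks once, in contrast to an unordered exchange where all ranks may target
the same destination at the same moment.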

Aside from this, if you start chaining switches together and get assigned
processors on different physical switches then all bets are off.
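
The port arithmetic behind tip 3 can be sketched as follows. The 10Gbit/s
link and the 9-of-12 rule come from the text above; the 1Gbit/s port speed is
assumed from the gigabit ethernet context, and the worst case modelled here
(every active port in a block sending across the inter-subswitch link at
line rate) is my own simplifying assumption. The function names are
hypothetical.

```python
PORT_SPEED_GBITS = 1.0   # assumed: gigabit ethernet ports
UPLINK_GBITS = 10.0      # inter-subswitch link speed quoted in tip 3


def block_load_ratio(ports_in_use, port_speed=PORT_SPEED_GBITS,
                     uplink=UPLINK_GBITS):
    """Worst-case offered load on a block's uplink divided by its capacity.

    Assumes every active port sends at line rate to another block.
    A ratio above 1.0 means the link is oversubscribed.
    """
    return ports_in_use * port_speed / uplink


def usable_ports(total_ports=48, block_size=12, ports_per_block=9):
    """Number of ports used when only ports_per_block are cabled per block."""
    return (total_ports // block_size) * ports_per_block


if __name__ == "__main__":
    print("12 ports per block:", block_load_ratio(12))
    print(" 9 ports per block:", block_load_ratio(9))
    print("usable ports on a 48-port switch:", usable_ports())
```

Under these assumptions a fully populated block can offer more traffic than
the link carries, while 9 ports stay below it. Why the limit is 9 rather than
10 is not explained here. The same reasoning suggests why chained switches
are risky: the link between two switches is shared by every pair of
processors that ends up on opposite sides of it.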

Unfortunately (or maybe fortunately depending on one's perspective) I don't
have physical access to any ethernet based clusters anymore so I can't test
these recommendations out with AMBER but if anyone has access to such a
cluster and is happy playing with their switch configuration they may want
to experiment with the above and post feedback to the developers list. If
this really helps then perhaps we should consider putting a short
tutorial/overview on the Amber website to benefit others.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
Received on Sun Apr 29 2007 - 06:07:25 PDT