Installation Notes for PMEMD 9

I.  Installation of PMEMD on Supported Platforms

Installation of PMEMD is fairly simple for a number of common platforms:

1) Make amber9/src/pmemd your current working directory.

2) Enter the command './configure -help' to get a list of supported platforms,
   Fortran 90 compilers, parallel implementations (or nopar to build
   uniprocessor code), fft implementations, and trajectory file format
   implementations.

3) Presuming a supported combination describes your system, you can now make
   the config.h header for the PMEMD build by entering:
   './configure <platform> <compiler> <parallel option> [fft option]
   [traj option]' (e.g., './configure linux_p4 ifort mpich fftw bintraj').
   For generic parallel implementations (everything but mpi or nopar), you will
   be queried for the location where the parallel implementation is installed.
   The fft and traj options have defaults; all other arguments must be
   provided, and must be in the order listed.

4) Now simply enter the command 'make install'.  If all goes well, a pmemd
   executable will be deposited under amber9/exe.
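
For example, on a Linux Pentium 4 cluster with ifort and mpich (the same
supported combination used in the example above), the whole sequence might
look like this:

   cd amber9/src/pmemd
   ./configure linux_p4 ifort mpich fftw bintraj
   make install
   ls ../../exe/pmemd      # the new executable should appear under amber9/exe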

Additional Details on Some Platforms and Options

Intel Math Kernel Library Support

If you are using the Intel ifort compiler, you will be asked if you want to
use the Intel Math Kernel Library (MKL).  If you answer "yes", you will be
prompted for the location unless the MKL_HOME environment variable has been
set.  If MKL_HOME is set, it is assumed you want to use MKL.  The MKL is
primarily helpful for Generalized Born simulations on Intel architecture 
platforms (*athlon, *p4, *opteron, *em64t, and sgi_altix (our only supplied
ia64 platform at present)).
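
For example, with a bash-style shell you can point configure at an existing
MKL installation before running it, in which case you will not be prompted
(the path below is purely illustrative; use wherever MKL actually lives on
your system):

   export MKL_HOME=/opt/intel/mkl
   ./configure linux_p4 ifort mpich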

FFT Options

The default FFT implementation, PUBFFT, is based on fftpack code from netlib,
originally written by Swarztrauber, first ported to Amber by Tom Darden, and
since converted to F90 by Bob Duke.  This version has fairly reasonable
performance, and supports FFTs with prime factors of 2, 3, and 5.  In many
cases there is little advantage to selecting anything else.  However, if you
are running a machine that supports Intel SSE2 instructions, you will get
slightly better performance by using FFTW 3.0 or FFTW 3.1, and you will also
likely get a more optimal grid size because FFTW also supports a prime factor
of 7.  The only vendor FFT currently supported is SGI's, and we support both
the newest interface and the interface that was used in earlier versions
of Amber.  A relatively clean interface for adding new FFT implementations has
been created in fft1d.fpp.  You would of course have to create a new defined
constant.

NetCDF Support

For both pmemd and sander, trajectory and velocity files may now be written
using the portable binary NetCDF format.  The files created are about half the
size of their formatted equivalents, and the overhead associated with writing
NetCDF files is low.  A NetCDF source tree is provided as part of Amber; you
may also just use an existing NetCDF installation, which is a fairly common
thing at large computer installations.  At high processor count there are
significant performance advantages associated with using NetCDF files, on the
order of 10%, depending of course on various system and simulation problem
specifics.  In order to be able to write NetCDF files, you must compile pmemd
with the bintraj option, and then specify ioutfm=1 in the mdin &cntrl namelist.
The output trajectories may be analyzed or converted to formatted output for
use in other programs with Amber 9 ptraj.  VMD 1.8.3 also has support for the
NetCDF format.
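
As a minimal illustration, the only setting relevant here is ioutfm; the
other variables in the fragment below are ordinary run-control settings,
shown only to make the namelist complete:

   &cntrl
     ntx = 1, irest = 0,
     nstlim = 10000, dt = 0.0015,
     ntwx = 250,
     ioutfm = 1,
   /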

IBM AIX SP5 Platforms

We strongly recommend building a 32-bit pmemd executable for AIX SP5 platforms,
as the performance is better.

IBM AIX SP3/SP4/SP5 Platforms

For the IBM AIX platforms, it is possible to use the MASSVP libraries if one
wishes.  In pmemd 8, this improved performance.  In pmemd 9, different
algorithms are used, and for PME it makes little difference whether the MASSVP
libraries are linked.  For generalized Born simulations, however,
linking in the MASSVP libraries should improve performance.  For all IBM AIX
platforms we specify the USE_MPI_MODULE define to cause mpi module files to be
used.  If this is not done, some build combinations may not correctly call MPI
functions, and pmemd will crash.  This is only required for IBM AIX platforms;
all other builds work fine if MPI is treated as an F77 library.

OSF1 AlphaServer / Quadrics Interconnect Platforms

We routinely run on lemieux.psc.edu, which is an OSF1 AlphaServer with a
Quadrics interconnect.  Typically, better performance may be obtained by using
both communications "rails" of the interconnect, which is done by specifying
some job submission options (please see PSC documentation).  Using two rails
will significantly improve performance at 64 or more processors.  In the
timeframe of the pmemd 8 release, there was a problem with the interconnect
system software that made using 2 rails with pmemd unreliable (pmemd could
crash). This problem has been fixed.

Autoconfiguration issues no longer exist.

In the past, the configuration script would autodetect things like the
cpu type for alpha servers, sgi workstations/servers, etc.  This is no longer
done because sometimes pmemd is built on a frontend platform that is different
in significant ways from the intended target.  Thus we now query the user
regarding cpu type for the IBM AIX, Alpha Server, and SGI MIPS platforms.

The Build Process.  A New Dependencies Scheme.

A simpler build process has been introduced for pmemd 9.  The original source
is provided in *.fpp files, which are preprocessed into *.f90 files.  Dependency
management is handled by a simple perl script that analyzes both module and
include (*.i) file dependencies.  There is a requirement that there be only
one module in a source file, and that the source file name and module name
be related such that if the source file name is foo.fpp, the module name will
be foo_mod.  Dependencies are generated by entering 'make depends' in the
directory amber9/src/pmemd.  This should only be necessary if you modify the
code.
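
As an illustration of the naming convention, a hypothetical source file
foo.fpp would be expected to contain exactly one module, named foo_mod:

   module foo_mod
     implicit none
     ! module data and routines for foo go here
   end module foo_mod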

Pathscale Compiler Linkage Issues.

The Pathscale F90 compiler produces very high-quality, fast code, but linkage
can be complicated because different linkage options are chosen at different
sites.  The linkage options you use will be dictated by how the local
MPI implementation F90 interfaces were built.  The default pathf90 linkage
convention is to use g77- or GNU-compatible name mangling; alternatively,
ifort/pgf90 mangling can be used.  The configure script will ask you which
convention you wish to use.  Probably the simplest thing to do is to take the
default, and if you have linkage problems, select the alternative (I build
pmemd at two sites; one uses the default and the other uses the alternative,
and I can never remember which is which).

Beware of Nonstandard MPI Library Installations!

The configure script assumes that lam, mpich, mpich2, mpich-gm, or mvapich will
be installed in a standard manner, with headers in an 'include' subdirectory
and mpi libraries in a 'lib' directory under an mpi home directory which the
configure script requests from the user.  Installation of mpich-gm or mvapich
will also typically require specification of a directory where device libraries
may be found.  Some mpi installers place the mpi implementations in
nonstandard directories, and some even rename library files.  Also, as new
releases of the generic mpi software occur, sometimes the list of required
libraries changes.  For these reasons, you may have to hand-tweak the config.h
after you run configure (typically, attempt a build and see if it works; if it
doesn't, go looking for the problem).  To determine the directories and
libraries for lam, you can enter 'mpif77 -showme'.  For mpich, mpich2, mpich-gm,
or mvapich enter 'mpif77 -link_info'. Note that if there are multiple mpi
implementations installed on the machine, you must be sure that you are
executing the correct mpif77 (doing a 'which mpif77' will show you what you
are running; you then have to ascertain whether that is the right one or not).
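
For example, the following commands (output will of course vary by
installation) let you confirm which mpif77 you are picking up and what it
links against:

   which mpif77            # confirm this is the intended MPI installation
   mpif77 -showme          # lam: show include/library directories and libraries
   mpif77 -link_info       # mpich, mpich2, mpich-gm, mvapich: show the link line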

Beware of static linkage on Linux systems!

None of the Linux installation scripts use static linkage, because there are
Linux distributions in which the statically linked threads
libraries are broken.  Basically, if you statically link an executable and
it includes these threads libraries, you are very apt to get a segment
violation.  However, dynamic linkage is also a headache in that the mechanisms
used to find shared libraries during dynamic loading are not all that robust
on Linux systems running MPICH or other MPI packages.  Typically the
LD_LIBRARY_PATH environment variable will be used to find shared libraries
during loading, but this variable is not reliably propagated to all processes.
For this reason, for the compilers that use compiler shared libraries (ifort,
pathscale), we use LD_LIBRARY_PATH during configuration to set an -rpath
linkage option, which is reliably available in the executable.  This works
well as long as you ensure that the path is the same for all machines running
pmemd.  Earlier versions of ifort actually also set -rpath, but this was
dropped due to confusing error messages when ifort is executed without
arguments.  The static linkage problem has become less common since the
Amber 8 release, but because systems that have these problems are still
"out there", we have opted to be conservative on this issue for the time
being.
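
As an illustration (the compiler library path shown is hypothetical;
substitute the directory where your compiler's shared libraries actually
live), make sure LD_LIBRARY_PATH is set before you run configure so that the
-rpath option is recorded correctly:

   export LD_LIBRARY_PATH=/opt/intel/fc/lib:$LD_LIBRARY_PATH
   ./configure linux_p4 ifort mpich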

Beware of stack overflows!

Well, this should be a problem of the past, as pmemd 9 now resets the stack
limits itself.  If pmemd cannot set the stack limit to a number it believes
to be adequate, it will issue a warning in the mdout file.  If you have
problems with segment violations caused by stack overflows, you may then need
to talk to your system administrator about increasing the hard stack limit.
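
For example, with bash you can inspect the limits yourself (csh/tcsh use the
'limit' command instead); raising the hard limit generally requires
administrator privileges:

   ulimit -s               # current soft stack limit
   ulimit -Hs              # current hard stack limit
   ulimit -s unlimited     # raise the soft limit up to the hard limit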

II. Installation of PMEMD on Unsupported Platforms

The configuration process for pmemd is actually fairly simple, and basically
involves providing a number of environment variables for use by the Compile
script in a file named config.h.  These environment variables control the
choice of compilers, compiler and link options, libraries, and definition of
defined constants that are used in conditionally compiled code.  If you have an
unsupported system, it is fairly simple to generate a config.h for some other
system (the closer to your system in characteristics, the better) and then
modify the values as required.  The only hard part is in choosing the correct
defined constants to use for optimizations and compiler compatibility.  This
has actually gotten even more complicated than it was in pmemd 8 because there
are more optimization options.
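
One reasonable way to start is to generate a config.h for the supported
combination closest to your machine and then edit it by hand (the platform,
compiler, and parallel choices below are only placeholders):

   ./configure linux_p4 ifort mpich
   # then hand-edit config.h: compilers, flags, libraries, and -D defines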

Communications Optimizations

The default code paths in the PMEMD parallel code make extensive use of 
asynchronous mpi i/o and overlap of communications and computation.  For slower
interconnects, this may not produce the best performance because the
interconnect hardware gets overwhelmed and data retransmission is necessary.
For this reason, we introduced the SLOW_NONBLOCKING_MPI defined constant.  It
is currently used for slow interconnects like gigabit ethernet, and should
probably be tried when porting the code to systems with previously untested
interconnects.  In most instances, I would expect the default code paths that
use asynchronous mpi i/o to have better performance.  In fact, the differences
for gigabit ethernet are now minimal, and this option is being maintained
mostly "just in case".  However, it is still the default for gigabit ethernet
platforms.

Use of the interconnect files for LAM, MPICH, MPICH2, MPICH-GM or MVAPICH

The Linux mpi interconnect implementations are now supported by "interconnect"
data files, e.g., config_data/interconnect.mpich.  If a config data file
matching the pattern <architecture opt>.<compiler opt>.<parallel opt> is not
found, the configure script will look for two files:
<architecture opt>.<compiler opt> and
interconnect.<parallel opt>
With this mechanism it is possible to support a wide range of interconnects for
an architecture-compiler combination by only adding one file to config_data.
I would recommend looking at linux_p4.ifort and interconnect.mpich to see how
this is done.  We briefly describe each defined constant below.  These should
of course all be specified as -D<const name> as part of the appropriate
variable. If you have a new processor, it may be best to just pick a
configuration already in use for a similar processor.  The values found in the
release were actually determined to be optimal through extensive testing.
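
For example, for './configure linux_p4 ifort mpich fftw' the script first
looks for config_data/linux_p4.ifort.mpich, and if that file is absent it
falls back to the pair:

   config_data/linux_p4.ifort        (architecture/compiler settings)
   config_data/interconnect.mpich    (interconnect settings)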

Table of Defined Constants Used in PMEMD:

MPI Defined Constants:

Use in: MPI_DEFINES

MPI - Use this to build a multiprocessor version of PMEMD.

SLOW_NONBLOCKING_MPI - See note above on communications optimizations.

USE_MPI_MODULE - Generally, a simple F77-level interface to MPI is used.
                 However, in some systems (AIX), there may be problems
                 interfacing to MPI if the module definitions are not used.
                 This define is provided to allow usage of the MPI module
                 definitions instead of mpif.h.

Use in: F90_DEFINES

FFTLOADBAL_2PROC - Normally FFT loadbalancing is only done in scenarios where
                   there is clearly the potential to benefit from assigning
                   the FFT workload to a subset of the processors. However, on
                   some systems with slow interconnects, there may be an
                   advantage to assigning all the FFT workload to one of the
                   two processors in a two-processor run.  This is still a
                   loadbalancing scenario, because if a performance bottleneck
                   develops in the processor holding all the FFT workload, the
                   FFT slabs can be reassigned.  This is probably worth trying
                   for any slow interconnect.

Direct Force Optimization Defined Constants

The direct force code, where the pairlist is processed to compute direct forces
and energies, is critical to overall pmemd performance.  This code is optimized
differently for different platforms.  The optimizations were originally
developed to take advantage of the characteristics of a given platform, but
with the proliferation of optimizations, the interplay has become difficult to
predict.  For this reason, we typically benchmark all combinations to determine
the best settings.  In past releases the default pmemd build used a pairlist
compression algorithm to reduce the memory required by the pairlist by a factor
of roughly four.  For most machines in common use today and for most problems
currently being attempted, this compression algorithm has a cost in execution
time and it has been dropped from the code base.  The practical fact of the
matter is that for really big problems that would benefit from pairlist
compression, it is best to run at larger processor counts, since the pairlist
is then divided among the processors.

Use in: DIRFRC_DEFINES

DIRFRC_COMTRANS - Take advantage of knowledge that all of an atom's pairs use
                  the same unit cell translation vector.

DIRFRC_EFS - Use a different erfc splining switch than the standard one.  The
             alternative switch avoids the need to use sqrt() for the vast
             majority of direct force pair evaluations, has lower overall
             splining error, thrashes memory less, and is faster on many but
             not all machines.

DIRFRC_NOVEC - Use no vectorization of the direct force workload data in
               evaluating the pairs associated with one atom.  The default
               mechanism is to place data about the pairs in a vector, which
               for many but not all platforms, produces faster code.

SLOW_INDIRECTVEC - On some machines it is actually cheaper to do some
                   unnecessary calculations than it is to introduce an
                   indirection layer in the vectors (basically a set of indexes
                   specifying which entries in the vectors actually need to
                   be used).

HAS_10_12 - Enables use of the Lennard-Jones 10-12 potential instead of the
            Lennard-Jones 6-12 potential.  Little used, and not a performance
            issue...


Compiler Linkage Defined Constants:

Use in: CFLAGS

CLINK_CAPS
DBL_C_UNDERSCORE
NO_C_UNDERSCORE - These options control naming conventions used to link in C
                  code used by PMEMD.  There is very little C code in PMEMD
                  (only pmemd_clib.c), and if you have trouble linking to
                  the routines here, you may need to use one of these defined
                  constants.  Please look at how the defined constants control
                  name definition and what your compiler expects if you have
                  problems.
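
To make the mapping to config.h concrete, a purely hypothetical fragment
might set these variables roughly as follows; the particular -D flags that
are optimal depend entirely on your platform (and the C optimization flag is
only a placeholder), so treat this strictly as a sketch of the format:

   MPI_DEFINES = -DMPI -DSLOW_NONBLOCKING_MPI
   F90_DEFINES = -DFFTLOADBAL_2PROC
   DIRFRC_DEFINES = -DDIRFRC_COMTRANS -DDIRFRC_EFS
   CFLAGS = -O2 -DNO_C_UNDERSCORE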

III. Representative Benchmarks

Attached is an Excel spreadsheet/graph file containing benchmark data for three
larger systems on which I run pmemd (readable with either Excel or OpenOffice
on Linux, see PMEMD_Benchmarks.xls).  The benchmarks include:
1) Human Factor Va, NPT and NVT ensembles.
2) Factor IX, NPT and NVT ensembles.
3) JAC (Joint Amber-Charmm benchmark), NVE ensemble.

These benchmarks span the range of 23,558 to 193,013 atoms, and are all
PME explicit solvent simulations.  The HFVA benchmark was generously provided
by Dr. Chang Jun Lee of Prof. Lee Pedersen's lab, and is not yet generally
available.  The Factor IX benchmark was generously provided by Dr. Lalith
Perera, originally of Prof. Lee Pedersen's lab, and currently at NIEHS in
Dr. Tom Darden's lab.  A variant of this benchmark is used as part of the
Amber benchmark suite.  When benchmarking pmemd, I always run Factor IX in both
NVT and NPT (more demanding) ensembles, whereas the Amber benchmark suite
version is NVT.  I also typically dump trajectory data every 250 steps, which
represents reasonable trajectory overhead, whereas many benchmarks don't write
trajectories.  At low processor count, this does not matter much.  At high
processor count, omitting trajectory output is misleading because disk i/o in
the master process can become a bottleneck.  The benchmarks were run on the
following three machines:

bassi.nersc.gov: - An IBM AIX p575 Power 5 (sp5), 1.9 GHz cpu's, 111 nodes with
                   8 cpu's per node - a very well-tuned machine (better than
                   other sp5's to date).

bigben.psc.edu: - A Cray XT3 MPP system, 2.4 GHz Opteron cpu's, 2068 nodes with
                  1 cpu per node.

jacquard.nersc.gov: - A Linux Opteron Infiniband cluster, 2.2 GHz cpu's, 320
                      nodes with 2 cpu's per node.

Thanks to both NERSC and PSC for access to these fine machines.

Regarding NetCDF, these benchmarks were all run before I had NetCDF working,
so they represent results associated with writing formatted trajectory files.
I have since rerun the 320 processor benchmark for Factor IX NVT and NPT on
bassi, and saw the following improvements:

Factor IX NVT, formatted trajectory: 15.4 nsec/day
Factor IX NVT, NetCDF    trajectory: 16.6 nsec/day

Factor IX NPT, formatted trajectory: 13.5 nsec/day
Factor IX NPT, NetCDF    trajectory: 14.6 nsec/day

So what about smaller machines?  Well, we have decent speedups relative to
pmemd 8 on my favorite Pentium 4 3.2 GHz / Gigabit ethernet setup.  The MPI
implementation here is MPICH 1.2.6.

Factor IX NPT, 90,906 atoms, 8 Angstrom cutoff, 1.5 fs step, PME

# procs       pmemd 9      pmemd 8
              psec/day     psec/day

1              116           86
2              182          138
4              293          238

JAC NVE, 23,558 atoms, 9 Angstrom cutoff, 1 fs step, PME

# procs       pmemd 9      pmemd 8
              psec/day     psec/day

1              254         185
2              432         315
4              702         572


- Bob Duke
  NIEHS and UNC-Chapel Hill
