RE: amber-developers: PIMD, NEB, LES - request for code inspection and tests

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 24 Jul 2007 08:31:31 -0700

Hi Carlos,

> I've been discussing this with Ross and Dave M.
> I'm making changes needed to have NEB work with explicit water,
> which it won't currently do.

Now that I have some time to stop and look at this, I remember that there
actually exists a version of NEB that works as partial NEB with explicit
water etc. I'm not sure that it made it into Amber 9 but I believe it is in
the Amber 10 tree somewhere - likely just before it was converted over to
multisander. I don't know exactly where in the tree this is - it is probably
around revision 9.13 of sander:

revision 9.13
date: 2006/10/17 15:58:02; author: zweig; state: Exp; lines: +3 -17
New pimd implementation based on multi-sander framework. Meanwhile, remove
the following executables: sander.PIMD, sander.CMD, sander.PIMD.A1ST,
sander.CMD.A1ST from compilation

I.e. just before this. Wei implemented this since, given the way NEB worked
then (through PIMD), it seemed quite easy to do. The approach was that for
partial NEB the non-replicated sections saw the average force over all of the
replicas - I believe this is the same way LES works? I think this worked with
explicit solvent PME and with QM/MM as long as the QM region was entirely
within the replicated NEB section. I don't know if it was ever formally
tested, though, beyond checking that it worked on a toy case. The approach
really needs proper validation, since careful thought needs to go into how
the non-replicated section is dealt with. For example, suppose we have a long
chain in explicit water that is completely stretched out at one end point and
folded up into a ball at the other end point. The change in solvent
distribution along the pathway will be huge, and with only one replication of
solvent you will get vacuum bubbles along the pathway etc. Plus in some cases
I imagine you'd get huge VDW forces that might require something like soft
core in order to work. This is all speculation though, as I don't think it
was ever tested.
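To make the force-averaging part concrete, something along these lines is
what I mean - this is only a rough sketch, assuming one MPI task per replica
for simplicity, and the names (natom_common, nreplica, comm_masters etc.) are
made up for illustration rather than being actual sander variables:

   ! Each replica computes its own forces; the atoms outside the replicated
   ! region then feel the mean of those forces over all replicas (LES-style).
   subroutine average_common_forces(f, natom_common, nreplica, comm_masters)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: natom_common, nreplica, comm_masters
      double precision, intent(inout) :: f(3*natom_common)
      integer :: ierr

      ! Sum the forces on the non-replicated (common) atoms over all
      ! replicas, then divide by the number of replicas so each copy
      ! sees the average force.
      call MPI_Allreduce(MPI_IN_PLACE, f, 3*natom_common, &
                         MPI_DOUBLE_PRECISION, MPI_SUM, comm_masters, ierr)
      f = f / dble(nreplica)
   end subroutine average_common_forces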

The aim of moving to multisander and ditching all of these extra options was
to eliminate the communication overhead of having everything replicated. That
overhead stemmed from the original sander (egb) implementation, where
everything was replicated anyway, so it made no difference what approach was
taken with regard to the distribution of coordinates etc. This was never
completed, though: Wei moved NEB out of PIMD and into multisander but left it
with the mpi_bcast's etc. What it now needs is modifying so that
communication of coordinates and forces occurs only between immediate
neighbours - possibly in only one direction, but I'd have to check this.
Ideally this should be done with a non-blocking approach so that computation
can be overlapped with the communication; then we should be able to scale to
huge numbers of processors - I would expect at least 16 to 32 cpus per
replica...
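For the neighbour-only communication, the kind of thing I have in mind looks
roughly like the following - again just a sketch with made-up names, one task
per replica, and no claim that this is how it should finally be wired into
sander:

   ! Each image posts non-blocking receives/sends for its immediate
   ! neighbours' coordinates, does whatever work it can in the meantime,
   ! and only then waits on the requests.
   subroutine neb_neighbour_exchange(x, xprev, xnext, natom, rep, nrep, comm)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: natom, rep, nrep, comm
      double precision, intent(in)  :: x(3*natom)
      double precision, intent(out) :: xprev(3*natom), xnext(3*natom)
      integer :: req(4), stat(MPI_STATUS_SIZE,4), ierr, left, right, nreq

      left  = rep - 1          ! previous image along the path
      right = rep + 1          ! next image along the path
      nreq  = 0

      if (left >= 0) then
         nreq = nreq + 1
         call MPI_Irecv(xprev, 3*natom, MPI_DOUBLE_PRECISION, left,  10, &
                        comm, req(nreq), ierr)
         nreq = nreq + 1
         call MPI_Isend(x,     3*natom, MPI_DOUBLE_PRECISION, left,  20, &
                        comm, req(nreq), ierr)
      end if
      if (right <= nrep-1) then
         nreq = nreq + 1
         call MPI_Irecv(xnext, 3*natom, MPI_DOUBLE_PRECISION, right, 20, &
                        comm, req(nreq), ierr)
         nreq = nreq + 1
         call MPI_Isend(x,     3*natom, MPI_DOUBLE_PRECISION, right, 10, &
                        comm, req(nreq), ierr)
      end if

      ! ... the replica's own force evaluation can be overlapped here,
      !     since it doesn't need the neighbour coordinates; the NEB
      !     tangent/spring terms are computed after the wait ...

      call MPI_Waitall(nreq, req, stat, ierr)
   end subroutine neb_neighbour_exchange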

But yes, this does require changing the way in which NEB operates. At the
same time the output could be greatly reduced, since at the moment there is
massive duplication of data across all of the nreplica outputs - plus there
is a ton of multisander debug output being written to standard out by Amber
10 that should probably be turned off. The way the timings are done at the
end of the run also needs to be radically changed to avoid bottlenecks - for
example, running on 2048 cpus of Abe at the moment, it takes almost 30
minutes at the end of a run just to collate all the timings and write the
profiling etc. We need some way of turning this off when you don't need the
debugging info - i.e. only the master thread writes the timings, and it
writes just its own timings rather than the average over all nodes - or we
need a much smarter way of calculating the average.
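For the average, a single reduce of the timer array with only the master
writing would avoid most of that cost. Again this is just a sketch: the timer
bookkeeping is made up, and mytaskid/numtasks/commsander are used here only
in the style of sander's parallel variables:

   ! Average each timer over all tasks with one collective and have
   ! only the master write the result, instead of collating every
   ! node's full timing table.
   subroutine report_mean_timings(timers, ntimers, mytaskid, numtasks, &
                                  commsander)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: ntimers, mytaskid, numtasks, commsander
      double precision, intent(in) :: timers(ntimers)
      double precision :: mean_timers(ntimers)
      integer :: ierr, i

      call MPI_Reduce(timers, mean_timers, ntimers, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, 0, commsander, ierr)
      if (mytaskid == 0) then
         mean_timers = mean_timers / dble(numtasks)
         do i = 1, ntimers
            write(6,'(a,i4,a,f12.3)') ' timer ', i, ' mean (s) = ', &
                                      mean_timers(i)
         end do
      end if
   end subroutine report_mean_timings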

All of this has been on my todo list for ages :-( So Carlos, if you want to
try and do it then that is great, and I am happy to help out where I can.

On an associated note, I think we need a huge audit of the parallel code,
since there is so much unnecessary communication going on in certain parts of
the code... We really should have some guidelines on the use of MPI - i.e. it
is NOT okay to be using ANY of the all_to_all communicators or bcasts outside
of the initial setup unless you really, really, really can justify the need
for them. And "because it makes the code easier to read" is really not a
justification...

Anyway, more stuff for the to-do list - maybe we need a week-long code
retreat somewhere where a number of us get together and do nothing but work
on cleaning up the code. We could tag this onto the end of the developers'
meeting if anyone wants to stay in San Diego for longer.
 
All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
Received on Wed Jul 25 2007 - 06:07:35 PDT