Re: amber-developers: PIMD, NEB, LES - request for code inspection and tests from Wei Zhang on 2007-07-24 (Amber Developers Archive Jul 2007)

From: Wei Zhang <zweig.scripps.edu>
Date: Tue, 24 Jul 2007 10:56:53 -0500

Ross Walker wrote:

>Hi Carlos,
>
>
>
>>I've been discussing this with Ross and Dave M.
>>I'm making changes needed to have NEB work with explicit water,
>>which it won't currently do.
>>
>>
>
>Now I have some time to stop and look at this I remember that there actually
>exists a version of NEB that works as part NEB with explicit water etc. I'm
>not sure that it made it into Amber 9 but I believe it is in the Amber 10
>tree somewhere - likely just before it was converted over to multisander. I
>don't know exactly where in the tree this is - it is likely around revision
>9.13 of sander:
>
>revision 9.13
>date: 2006/10/17 15:58:02; author: zweig; state: Exp; lines: +3 -17
>New pimd implementation based on multi-sander framework. Meanwhile, remove
>the following exectables: sander.PIMD, sander.CMD, sander.PIMD.A1ST,
>sander.CMD.A1ST from compilation
>
>I.e. just before this. Wei implemented this since the way NEB worked then,
>through PIMD, it seemed quite easy to do. The approach was that for partial
>NEB the non-replicated sections saw the average force of all of the replicas
>- I believe this is the same way LES works?
>
Actually there is a slignt difference between the LES-pme and PIMD-pme
implementation. PIMD acted in the way you mentioned, that non-replicated
sections saw the average force of all the replicas. LES-pme implementaion
is more complicated then that. I think Carlos has a paper in JACS discussing
this problem. It is complicated because it can handle multiple-LES region,
i.e. you can two regions which were replicated respectively.

>
>The aim of moving to multisander and ditching all of these extra options was
>so that the communication overhead of having everything replicated, which
>stemmed from the original sander (egb) implementation, where everything was
>replicated anyway so it made no difference what approach was taken with
>regards to distribution of coordinates etc, but this was never completed.
>Wei moved NEB out of PIMD and into multisander but left it with the
>mpi_bcast's etc. What it now needs is modifying so that communications of
>coordinates and forces only occurs between immediate neighbours - possibly
>in only one direction but I'd have to check this. Ideally this should be
>done with a non-blocking approach and then computation can be overlapped
>with the communication and we should be able to scale to huge numbers of
>processors, I would expect at least 16 to 32 cpus per replica...
>
>
In fact I did not move NEB to multi-sander. the old code is left there
since I think we might need partial NEB someday, since we already have
partial PIMD working well, it won't be difficult to implement partial NEB,
one just need to implement the subroutine part_neb_forces() according
to full_neb_forces().

>But yes this does require changing the way in which NEB operates. At the
>same time the output could be greatly reduced since at the moment there is
>massive duplication of data in all the nreplica outputs - plus there is a
>ton of multisander debug stuff being written to standard out by Amber 10
>that should probably be turned off. - Plus the way the timings are done at
>the end of the run needs to be radically changed to avoid bottlenecks - for
>example running on 2048 cpus of abe at the moment it requires almost 30
>minutes at the end of a run just to collate all the timings and write the
>profiling etc. We need some way of turning this off when you don't need the
>debugging info - I.e. Only the master thread writes the timings and then it
>just writes it's own timings rather than the average over all nodes. - Or we
>just need a much smarter way of calculating the average.
>
>All of this has been on my todo list for ages :-( So Carlos if you want to
>try and do it then that is great and I am happy to help out as I can.
>
>On an associated note I think we need a huge audit of the parallel code
>since there is so much uneccessary communication going on in certain parts
>of the code... We really should have some guidelines on use of MPI - I.e. it
>is NOT okay to be using ANY of the all_to_all communicators or bcasts
>outside of the initial setup unless you really really really can justify the
>need for them. And "because it makes the code easier to read" is really not
>a justification...
>
>Anyway, more stuff for the to do list - maybe we need a week long code
>retreat somewhere where a number of us get together and do nothing but work
>on cleaning up the code. We could tag this onto the end of the developers
>meeting if anyone wants to stay in San Diego for longer.
>
>All the best
>Ross
>
>/\
>\/
>|\oss Walker
>
>| HPC Consultant and Staff Scientist |
>| San Diego Supercomputer Center |
>| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>| http://www.rosswalker.co.uk | PGP Key available on request |
>
>Note: Electronic Mail is not secure, has no guarantee of delivery, may not
>be read every day, and should not be used for urgent or sensitive issues.
>
>
>
>
>
>
Received on Wed Jul 25 2007 - 06:07:36 PDT