Re: amber-developers: ene and ener array mess in sander

From: Robert Duke <rduke.email.unc.edu>
Date: Sat, 22 Nov 2008 09:46:51 -0500

Hi Ross,
I think a lot of this has to do with the layering of pme on top of an
already-existing energy array structure, and it is indeed a mess. Tom D.
would have to comment, but I think he attempted to get some isolation from
the rest of the sander code with the second "virtual" ene array at an offset
into ener(). I dredged through this all so long ago that I don't recollect
the sander specifics, but I DID clean up the logical structure for all this
stuff somewhat in pmemd. By looking at that code, starting in runmd and
sinking into pme_force (vs. gb_force), you may be able to see how I sorted
some of this stuff out, and that may give you some hints, or possibly
confuse you further. I also created "state information" structures to
contain a lot of this junk, specific to the electrostatics implementation
(pme vs. gb). Thus, I don't deal with arrays with only pieces of the data
relevant for a particular run.
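
Roughly, the idea is something like this (a from-memory sketch - the names
are illustrative, not the actual pmemd identifiers):

module state_info_sketch
  implicit none
  ! Sketch only: a per-method "state record" so each force routine fills a
  ! named slot instead of writing to a magic offset in a shared array.
  type pme_pot_ene_rec
    double precision :: total = 0.d0
    double precision :: vdw_tot = 0.d0, vdw_dir = 0.d0, vdw_recip = 0.d0
    double precision :: elec_tot = 0.d0, elec_dir = 0.d0, elec_recip = 0.d0
    double precision :: elec_self = 0.d0
    double precision :: bond = 0.d0, angle = 0.d0, dihedral = 0.d0
    double precision :: vdw_14 = 0.d0, elec_14 = 0.d0
    double precision :: restraint = 0.d0
  end type pme_pot_ene_rec
end module state_info_sketch

A gb run gets its own record with only the fields that make sense for gb.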
It could be that I have abstracted things so far away from the original
sander implementation that it won't be that much
help in the current pmemd code; you may instead want to look at amber 8
pmemd, where I still have the original arrays, and where a lot of the sander
accretions are stripped away, leaving the core of what is necessary for pme
(and I added some comments as I figured things out). The most relevant
files in pmemd 8 are runmd_cit.f90 and force_cit.f90. Some of the array
copies of course deal with energy avgs, but I presume there may be some
stuff there from TI, possibly other stuff that is specific to added
functionality. When/if the time comes to add charmm stuff to pmemd, I would
like to do it. I probably will also reimplement the scee/scnb stuff you
added earlier to improve cache utilization (the most noticeable problem I
have seen, though I have not studied it much - basically data is getting
knocked out of the cache). Sander needs a total
rewrite; of course no one has the time or money, so we will all just pay the
price in lost productivity a little bit at a time, every day.
Regards - Bob

----- Original Message -----
From: "Ross Walker" <ross.rosswalker.co.uk>
To: <amber-developers.scripps.edu>
Sent: Saturday, November 22, 2008 1:06 AM
Subject: amber-developers: ene and ener array mess in sander


> Hi All,
>
> Does anyone actually understand how the crazy ene / ener mess in sander
> works?
>
> I am trying to add support for charmm impropers and urey-bradley terms to
> this array but it seems to be a complete and utter disaster.
>
> Let's start in sander.f:
>
> _REAL_ ener(30)
> _REAL_ ene(51)
>
> call force(x,ix,ih,ipairs,x(lcrd),x(lforce),ener,vir, &
> x(l96), x(l97), x(l98), x(l99), qsetup, &
> do_list_update,n_force_calls)
>
> So it passes ener (of size 30) into force.
>
> It also passes ene to a ton of other routines, all of which declare ene
> either with a hard-coded size of 51 or as assumed size.
>
> So in runmd we have a series of arrays all of fixed size 51:
>
> _REAL_ enert(51),enert2(51),ener(51)
> _REAL_ enert_old(51),enert2_old(51),ecopy(51),edvdl(51), &
> enert_tmp(51),enert2_tmp(51),etot_start,edvdl_r(51)
>
> Not a comment in sight for what the hell any of these are for or what they
> contain.
>
> So then we have the call to force inside runmd:
>
> call force(xx,ix,ih,ipairs,cartpos,f,ener(23),vir, &
> xx(l96),xx(l97),xx(l98),xx(l99), qsetup, &
> do_list_update,nstep)
>
> So here it passes in ener with an OFFSET of 23!!! <sigh...!> - which, since
> ener is declared with a fixed size of 51, means it is of size 29 inside
> force - even though the ener passed to force from sander was of size 30!!!
>
> So then inside force we have:
>
> _REAL_ ener(*) ! offsets in this ener array = offsets in runmd ener - 22
> _REAL_ ene(30)
>
> So ener(*) is actually of size 29 inside force since it was passed in with
> an offset of 23 and is size 51 outside (actually ener in force is of size
> 30 when called directly from sander, since sander's ener was passed here).
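>
> To make the aliasing explicit, here is a minimal standalone sketch of what
> passing an element at an offset into an assumed-size dummy actually does
> (illustrative names only, nothing from sander):
>
> program offset_demo
>   implicit none
>   double precision :: big(51)
>   big = 0.d0
>   call sub(big(23))           ! sequence association: sub sees big(23:51)
>   print *, big(23), big(51)
> contains
>   subroutine sub(a)
>     double precision :: a(*)  ! assumed size - no bounds for the compiler
>     a(1)  = 1.d0              ! this is big(23)
>     a(29) = 2.d0              ! this is big(51), the last legal element
>     ! a(30) and beyond would silently run off the end of big
>   end subroutine sub
> end program offset_demo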
>
> Tracing ener through force.f we see that typically it is used up to element
> 28. However on line 1165 it is passed to cns_xref_run:
>
> call cns_xref_run(natom,ih(m04), x,f,ener)
>
> Now looking inside cns_xref_run we have:
>
> subroutine cns_xref_run(natom,ih,x,fg,ener)
> _REAL_, intent(inout) :: ener(*)
> ...
> call rdxraycns(natcns,medcns,ener,fg)
> ...
> ener(16) = ener(47)
> ener(18) = ener(49)
> ener(19) = ener(50)
>
> So cns_xref_run accesses elements 47, 49 and 50 of ener even though ener is
> only of size 1:29 in this scope! DOH!
>
> Then it gets EVEN WORSE:
>
> inside rdxraycns:
>
> subroutine rdxraycns(natcns,medcns,ener,fg)
>
> implicit none
> _REAL_ ener(*),fg(*),xgrd1,xgrd2,xgrd3
> ...
> read(ipdb,'(E20.14)') ener(47)
> read(ipdb,'(A80)') line
> read(ipdb,'(E20.14)') ener(48)
> read(ipdb,'(A80)') line
> read(ipdb,'(F8.6,1X,F8.6)') ener(50),ener(49)
>
> So this actually writes to elements ener(47) through ener(50). This is
> completely illegal as far as I can tell and is just corrupting memory
> somewhere - I am shocked it doesn't segfault.
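>
> To see where those writes actually land, chase the offsets by hand (a
> sketch using only the sizes quoted above):
>
> program where_does_47_land
>   implicit none
>   integer, parameter :: runmd_size = 51  ! ener(51) in runmd
>   integer, parameter :: offset     = 23  ! force receives runmd's ener(23)
>   integer, parameter :: idx        = 47  ! element written in rdxraycns
>   ! force, cns_xref_run and rdxraycns all alias runmd's ener(offset:) by
>   ! sequence association, so index idx in the callee maps back to runmd as:
>   print *, 'runmd index =', offset + idx - 1, 'but the array ends at', runmd_size
> end program where_does_47_land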
>
> And this is JUST TOUCHING THE SURFACE of this ene / ener mess!
>
> Then in evb_ntrfc we have ener passed in from force and lines like:
>
> ener(1:26) = 0.0d0
>
> Any reason why it is zeroing just elements 1:26 of ener here and not the
> full 1:29 (or 1:30 if force was called from sander)?
>
> Then on line 1206 of force we have:
>
> if (master) call mpi_reduce(ener,totener(23),28,MPI_DOUBLE_PRECISION, &
>                             MPI_SUM,0,commmaster,ierr)
>
> So only elements 1 to 28 of ener here are reduced.
>
> Then inside fdist (called from force with ene) we have:
>
> do i = 2,30
> f(j+1+i) = ene(i)
> end do
>
> so elements 2 to 30 of ene are packed for the reduction even though, as far
> as I can tell, ene is only used up to element 28.
>
> and then we have horrendous statements like:
>
> call mpi_allreduce(MPI_IN_PLACE,f(j),33,MPI_DOUBLE_PRECISION,mpi_sum, &
>                    commsander,ierr)
> vir(1) = f(j)
> vir(2) = f(j+1)
> vir(3) = f(j+2)
> do i = 2,30
> ene(i) = f(j+i+1)
> end do
>
> where the 33 is hard coded - change the size of ene and you get screwed,
> but you would be hard pressed to find this, given it is hard coded as 33
> (rather than the 30 for ene) because the buffer also contains the virials
> etc...
>
> So, having spent 4 hours this evening trying to extend these arrays to
> include the charmm urey-bradley and improper terms, and finding that if I
> change anything I end up breaking things either in parallel, serial or
> both, I am hoping that someone here has a decent understanding of how
> these ene / ener arrays are actually supposed to work.
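>
> If anyone does touch this, one thing that would help (just a sketch of the
> idea, not a claim about how sander is actually laid out) is to hang all of
> these magic sizes off named constants in one module, so that adding the
> urey-bradley and improper slots becomes a one-line change:
>
> module ene_sizes_sketch
>   implicit none
>   ! Illustrative only - the real offsets and sizes would have to come from
>   ! a proper audit of sander.
>   integer, parameter :: n_ene     = 30              ! energy slots in force
>   integer, parameter :: n_vir     = 3               ! virial components
>   integer, parameter :: n_ene_buf = n_vir + n_ene   ! pack/reduce buffer
> end module ene_sizes_sketch
>
> Then the 28s, 30s, 33s and 51s scattered through force, fdist and runmd
> would at least move together, instead of having to be hunted down one by
> one.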
>
> And the memory errors in cns_xref etc (and possibly others elsewhere that I
> haven't found) should be fixed.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>