Re: [AMBER-Developers] New features in pmemd

From: Josh Berryman <the.real.josh.berryman.gmail.com>
Date: Fri, 14 Feb 2014 10:38:13 +0100

>>The point here is not so much the emil case but the fact that the ti if
>>statement is there in the first place. Adding a third expression just
>>makes it worse.

I don't want to get drawn into this... but I am surprised that it's an
issue. Having my first look around the pmemd code in a few years, it
appears to me that the policy for the ti code has been to wrap it
inside high-level conditionals, e.g. pairs_calc_ti3 is not called if
TI is not switched on. If that isn't the case, then yes, it should
be. As long as TI is still only for the brave (and let's admit it, it
still isn't a mainstream activity), isn't our main concern just to
insulate it from normal pmemd runs?

On optimising the structure of the ti code blocks: I agree that a
fixed-size array of flags is the wrong choice when most of the entries
will be 0 anyway, however Fortranny it may be. For normal softcoring,
a good approach might be to keep three lists of atoms (appearing /
disappearing / do nothing), or perhaps just to repack the order of the
atom ids and save the positions of the boundaries, and then build two
separate sets of neighbour lists, one for softcored interactions and
one for the rest. This would fit nicely with emil_sc, since in that
case the list of non-softcore interactions would simply be empty. A
rough sketch of the repacking idea is below.
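
For concreteness, something like the following is what I have in mind.
The names here (repack_softcore, packed_ids, sc_boundary and so on)
are made up for illustration, not pmemd's actual data structures:

subroutine repack_softcore(n_atoms, ti_sc_lst, packed_ids, sc_boundary)
  implicit none
  integer, intent(in)  :: n_atoms
  integer, intent(in)  :: ti_sc_lst(n_atoms)   ! nonzero = softcore atom
  integer, intent(out) :: packed_ids(n_atoms)  ! atom ids, softcore block first
  integer, intent(out) :: sc_boundary          ! last index of the softcore block
  integer :: i, head, tail

  ! Softcore atoms are packed from the front, everything else from the
  ! back, so a pair loop can decide "softcore or not" from the index
  ! alone, and two separate neighbour lists can be built over the two
  ! blocks.  With emil_sc the non-softcore block is simply empty.
  head = 0
  tail = n_atoms + 1
  do i = 1, n_atoms
    if (ti_sc_lst(i) .ne. 0) then
      head = head + 1
      packed_ids(head) = i
    else
      tail = tail - 1
      packed_ids(tail) = i
    end if
  end do
  sc_boundary = head
end subroutine repack_softcore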

>>You guys realize that there is no short circuit evaluation in Fortran
>>right?

Actually I had no idea... long-time C programmer here. Thanks for the tip.

So, if we switch:

if (ti_sc_lst(real_i)+ti_sc_lst(real_j) .eq. 2) then
   do stuff
end if

for:

if (ti_sc_lst(real_i) .eq. 0) then
   if (ti_sc_lst(real_j) .ne. 0) then
      do stuff
   end if
else if (ti_sc_lst(real_j) .ne. 2) then
   do the same stuff
end if

In the first version we have an add, and then a compare-and-branch on
the result.

In the second version we have a compare-and-branch, then another
compare-and-branch. Is this really better? To me it is a bit more
code to load for the same number of operations.

>>And why is there arithmetic inside the if statement! - These things should
>>at the absolute least be booleans!!!

Going back to Thomas, or whoever implemented this stuff originally,
they were probably thinking in C as well, where bool is just a handy
alias for int. Perhaps there is indeed something to be gained from
using F90's LOGICAL datatype... maybe some compilers/chips will even
store them as bytes, so that you can load a block of 8 at a time. Now
that we've agreed there is scope for much further-reaching
optimisation, though, I'm not sure there is any point in making that
tweak on its own.
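
For what it's worth, the plain-TI pair test with a LOGICAL mask could
look something like this (illustrative names only; the mask is built
once outside the inner loop, and the emil case would need its own
mask):

logical, allocatable :: is_sc(:)

allocate(is_sc(n_atoms))
is_sc(:) = (ti_sc_lst(1:n_atoms) .ne. 0)   ! build the mask once, outside the loop

! ... in the pair loop, "exactly one of the pair is softcore" is just
! an exclusive-or on two logicals, with no add and no short-circuit
! worries:
if (is_sc(real_i) .neqv. is_sc(real_j)) then
   ! do stuff
end if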

Josh

On 13/02/2014, Ross Walker <ross.rosswalker.co.uk> wrote:
> You guys realize that there is no short circuit evaluation in Fortran
> right?
>
> So
>
> if (i == 1 .and. j == 1) then
>
> will always be slower than
>
> if (i == 1) then
> if (j ==1 ) then
>
> And arranging this such that the most likely false evaluation is first is
> optimal.
>
> The point here is not so much the emil case but the fact that the ti if
> statement is there in the first place. Adding a third expression just
> makes it worse.
>
> And the fact that these are evaluations of an integer array is even worse
> since the compiler can't factor it out of the loop here.
>
> And why is there arithmetic inside the if statement! - These things should
> at the absolute least be booleans!!!
>
> /fail. :-(
>
>
>
>
> On 2/13/14, 12:25 AM, "Josh Berryman" <the.real.josh.berryman.gmail.com>
> wrote:
>
>>>>Sure, CPUs are really good at this sort of thing, and outside the inner
>>>>loops I don't care about them. But here, I do.
>>
>>Absolutely, every time there is an if or a divide inside a loop over N
>>a baby seal dies.
>>
>>I prevailed on Joe to add the emil_sc feature for me as someone
>>mentioned that he had his fingers in the ti code already. I find it
>>immensely useful and like it a lot, but yes, ideally there would be a
>>longer "soak" period where I have a look at the changes made (in parts
>>of the code which I would not really have been confident myself to
>>start the ball rolling in), explore performance and look for
>>optimisations.
>>
>>I know that it is getting late in the day but it seems to me that
>>there is a really obvious fix for this particular extra cost, so no
>>need to mess around with compile flags in the name of performance.
>>
>>If instead of 0 or 1 as the ti_sc_lst flags we use 0 or 2 (or 1 for
>>emil_sc) then
>>the test goes from:
>>
>>if ((ti_sc_lst(real_i)+ti_sc_lst(real_j) .eq. 1) .or. &
>>    (emil_sc_lcl .eq. 1 .and. &
>>     ti_sc_lst(real_i)+ti_sc_lst(real_j) .ge. 1))
>>
>>to:
>>
>>if (ti_sc_lst(real_i)+ti_sc_lst(real_j) .eq. 2)
>>
>>which is back to the efficiency that we had before the change. I am
>>happy to implement this and commit, after all the code freeze is
>>tomorrow, not today, right?
>>
>>Josh
>>
>>On 13/02/2014, Scott Le Grand <varelse2005.gmail.com> wrote:
>>> David,
>>> When I start seeing global switch tests in inner loops:
>>>
>>> else if ((ti_sc_lst(real_i)+ti_sc_lst(real_j) .eq. 1) .or. &
>>>          (emil_sc_lcl .eq. 1 .and. &
>>>           ti_sc_lst(real_i)+ti_sc_lst(real_j) .ge. 1)) then
>>>   if (ti_lst(1,real_i)+ti_lst(1,real_j) .ge. 1) then !TI region 1
>>>
>>> this has always been a reliable warning sign for me in the past that the
>>> overall structure of the code needs reorganization.
>>>
>>> The GPU code does this sort of thing at compile time and that's where
>>> decisions like this belong IMO. And it's not any one if/then that is
>>>the
>>> problem but rather that these guys tend to reproduce like bunnies once
>>>they
>>> appear.
>>>
>>> Sure, CPUs are really good at this sort of thing, and outside the inner
>>> loops I don't care about them. But here, I do.
>>>
>>> OTOH the more there are, the more I can refactor them out of GPU
>>> incarnations and that's free performance beer.
>>>
>>> That and I'm on the verge of rethinking the GPU code itself because so
>>>many
>>> assumptions that were valid in 2010 no longer hold.
>>>
>>> Scott
>>>
>>> On Wed, Feb 12, 2014 at 6:46 PM, David A Case
>>> <case.biomaps.rutgers.edu>wrote:
>>>
>>>> On Wed, Feb 12, 2014, Jason Swails wrote:
>>>> > >
>>>> > > What's left in Sander that people crave for in PMEMD?
>>>> > >
>>>> >
>>>> > QM/MM.
>>>>
>>>> Certainly, QM/MM is in sander but not pmemd. But do people "crave for
>>>> it"
>>>> to be added to pmemd? The usual argument is that the QM part quickly
>>>> becomes
>>>> the time bottleneck, so the fact that MM is faster in pmemd becomes
>>>> irrelevant -- there is little need (yet) to think about porting QM/MM
>>>>to
>>>> pmemd. (I could be wrong here.)
>>>>
>>>> Similar arguments apply for PBSA and RISM: these are both in sander,
>>>>but
>>>> are slow, and limit performance. Again, any speed advantage PMEMD has
>>>> for
>>>> the other parts of the calculation are minimal. So there is no hurry
>>>>to
>>>> put PBSA or RISM into pmemd.
>>>>
>>>> The emap functionality is similar, but maybe not identical, since it is
>>>> not
>>>> as big a time-hog as QM, PBSA or RISM. So far, I don't think most emap
>>>> calculations (and there aren't very many of them, since only a few
>>>>people
>>>> are using this capability) need highly efficient MM. But there is more
>>>> of
>>>> an argument here that emap capability in pmemd might be desirable.
>>>>
>>>> On the other hand, I don't really see Scott's argument, that emap (et
>>>>al)
>>>> somehow make it hard to keep going on pmemd.cuda development. It's
>>>>quite
>>>> modular, and there is a simple if-statement in mdin to disable this for
>>>> GPUs.
>>>> We've always had a list of things that aren't supported for GPUs;
>>>>having
>>>> a
>>>> few more entries in that list doesn't seem that bad (to me). We're
>>>> learning
>>>> how to document this better, both in the code and in the Manual.
>>>>
>>>> Scott says he wants "more control ... on feature additions that will
>>>>run
>>>> on the GPU." We *should* establish a better mechanism for suggesting
>>>>and
>>>> vetting requests for new GPU features. But I don't (yet) see that
>>>> another
>>>> code fork is needed to accomplish this.
>>>>
>>>> ...dac

_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Feb 14 2014 - 02:00:03 PST