Re: [AMBER-Developers] Desmond SUX

From: David Cerutti <dscerutti.gmail.com>
Date: Fri, 14 May 2021 04:36:55 -0400

So 8x8 tiles are actually the best way to go here, with each warp doing two
of them at a time--which is what I was saying earlier.

But how do we best use the tensor cores to build the pair list? If we have
four atoms with coordinates a1 = { x1 y1 z1 }, ..., a4 = { x4 y4 z4 } and
another four, b5 = { x5 y5 z5 }, ..., b8 = { x8 y8 z8 }, then we can form
matrices A = [ x1 y1 z1 0 ; x2 y2 z2 0 ; ... ; x4 y4 z4 0 ] and B = [ x5 x6
x7 x8 ; y5 y6 y7 y8 ; z5 z6 z7 z8 ; 0 0 0 0 ]. The product A*B then holds
the cross terms ai.bj, while the diagonals of A*(A') and (B')*B hold |ai|^2
and |bj|^2, so Cij = |ai|^2 - 2*(A*B)ij + |bj|^2 is the squared interatomic
distance. Is there anything more efficient we can do to squeeze out the
last 1/4 of the FLOPS in that tensor computation--the ones spent on the
zero padding in the fourth coordinate?
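
A minimal sketch of the idea against the WMMA API (illustrative only:
pairDist2Tile, the 16-atom groups, and the one-warp launch are assumptions
of the sketch, not anything in pmemd):

  // A sketch only (requires sm_70 or later).  One warp computes the 16x16
  // block of squared distances between two groups of 16 atoms: a single
  // half-precision 16x16x16 MMA produces the cross terms ai.bj, and the
  // squared norms are added back in FP32.  Half-precision coordinates are
  // adequate for a pair-list mask with a safety margin, not for forces.
  // Launch: pairDist2Tile<<<1, 32>>>(devA, devB, devD2);
  #include <cuda_fp16.h>
  #include <mma.h>
  using namespace nvcuda;

  __global__ void pairDist2Tile(const float3 *atomsA, const float3 *atomsB,
                                float *dist2 /* 16 x 16, row-major */)
  {
    __shared__ half  As[16 * 16];     // row i    = ( xi, yi, zi, 0, ... )
    __shared__ half  Bs[16 * 16];     // column j = ( xj, yj, zj, 0, ... )
    __shared__ float Cs[16 * 16];     // receives ai.bj
    __shared__ float normA[16], normB[16];
    const int lane = threadIdx.x;     // one warp (32 threads) per block

    // Lanes 0-15 each stage one row of A and one column of B, padding the
    // unused slots with zeros (the wasted FLOPs in question).
    if (lane < 16) {
      const float3 a = atomsA[lane];
      const float3 b = atomsB[lane];
      for (int k = 0; k < 16; k++) {
        As[lane * 16 + k] = __float2half(0.0f);
        Bs[k * 16 + lane] = __float2half(0.0f);
      }
      As[lane * 16 + 0] = __float2half(a.x);
      As[lane * 16 + 1] = __float2half(a.y);
      As[lane * 16 + 2] = __float2half(a.z);
      Bs[0 * 16 + lane] = __float2half(b.x);
      Bs[1 * 16 + lane] = __float2half(b.y);
      Bs[2 * 16 + lane] = __float2half(b.z);
      normA[lane] = a.x * a.x + a.y * a.y + a.z * a.z;
      normB[lane] = b.x * b.x + b.y * b.y + b.z * b.z;
    }
    __syncwarp();

    // One 16x16x16 tensor-core MMA; only 3 of the 16 inner slices carry
    // data in this layout, so the padding waste is even worse than 1/4.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aF;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bF;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cF;
    wmma::fill_fragment(cF, 0.0f);
    wmma::load_matrix_sync(aF, As, 16);
    wmma::load_matrix_sync(bF, Bs, 16);
    wmma::mma_sync(cF, aF, bF, cF);
    wmma::store_matrix_sync(Cs, cF, 16, wmma::mem_row_major);
    __syncwarp();

    // d2(i,j) = |ai|^2 - 2 ai.bj + |bj|^2
    for (int idx = lane; idx < 256; idx += 32) {
      dist2[idx] = normA[idx / 16] - 2.0f * Cs[idx] + normB[idx % 16];
    }
  }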

On Fri, May 14, 2021 at 12:28 AM David Cerutti <dscerutti.gmail.com> wrote:

> My feeling, which I have to qualify a great deal as I am not in the MD
> coding business at the moment, is that you could see great benefits from
> 8x32 tiles, perhaps dropping to 8x16 tiles for the tail end of your
> non-bonded blocks. The problem becomes one of computing the pair list,
> and I think a pair list in the general form of the one you have is
> something to keep. Dropping vdW in a strawman code will indeed earn the
> wages of Jack Schlitt--I can confirm, having done my own strawman of
> this. However, there are significant gains to be made if one can separate
> electrostatic and LJ interactions, owing to the nature of all those water
> models. TIP4P-based systems can go from 16 interactions per water pair to
> 10 (3x3 charge-charge pairs plus a single O-O Lennard-Jones pair, rather
> than all 4x4 site pairs), and TIP3P-based systems can benefit by keeping
> the electrostatic cutoff small while moving to a more generous vdW
> cutoff--think 8 A and 10 A, respectively. Relative to the existing pmemd,
> a good approach could accelerate the code by as much as 20%. The trick is
> to build the pair list rapidly. I would absolutely look into using 16-bit
> tensor cores to apply the distance masks. But, again, I'd target 8x32 and
> 8x16 tiles, the 8x16 shape being reserved for cases where fewer than 16
> atoms remain to interact with the home eight.
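>
> To make the tile shape concrete, here is a bare-bones sketch (my own
> illustration, not pmemd's kernel; the force model is collapsed to plain
> Coulomb and the names are invented): one warp takes an 8x32 tile, each
> lane owns one neighbor atom, and the eight home atoms are walked one at
> a time.
>
>   // Illustrative 8x32 tile: 8 "home" atoms against 32 neighbor atoms per
>   // warp.  Plain Coulomb only, no exclusions, no periodic imaging.
>   __device__ void tile8x32(const float4 *home,   // 8 atoms: x, y, z, q
>                            const float4 *neigh,  // 32 atoms: x, y, z, q
>                            float cutoff2,
>                            float3 *homeForce,    // 8 global accumulators
>                            float3 *neighForce)   // 32 global accumulators
>   {
>     const int lane = threadIdx.x & 31;
>     const float4 nj = neigh[lane];
>     float3 fj = make_float3(0.0f, 0.0f, 0.0f);
>     for (int i = 0; i < 8; i++) {
>       const float4 hi = home[i];
>       const float dx = nj.x - hi.x;
>       const float dy = nj.y - hi.y;
>       const float dz = nj.z - hi.z;
>       const float r2 = dx * dx + dy * dy + dz * dz;
>       // Force on neighbor j from home i (zero if outside the cutoff).
>       float3 fij = make_float3(0.0f, 0.0f, 0.0f);
>       if (r2 < cutoff2 && r2 > 0.0f) {
>         const float rinv  = rsqrtf(r2);
>         const float fscal = hi.w * nj.w * rinv * rinv * rinv;
>         fij = make_float3(fscal * dx, fscal * dy, fscal * dz);
>         fj.x += fij.x;  fj.y += fij.y;  fj.z += fij.z;
>       }
>       // Home atom i feels the opposite of the sum over all 32 lanes.
>       float sx = -fij.x, sy = -fij.y, sz = -fij.z;
>       for (int offset = 16; offset > 0; offset >>= 1) {
>         sx += __shfl_down_sync(0xffffffffu, sx, offset);
>         sy += __shfl_down_sync(0xffffffffu, sy, offset);
>         sz += __shfl_down_sync(0xffffffffu, sz, offset);
>       }
>       if (lane == 0) {
>         atomicAdd(&homeForce[i].x, sx);
>         atomicAdd(&homeForce[i].y, sy);
>         atomicAdd(&homeForce[i].z, sz);
>       }
>     }
>     atomicAdd(&neighForce[lane].x, fj.x);
>     atomicAdd(&neighForce[lane].y, fj.y);
>     atomicAdd(&neighForce[lane].z, fj.z);
>   }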
>
> I'm still thinking that cleaner code, resulting from a non-bonded
> strategy that makes the neighbor list re-arrangement the exception rather
> than the rule, would be an overall benefit: it gets more developers into
> the code by letting them interact with the atoms in an order they can
> understand. To do that, one would need the neutral-territory methods I
> outlined earlier in our private correspondence, and thread blocks of
> perhaps 3x448 or 2x704 per SM should fit within the register limit.
> Making the neighbor list arrangement local to the non-bonded kernel would
> require a __syncthreads() at the end of each neutral-territory region.
> That would hurt less if you stuck all of the 8x16 tiles at the back for
> idle warps to backfill as efficiently as possible, but I think each warp
> would get, on average, 4-6 8x32 tiles with a portion of 8x16 tiles to
> divvy up, which would probably cost the non-bonded kernel 10-15% and the
> code 4-5% of its overall speed. Again, accessibility versus performance.
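>
> Schematically (a control-flow skeleton only, with invented names, not a
> statement of how pmemd organizes its kernel): each block walks a list of
> neutral-territory regions, its warps backfill the region's 8x32 and 8x16
> tiles, and the __syncthreads() sits at the end of each region before the
> locally ordered forces are written back out.
>
>   // regionTileStart[r] .. regionTileStart[r+1]-1 index the tiles of
>   // region r.  The two device functions are stand-ins for the real work.
>   __device__ void evaluateTile(int /*tile*/)
>   {
>     // Stand-in for the 8x32 / 8x16 tile evaluation.
>   }
>   __device__ void writeRegionForces(int /*region*/)
>   {
>     // Stand-in for scattering the region's forces back to global order.
>   }
>
>   __global__ void kNeutralTerritory(const int *regionTileStart,
>                                     int nRegion)
>   {
>     for (int region = blockIdx.x; region < nRegion; region += gridDim.x) {
>       const int first = regionTileStart[region];
>       const int last  = regionTileStart[region + 1];
>       const int warpsPerBlock = blockDim.x >> 5;
>       const int warpId        = threadIdx.x >> 5;
>       // Warps grab tiles in order; the 8x16 leftovers sit at the back of
>       // the list so idle warps backfill them last.
>       for (int tile = first + warpId; tile < last; tile += warpsPerBlock) {
>         evaluateTile(tile);
>       }
>       __syncthreads();             // the cost under discussion: one
>       writeRegionForces(region);   // barrier per neutral-territory region
>       __syncthreads();
>     }
>   }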
>
> There are lots of things Amber can do to get better. DAC did very well to
> bring (and keep) people together and to produce the world's leading
> academic molecular dynamics tools. Whoever takes the reins next should
> give some serious thought to structural changes in the code base and in
> the collaborative nature of the consortium, however.
>
> Dave
>
>
> On Thu, May 13, 2021 at 11:00 PM Scott Le Grand <varelse2005.gmail.com>
> wrote:
>
>> I have some broad questions about this dance they danced with fmaf etc.
>> NVCC (also known as the code Cuisinart) will reduce your fancy
>> cycle-counted code to a DAG and then re-express it in its own vision of
>> optimality from the PTX. Attempting fine-grained cycle dances is rather
>> pointless. You can play with functional unit utilization and memory
>> throughput, and you can even do some warp specialization, but unless
>> you're coding in SASS, your intent will be re-expressed according to the
>> whims of the compiler. I have found sweating the details of data
>> structures and stealing tricks from good old-fashioned CS, good
>> old-fashioned CG, or even my SASA paper to be much more fruitful avenues
>> of pursuit. That said, if you're talking about ANTON hardware, oh yeah,
>> brilliant work, and it's sad no one has built a competitor to it.
>>
>> I was told recently that the cool kids zero out LJ interactions for
>> solvent. So I coded up a strawman. It gets maybe a 2% speedup overall,
>> at the expense of a lot more bookkeeping.
>>
>> The newest thing is to drop exclusions entirely and just subtract them
>> later. But that incurs associative roundoff error if one is not careful.
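>>
>> As I read that scheme (a sketch under my own assumptions, not anyone's
>> published kernel): the direct-space loop treats every pair inside the
>> cutoff identically, and a separate pass over the exclusion list
>> subtracts the full qi*qj/r for each excluded pair, cancelling both the
>> erfc term the loop just added and the erf term the reciprocal sum
>> contributes. The roundoff danger is that excluded pairs sit at short
>> range, so the added and subtracted terms are both large and their
>> cancellation amplifies rounding error unless the accumulation is done
>> carefully (fixed point a la SPFP, capped magnitudes, or similar).
>>
>>   // Illustration only: energies in units where Coulomb's constant is 1,
>>   // periodic imaging omitted for brevity.
>>   #include <cmath>
>>
>>   // Direct-space Ewald energy with NO exclusion masking: every pair
>>   // inside the cutoff contributes qi*qj*erfc(beta*r)/r.
>>   double directSumNoExclusions(const double *q, const double *x,
>>                                const double *y, const double *z,
>>                                int natom, double beta, double rcut)
>>   {
>>     double e = 0.0;
>>     for (int i = 0; i < natom - 1; i++) {
>>       for (int j = i + 1; j < natom; j++) {
>>         const double dx = x[j] - x[i];
>>         const double dy = y[j] - y[i];
>>         const double dz = z[j] - z[i];
>>         const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
>>         if (r < rcut) {
>>           e += q[i] * q[j] * std::erfc(beta * r) / r;
>>         }
>>       }
>>     }
>>     return e;
>>   }
>>
>>   // Correction pass: each excluded pair (i, j) gives back the full
>>   // Coulomb term qi*qj/r, cancelling the erfc piece added above plus
>>   // the erf piece the reciprocal-space sum supplies for that pair.
>>   double exclusionCorrection(const double *q, const double *x,
>>                              const double *y, const double *z,
>>                              const int (*excl)[2], int nexcl)
>>   {
>>     double e = 0.0;
>>     for (int k = 0; k < nexcl; k++) {
>>       const int i = excl[k][0];
>>       const int j = excl[k][1];
>>       const double dx = x[j] - x[i];
>>       const double dy = y[j] - y[i];
>>       const double dz = z[j] - z[i];
>>       const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
>>       e -= q[i] * q[j] / r;   // large terms cancelling large terms: the
>>     }                         // source of the associative roundoff worry
>>     return e;
>>   }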
>>
>> What I need here are the implementation details. You and I are both
>> toying with dropping neighbor lists entirely, but in different ways.
>>
>> And it's not that I think their tricks aren't kosher; it's that they
>> should be disclosed up front rather than leaving people to dig for them.
>>
>> Scott
>>
>> On Thu, May 13, 2021 at 3:03 PM David Cerutti <dscerutti.gmail.com>
>> wrote:
>>
>> > For reference, here are the benchmarks that I think people are talking
>> > about:
>> > Desmond-GPU Performance as of April 2021.pdf (deshawresearch.com)
>> > <https://www.deshawresearch.com/publications/Desmond-GPU%20Performance%20as%20of%20April%202021.pdf>
>> >
>> > Desmond uses a different long-ranged summation, the "U-series," which
>> > was a bit of a dance of the seven veils and then turned out to be very
>> > similar to other P3M techniques, SPME included. The U-series is how
>> > they get to 8 fs between updates to the long-ranged component of the
>> > electrostatics. Regardless of what it is, though, I'll say that my own
>> > experience with multiple time stepping (see mdgx GB) doesn't leave
>> > much room to go higher than 5 fs in any component of the force. Long
>> > ago, circa 2015, their DHFR benchmark was much faster than Amber's
>> > (the 50% Scott is alluding to), a ratio they seem to have maintained
>> > over the years, but it's now more in line with the rest of the
>> > benchmarks--one can compute the number of atoms moved by the code in a
>> > given time and see that the STMV case is, indeed, moving substantially
>> > more than DHFR. It's pretty impressive that they can do ten million
>> > atoms, but of course that's more of a stunt (I would have been more
>> > inclined to do eight virions in a cube). That said, the Desmond folks
>> > do some pretty wild orchestration of how many fmafs and other
>> > arithmetic ops they can pull out of each cycle, so while their numbers
>> > may be tweaked according to any given standard, my feeling is that
>> > "sales" are not a big incentive for them to cook the books.
>> >
>> > You can surely get more performance out of pmemd on the smaller
>> > systems if you have it simulate multiple systems at one time. 2300 ns
>> > per day with DHFR on one of the top-end Ampere cards shouldn't be out
>> > of the question. This should be one of the highest priorities in any
>> > renovation of the engine, as most pharma outfits study problems of
>> > 25-50k atoms, must run many windows before getting a single answer,
>> > and always have more compounds to test than GPUs to run them. I would
>> > also point out that anything happening to pmemd's CUDA component is
>> > stuck behind some very old Fortran code, with Pieces of a System
>> > flying around in a manner that's almost as depressing as the film with
>> > Vanessa Kirby. Rebuild the 100k lines of Fortran in C++ with
>> > accessible, well-engineered structs that are hard to break.
>> > Topologies, coordinates, and simulation protocols can all be structs
>> > passed around and created or destroyed as needed by a protocol. Give
>> > them each pointer structs that can be copied to the GPU in a manner
>> > analogous to cSim today, or preferably multiple, focused pointer
>> > structs that become kernel arguments when the actual kernel is
>> > launched (the long-ranged electrostatic kernel doesn't need to know
>> > about the bonded parameter constants, for example--a Prmtop struct can
>> > have multiple pointer substructures tailored for different parts of
>> > the force calculation). Make the kernels that produce work units
>> > operate on arrays of such structs, so that a force kernel will
>> > seamlessly stride from one system to the next as it plays its part in
>> > any given time step. You should const as much as possible, but const
>> > auto may be something to use sparingly, so that new developers become
>> > immersed in the actual nuts and bolts of the code by seeing the actual
>> > data types. That will give upcoming graduate students more to work
>> > with and help them understand the CUDA code as something much more
>> > C/C++-like.
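>> >
>> > As a rough illustration of the layout I mean (NonbondedArgs and
>> > kNonbondedForce are hypothetical names, not anything in pmemd): one
>> > small, focused pointer struct per kernel family, passed by value as a
>> > kernel argument, with one entry per simulated system so the force
>> > kernel can stride across systems within a single launch.
>> >
>> >   struct NonbondedArgs {
>> >     int      natom;   // atoms in this system
>> >     double4 *crd;     // x, y, z in the first three slots, charge in .w
>> >     int     *ljIdx;   // Lennard-Jones type index per atom
>> >     double  *frc;     // force accumulators, 3 * natom
>> >   };
>> >
>> >   __global__ void kNonbondedForce(const NonbondedArgs *sys, int nSys)
>> >   {
>> >     // One block per system for simplicity; a production kernel would
>> >     // walk a list of work units spanning all of the systems.
>> >     for (int s = blockIdx.x; s < nSys; s += gridDim.x) {
>> >       const NonbondedArgs a = sys[s];
>> >       for (int i = threadIdx.x; i < a.natom; i += blockDim.x) {
>> >         // Placeholder body: clear the accumulators.  The real kernel
>> >         // would evaluate this system's pair interactions here.
>> >         a.frc[3 * i]     = 0.0;
>> >         a.frc[3 * i + 1] = 0.0;
>> >         a.frc[3 * i + 2] = 0.0;
>> >       }
>> >     }
>> >   }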
>> >
>> > Don't gnash your teeth over what DE Shaw's guys have achieved. The
>> > things that drive sales are utility and unique capabilities, two
>> > things that Amber has done pretty well with despite being the product
>> > of a handful of research groups who mostly prefer to see everyone
>> > staying in their respective lanes. Standardize what a "topology" is
>> > and make a clean, efficient, extensible tool for creating systems.
>> > That should be the first stop for anyone thinking of adding new
>> > components to the force field or a free energy protocol. Document the
>> > hell out of everything. Stop relying on one Bob, or Scott, or me, or
>> > Taisung, or Scott again to MakeABetterEngine.cu. That needs to be a
>> > community activity, and it will improve the employment prospects of
>> > your students to have them involved in professional python / C++ /
>> > CUDA programming. Be honest about your benchmarks and make a new
>> > section of the website as an exposition of Amber's free energy
>> > capabilities. It shouldn't take five years for advertising that
>> > doesn't support the group interest to be taken off the website, or for
>> > a researcher with unique ideas and much stronger associations to the
>> > consortium to finally get priority over an undergraduate who left the
>> > group years earlier. Even an academic organization with $350,000
>> > annual revenue shouldn't continue to rely on a former member to donate
>> > his time and money just to keep their CI up and running, regardless of
>> > his generosity in doing so. The DE Shaw Group is a professional
>> > organization of extremely talented, probably overworked individuals
>> > united by their goals of advancing molecular simulations. Stop
>> > comparing the benchmarks unless you want to start comparing the
>> > organizations.
>> >
>> > Dave
>> >
>> >
>> > On Thu, May 13, 2021 at 4:48 PM Scott Le Grand <varelse2005.gmail.com>
>> > wrote:
>> >
>> > > To me, it's a sales trick until they demonstrate numerical stability
>> > > to the level Ross and I did with SPFP and SPDP. Have they? But even
>> > > if it's not that stable, at least customers can make an informed
>> > > choice with such data, no? Also, how often are they rebuilding the
>> > > neighbor list? Is it a fixed interval like GROMACS, or is there a
>> > > skin test?
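>> > >
>> > > (A generic version of such a skin test, purely as illustration, not
>> > > code from pmemd or GROMACS: with a list cutoff rList = rCut + skin,
>> > > the list stays valid until some atom has moved more than skin/2 from
>> > > where it sat at the last rebuild.)
>> > >
>> > >   // Return true when the Verlet list must be rebuilt, given current
>> > >   // coordinates and those saved at the last build.  Periodic
>> > >   // wrapping is ignored for brevity.
>> > >   bool needRebuild(const double *x, const double *y, const double *z,
>> > >                    const double *x0, const double *y0,
>> > >                    const double *z0, int natom, double skin)
>> > >   {
>> > >     const double limit2 = 0.25 * skin * skin;   // (skin / 2)^2
>> > >     for (int i = 0; i < natom; i++) {
>> > >       const double dx = x[i] - x0[i];
>> > >       const double dy = y[i] - y0[i];
>> > >       const double dz = z[i] - z0[i];
>> > >       if (dx * dx + dy * dy + dz * dz > limit2) {
>> > >         return true;    // half the skin consumed: rebuild
>> > >       }
>> > >     }
>> > >     return false;
>> > >   }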
>> > >
>> > > I am rethinking all of this currently, and I have friends who think
>> > > neighbor lists are obsolete if we move to higher timesteps and
>> > > larger non-bonded cutoffs, but that brings us to the question of how
>> > > we handle exclusions, and that's a rabbit hole. But...
>> > > coincidentally, SPFP's perfect force conservation can let you add
>> > > and subtract them if you cap their magnitudes or use some variant of
>> > > softcore to control the dynamic range. But are they doing anything
>> > > like this? Details are everything!
>> > >
>> > > On Thu, May 13, 2021 at 1:39 PM Michael R Shirts
>> > > <Michael.Shirts.colorado.edu> wrote:
>> > >
>> > > > > and they skipped calculating the Ewald Sum every other
>> > > > > iteration (thanks Adrian!).
>> > > >
>> > > > In their semi-defense, IIRC, their default on all Desmond
>> > > > simulations for a while has been to do multiple time stepping of
>> > > > the forces, including the Ewald sum every other timestep. It's not
>> > > > entirely clear to me whether this is sufficiently accurate, and
>> > > > they definitely should make it clearer that they are doing
>> > > > something different, but it's a valid approach (that more people
>> > > > should be investigating!) and it's not just a sales trick. Not
>> > > > that there aren't also sales tricks out there.
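>> > > >
>> > > > For concreteness, a bare-bones sketch of that idea (my own
>> > > > illustration under stated assumptions, not Desmond's or Amber's
>> > > > actual integrator; State, fastForce, and slowForce are invented
>> > > > names): the caller supplies the fast and slow force routines, the
>> > > > slow one is only re-evaluated on even steps, and its last value is
>> > > > reused in between.
>> > > >
>> > > >   #include <cstddef>
>> > > >   #include <vector>
>> > > >
>> > > >   // Assumes fFast and fSlow were filled once before the first
>> > > >   // call; a single mass is used for brevity.
>> > > >   struct State {
>> > > >     std::vector<double> x, v, fFast, fSlow;
>> > > >     double mass;
>> > > >   };
>> > > >
>> > > >   template <typename FastF, typename SlowF>
>> > > >   void mtsStep(State &s, double dt, long step,
>> > > >                FastF fastForce, SlowF slowForce) {
>> > > >     const std::size_t n = s.x.size();
>> > > >     for (std::size_t i = 0; i < n; i++) {
>> > > >       s.v[i] += 0.5 * dt * (s.fFast[i] + s.fSlow[i]) / s.mass;
>> > > >       s.x[i] += dt * s.v[i];
>> > > >     }
>> > > >     fastForce(s);            // bonded + short-ranged, every step
>> > > >     if (step % 2 == 0) {
>> > > >       slowForce(s);          // reciprocal-space sum, every other
>> > > >     }                        // step; held constant in between
>> > > >     for (std::size_t i = 0; i < n; i++) {
>> > > >       s.v[i] += 0.5 * dt * (s.fFast[i] + s.fSlow[i]) / s.mass;
>> > > >     }
>> > > >   }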
>> > > >
>> > > > Best,
>> > > > ~~~~~~~~~~~~~~~~
>> > > > Michael Shirts
>> > > > Associate Professor
>> > > > michael.shirts.colorado.edu
>> > > > http://www.colorado.edu/lab/shirtsgroup/
>> > > > Phone: (303) 735-7860
>> > > > Office: JSCBB C123
>> > > > Department of Chemical and Biological Engineering
>> > > > University of Colorado Boulder
>> > > >
>> > > >
>> > > > On 5/13/21, 1:27 PM, "Scott Le Grand" <varelse2005.gmail.com>
>> > > > wrote:
>> > > >
>> > > > So, we're all getting our knickers in a bunch over an
>> > > > apples-to-oranges Desmond-to-AMBER performance comparison.
>> > > >
>> > > > Please don't...
>> > > >
>> > > > They cheated, because that's what they do to keep their investors
>> > > > happy. They used a 32^3 grid, and they skipped calculating the
>> > > > Ewald sum every other iteration (thanks, Adrian!). Rather than get
>> > > > upset here, point and laugh at DE Shaw et al. for being afraid to
>> > > > go head to head with AMBER, and if they do (and they won't,
>> > > > because they're chicken, bawk bawk bawk), we have the people to
>> > > > address that as well.
>> > > >
>> > > > At our end, there's a ~50% or so performance deficit in AMBER 20
>> > > > that we need to fix. I've already fixed 2/3 of that building PMEMD
>> > > > 2.0 (770 ns/day on DHFR at 2 fs already). Let them prance about
>> > > > with their greasy kid stuff of desperate approximations and
>> > > > cheats; SPFP remains performance and accuracy without compromise,
>> > > > and if they want to pick a fight with SPFP, make them do the work
>> > > > to demonstrate equivalent numerical stability (spoilers: they
>> > > > won't, because they can't, but oh, the bellyaching and handwaving
>> > > > they are willing to do--just watch).
>> > > >
>> > > > Scott
>> > > > _______________________________________________
>> > > > AMBER-Developers mailing list
>> > > > AMBER-Developers.ambermd.org
>> > > > http://lists.ambermd.org/mailman/listinfo/amber-developers
>> > > >
>> > > > _______________________________________________
>> > > > AMBER-Developers mailing list
>> > > > AMBER-Developers.ambermd.org
>> > > > http://lists.ambermd.org/mailman/listinfo/amber-developers
>> > > >
>> > > _______________________________________________
>> > > AMBER-Developers mailing list
>> > > AMBER-Developers.ambermd.org
>> > > http://lists.ambermd.org/mailman/listinfo/amber-developers
>> > >
>> > _______________________________________________
>> > AMBER-Developers mailing list
>> > AMBER-Developers.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber-developers
>> >
>> _______________________________________________
>> AMBER-Developers mailing list
>> AMBER-Developers.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri May 14 2021 - 02:00:02 PDT