Re: [AMBER-Developers] Desmond SUX

From: Scott Le Grand <varelse2005.gmail.com>
Date: Fri, 14 May 2021 05:00:33 -0700

So Amber already has 8x32 tiles, and has had them for a very long time. They
accelerate smaller systems by three to five percent, but the expense of the
increased memory bandwidth needed to generate the 8x32 tiles outweighs the
benefit for systems like STMV, and doubly so because the 4 fs time step forces
the neighbor list to be rebuilt every other iteration.

And I really wanted to use the tensor cores because the tensor cores are
cool, but that involves reading in all the coordinates. Why bother when I
have a six-bit spatial subhash within each non-bond cell that I can treat as
a voxel? It travels along for free with the cell ID, and one can build a
trivial accept mask and a trivial reject mask from it. This ought to cut
memory bandwidth by close to half and significantly reduce the number of
atoms that have to be distance tested. It will also provide a tighter hull
for the union of spheres of the atoms in a non-bond tile. It may, in fact,
reduce the number of atoms considered by nearly a factor of three.
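
For concreteness, here is a minimal sketch of the kind of voxel test I have
in mind. The two-bits-per-axis packing and every name below are purely
illustrative, not the actual pmemd data layout:

// Sketch only: pack a six-bit subhash (2 bits per axis, i.e. a 4x4x4 voxel
// grid inside each non-bond cell) into the low bits of the cell ID, then
// classify a voxel pair as accept / reject / needs-testing without reading
// any coordinates.
__device__ __forceinline__ unsigned int packCellId(unsigned int cell,
                                                   unsigned int vx,
                                                   unsigned int vy,
                                                   unsigned int vz)
{
    return (cell << 6) | (vx << 4) | (vy << 2) | vz;
}

enum { VOXEL_REJECT = 0, VOXEL_ACCEPT = 1, VOXEL_TEST = 2 };

__device__ int classifyVoxelPair(unsigned int idA, unsigned int idB,
                                 float3 cellOffset,  // center of cell A -> center of cell B
                                 float  cellWidth,   // edge length of a non-bond cell
                                 float  cutoff)
{
    const float vw   = 0.25f * cellWidth;            // voxel edge (4 per axis)
    const float slop = 1.7320508f * vw;              // sqrt(3)*vw: both half-diagonals

    // Voxel centers, each relative to its own cell center
    float ax = (((idA >> 4) & 3) + 0.5f) * vw - 0.5f * cellWidth;
    float ay = (((idA >> 2) & 3) + 0.5f) * vw - 0.5f * cellWidth;
    float az = (( idA       & 3) + 0.5f) * vw - 0.5f * cellWidth;
    float bx = (((idB >> 4) & 3) + 0.5f) * vw - 0.5f * cellWidth + cellOffset.x;
    float by = (((idB >> 2) & 3) + 0.5f) * vw - 0.5f * cellWidth + cellOffset.y;
    float bz = (( idB       & 3) + 0.5f) * vw - 0.5f * cellWidth + cellOffset.z;

    float dx = bx - ax, dy = by - ay, dz = bz - az;
    float d  = sqrtf(dx * dx + dy * dy + dz * dz);    // voxel center-to-center distance

    if (d - slop >  cutoff) return VOXEL_REJECT;      // no atom pair can be within the cutoff
    if (d + slop <= cutoff) return VOXEL_ACCEPT;      // every atom pair is within the cutoff
    return VOXEL_TEST;                                // a real distance test is still needed
}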

But I still wanted to use the fancy new hardware. I think it will have its
place in the Ewald sum, once again making it too fast to scale to multiple
processors, now that multi-GPU FFTs exist. Sadly, right now their performance
is highly variable and they offer no benefit for any dimension less than 128.

On Fri, May 14, 2021, 01:37 David Cerutti <dscerutti.gmail.com> wrote:

> So 8x8 tiles are actually the best way to go here: each warp does two of
> them at a time. This is what I was saying earlier.
>
> But how do we best use the tensor cores to build the pair list? If we have
> four atoms with coordinates a1 = { x1 y1 z1 }, a2 = { x2 y2 z2 }, ..., a4 =
> { x4 y4 z4 } and another four b5 = { x5 y5 z5 }, ..., b8 = { x8 y8 z8 },
> then we can make matrices A = [ x1 y1 z1 0 ; x2 y2 z2 0 ; ... ; x4 y4 z4 0 ]
> and B = [ x5 x6 x7 x8 ; y5 y6 y7 y8 ; z5 z6 z7 z8 ; 0 0 0 0 ]. The tensor
> cores give us -2*A*B in one shot; adding |ai|^2 (the diagonal of A*(A')) to
> row i and |bj|^2 (the diagonal of (B')*B) to column j then yields C, with
> C(i,j) = |ai|^2 - 2*ai.bj + |bj|^2 holding the squared interatomic
> distances. Is there anything more efficient we can do, to squeeze out the
> last 1/4 of the FLOPS in that tensor computation?
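>
> A scalar sketch of that algebra, just to pin down the indexing (this is not
> tensor-core code and the function name is made up; only the -2*A*B inner
> product is the piece a tensor core would actually perform):
>
> __device__ void squaredDistanceTile4x4(const float4 a[4],  // .w is the zero pad
>                                        const float4 b[4],
>                                        float c[4][4])
> {
>     // Row norms |ai|^2 and column norms |bj|^2
>     float an[4], bn[4];
>     for (int i = 0; i < 4; i++) {
>         an[i] = a[i].x * a[i].x + a[i].y * a[i].y + a[i].z * a[i].z;
>         bn[i] = b[i].x * b[i].x + b[i].y * b[i].y + b[i].z * b[i].z;
>     }
>     // The -2*A*B piece: a 4x4x4 GEMM (the zero pad is the fourth K slice)
>     // is what would map onto the tensor cores.
>     for (int i = 0; i < 4; i++) {
>         for (int j = 0; j < 4; j++) {
>             float dot = a[i].x * b[j].x + a[i].y * b[j].y + a[i].z * b[j].z;
>             c[i][j] = an[i] - 2.0f * dot + bn[j];   // = |ai - bj|^2
>         }
>     }
> }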
>
> On Fri, May 14, 2021 at 12:28 AM David Cerutti <dscerutti.gmail.com>
> wrote:
>
> > My feeling, which I have to qualify a great deal as I am not in the MD
> > coding business for now, is that you could see great benefits from 8x32
> > tiles, perhaps dropping down to 8x16 tiles for the tail end of your
> > non-bonded blocks. The problem becomes one of computing the pairlist, and
> > I think a pairlist in the general form of the one you have is something
> > to keep. Dropping vdW in a strawman code will indeed earn the wages of
> > Jack Schlitt; I did my own strawman of this and can confirm it. However,
> > there are significant gains to be made if one can separate electrostatic
> > and LJ interactions, due to the nature of all those water models.
> > TIP4P-based systems can go from 16 interactions per water to 10 (each
> > water has three charged sites and one LJ site, so a pair of waters needs
> > 3 x 3 electrostatic pairs plus one vdW pair rather than 4 x 4), and
> > TIP3P-based systems can benefit from keeping the electrostatic cutoff
> > small while moving to a reasonable vdW cutoff--think 8 A and 10 A,
> > respectively. Relative to the existing pmemd, a good approach could
> > accelerate the code by as much as 20%. The trick is to build the pairlist
> > rapidly. I would absolutely look into using 16-bit tensor cores to apply
> > the distance masks. But, again, I'd target 8x16 and 8x32 tiles, with the
> > 8x16 case reserved for situations where there are fewer than 16 atoms
> > left to interact with the home eight.
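> >
> > For the sake of argument, here is a brain-dead O(N^2) sketch of the split
> > cutoff idea (the names and data layout are made up, and a real engine
> > would of course do this cell by cell rather than over all pairs):
> >
> > #include <utility>
> > #include <vector>
> >
> > struct Site { float x, y, z, q; bool hasLJ; };
> >
> > // Separate electrostatic and vdW pair lists, each with its own cutoff
> > // (e.g. 8 A and 10 A), skipping chargeless or LJ-less sites.
> > void buildSplitLists(const std::vector<Site>& sites,
> >                      float elecCut, float vdwCut,
> >                      std::vector<std::pair<int, int> >& elecList,
> >                      std::vector<std::pair<int, int> >& vdwList)
> > {
> >     const float ec2 = elecCut * elecCut;
> >     const float vc2 = vdwCut * vdwCut;
> >     for (size_t i = 0; i < sites.size(); i++) {
> >         for (size_t j = i + 1; j < sites.size(); j++) {
> >             float dx = sites[i].x - sites[j].x;
> >             float dy = sites[i].y - sites[j].y;
> >             float dz = sites[i].z - sites[j].z;
> >             float r2 = dx * dx + dy * dy + dz * dz;
> >             if (r2 < ec2 && sites[i].q != 0.0f && sites[j].q != 0.0f)
> >                 elecList.push_back(std::make_pair((int)i, (int)j));
> >             if (r2 < vc2 && sites[i].hasLJ && sites[j].hasLJ)
> >                 vdwList.push_back(std::make_pair((int)i, (int)j));
> >         }
> >     }
> > }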
> >
> > I'm still thinking that cleaner code, resulting from a non-bonded
> > strategy that makes neighbor list re-arrangement the exception rather
> > than the rule, would be an overall benefit: it is easier to bring more
> > developers into the code when they can interact with the atoms in an
> > order they understand. To do that, one would need the neutral territory
> > methods that I outlined earlier in our private correspondence, and thread
> > blocks of perhaps 3x448 per SM or 2x704 per SM should fit within the
> > register limit. Making the neighborlist arrangement unique to the
> > non-bonded kernel would require a __syncthreads() at the end of each
> > neutral territory region. That would hurt less if you stuck all of the
> > 8x16 tiles at the back for idle warps to backfill in the most efficient
> > manner possible, but I think each warp would get on average 4-6 8x32
> > tiles with a portion of 8x16 tiles to divvy up, which would probably cost
> > the non-bonded kernel 10-15% and the code 4-5% of its overall speed.
> > Again, accessibility versus performance.
> >
> > There are lots of things Amber can do to get better. DAC did very well to
> > bring (and keep) people together, and to produce the world's leading
> > academic molecular dynamics tools. Whoever takes the reins next should
> > give some serious thought to structural changes in the code base and the
> > collaborative nature of the consortium, however.
> >
> > Dave
> >
> >
> > On Thu, May 13, 2021 at 11:00 PM Scott Le Grand <varelse2005.gmail.com>
> > wrote:
> >
> >> I have some broad questions about this dance they danced with fmaf and
> >> the like. NVCC (also known as the code Cuisinart) will reduce your fancy
> >> cycle-counted code to a DAG and then re-express it in its own vision of
> >> optimality from the PTX. Attempting fine-grained cycle dances is rather
> >> pointless. You can play with functional unit utilization and memory
> >> throughput, and you can even do some warp specialization, but unless
> >> you're coding in SASS, your intent will be re-expressed according to the
> >> whims of the compiler. I have found that sweating the details of the
> >> data structures and stealing tricks from good old-fashioned CS, good
> >> old-fashioned CG, or even my SASA paper are much more fruitful avenues
> >> of pursuit. That said, if you're talking about ANTON hardware, oh yeah,
> >> brilliant work, and it's sad no one has built a competitor to it.
> >>
> >> I got told recently that the cool kids zero out LJ interactions for
> >> solvent, so I coded up a strawman. It gets maybe a 2% speedup overall,
> >> at the expense of a lot more bookkeeping.
> >>
> >> The newest thing is to drop exclusions entirely and just subtract them
> >> later. But that has associative roundoff error if one is not careful.
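> >>
> >> A toy, CPU-side illustration of the point (not SPFP itself; the scaling
> >> constant and the names are made up) showing why fixed-point accumulation
> >> lets you take the exclusions back out exactly, in any order:
> >>
> >> #include <cmath>
> >> #include <vector>
> >>
> >> // Hypothetical fixed-point scaling, in the spirit of SPFP's accumulators
> >> static const double FORCE_SCALE = double(1LL << 40);
> >>
> >> inline long long toFixed(double f)        { return (long long) std::llrint(f * FORCE_SCALE); }
> >> inline double    fromFixed(long long acc) { return double(acc) / FORCE_SCALE; }
> >>
> >> // Sum every pair force on an atom with no exclusion test, then subtract
> >> // the excluded pairs afterwards.  Both passes quantize each pair force
> >> // to the same integer, so the cancellation is exact regardless of the
> >> // order in which anything was summed.
> >> double forceOnAtomX(const std::vector<double>& pairFx,
> >>                     const std::vector<int>& excludedIdx)
> >> {
> >>     long long acc = 0;
> >>     for (size_t k = 0; k < pairFx.size(); k++)      acc += toFixed(pairFx[k]);
> >>     for (size_t k = 0; k < excludedIdx.size(); k++) acc -= toFixed(pairFx[excludedIdx[k]]);
> >>     return fromFixed(acc);
> >> }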
> >>
> >> What I need to know here are the implementation details. You and I are
> >> both toying with dropping neighbor lists entirely, but in different ways.
> >>
> >> And it's not that I don't think their tricks are kosher so much as I
> >> think they should be disclosed up front rather than being something one
> >> has to dig for.
> >>
> >> Scott
> >>
> >> On Thu, May 13, 2021 at 3:03 PM David Cerutti <dscerutti.gmail.com>
> >> wrote:
> >>
> >> > For reference, here are the benchmarks that I think people are talking
> >> > about:
> >> > Desmond-GPU Performance as of April 2021.pdf (deshawresearch.com)
> >> > <https://www.deshawresearch.com/publications/Desmond-GPU%20Performance%20as%20of%20April%202021.pdf>
> >> >
> >> > Desmond uses a different long-ranged summation, the "U-series," which
> >> > was a bit of a dance of the seven veils and then turned out to be very
> >> > similar to other P3M techniques, SPME included. The U-series was the
> >> > way they got to 8 fs between updates to the long-ranged component of
> >> > the electrostatics. Regardless of what it is, though, I'll say that my
> >> > own experiences in multiple time stepping (see mdgx GB) don't leave
> >> > much room to go higher than 5 fs in any component of the force. Long
> >> > ago, circa 2015, their DHFR benchmark was much faster than Amber (the
> >> > 50% Scott is alluding to), a ratio they seem to have maintained over
> >> > the years, but it's now more in line with the rest of the
> >> > benchmarks--one can compute the number of atoms moved by the code in a
> >> > given time and see that the STMV case is, indeed, moving substantially
> >> > more than DHFR. It's pretty impressive that they can do ten million
> >> > atoms, but of course that's more of a stunt (I would have been more
> >> > inclined to do eight virions in a cube). That said, the Desmond folks
> >> > do some pretty wild orchestration of how many fmafs and other
> >> > arithmetic ops they can pull out of each cycle, so while their numbers
> >> > may be tweaked according to any given standard, my feeling is that
> >> > "sales" are not a big incentive for them to cook the books.
> >> >
> >> > You can surely get more performance out of pmemd on the smaller
> >> > systems if you multiply the systems it simulates at one time. 2300 ns
> >> > per day with DHFR on one of the top-end Ampere cards shouldn't be out
> >> > of the question. This should be one of the highest priorities in any
> >> > renovations to the engine, as most pharma outfits study problems of
> >> > 25-50k atoms, must run many windows before getting a single answer,
> >> > and always have more compounds to test than GPUs to do it. I would
> >> > also point out that anything happening to pmemd's CUDA component is
> >> > stuck behind some very old Fortran code, with Pieces of a System
> >> > flying around in a manner that's almost as depressing as the film with
> >> > Vanessa Kirby. Rebuild the 100k lines of Fortran in C++ with
> >> > accessible, well-engineered structs that are hard to break.
> >> > Topologies, coordinates, and simulation protocols can all be structs
> >> > passed around and created or destroyed as needed by a protocol. Give
> >> > them each pointer structs that can be copied to the GPU in a manner
> >> > analogous to cSim today, or preferably multiple, focused pointer
> >> > structs that become kernel arguments when the actual kernel is
> >> > launched (the long-ranged electrostatic kernel doesn't need to know
> >> > about the bonded parameter constants, for example--a Prmtop struct can
> >> > have multiple pointer substructures tailored for different parts of
> >> > the force calculation). Make the kernels for producing work units
> >> > operate on arrays of such structs, so that a force kernel will
> >> > seamlessly stride from one system to the next as it plays its part in
> >> > any given time step. You should const as much as possible, but const
> >> > auto may be something to use sparingly, so that new developers become
> >> > immersed in the actual nuts and bolts of the code by seeing the actual
> >> > data types. That will give upcoming graduate students more to work
> >> > with and help them to understand the CUDA code as something much more
> >> > C / C++ -like.
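> >> >
> >> > A skeletal sketch of what I mean by focused pointer structs (every
> >> > name here is made up for illustration; this is not an existing Amber
> >> > interface):
> >> >
> >> > // The topology owns the device arrays; each kernel takes only the
> >> > // view it needs, passed by value as a kernel argument.
> >> > struct NonbondedView {            // what the direct-space kernel needs
> >> >   const float4* crdq;             // coordinates plus charges
> >> >   const int*    ljTypes;
> >> >   const float2* ljTables;
> >> >   int           nAtoms;
> >> > };
> >> >
> >> > struct BondedView {               // what the bonded-term kernel needs
> >> >   const int2*   bondAtoms;
> >> >   const float2* bondParams;
> >> >   int           nBonds;
> >> > };
> >> >
> >> > struct Prmtop {                   // host-side owner of the device data
> >> >   NonbondedView nb;
> >> >   BondedView    bonded;
> >> >   // constructors / destructors would cudaMalloc and cudaFree the arrays
> >> > };
> >> >
> >> > __global__ void directSpaceKernel(NonbondedView nb)
> >> > {
> >> >   // The argument replaces a monolithic cSim-style constant struct, and
> >> >   // this kernel never sees the bonded parameters at all.
> >> > }
> >> >
> >> > // Launch for one system:  directSpaceKernel<<<grid, block>>>(top.nb);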
> >> >
> >> > Don't gnash your teeth over what DE Shaw's guys have achieved. The
> >> > things that drive sales are utility and unique capabilities, two
> >> > things that Amber has done pretty well with despite being the product
> >> > of a handful of research groups who mostly prefer to see everyone
> >> > staying in their respective lanes. Standardize what a "topology" is
> >> > and make a clean, efficient, extensible tool for creating systems.
> >> > That should be the first stop for anyone thinking of adding new
> >> > components to the force field or a free energy protocol. Document the
> >> > hell out of everything. Stop relying on one Bob, or Scott, or me, or
> >> > Taisung, or Scott again to MakeABetterEngine.cu. That needs to be a
> >> > community activity, and it will improve the employment prospects of
> >> > your students to have them involved in professional python / C++ /
> >> > CUDA programming. Be honest about your benchmarks and make a new
> >> > section of the website as an exposition of Amber's free energy
> >> > capabilities. It shouldn't take five years for advertising that
> >> > doesn't support the group interest to be taken off the website, or for
> >> > a researcher with unique ideas and much stronger associations to the
> >> > consortium to finally get priority over an undergraduate who left the
> >> > group years earlier. Even an academic organization with $350,000
> >> > annual revenue shouldn't continue to rely on a former member to donate
> >> > his time and money just to keep their CI up and running, regardless of
> >> > his generosity in doing so. The DE Shaw Group is a professional
> >> > organization of extremely talented, probably overworked individuals
> >> > united by their goals of advancing molecular simulations. Stop
> >> > comparing the benchmarks unless you want to start comparing the
> >> > organizations.
> >> >
> >> > Dave
> >> >
> >> >
> >> > On Thu, May 13, 2021 at 4:48 PM Scott Le Grand <varelse2005.gmail.com>
> >> > wrote:
> >> >
> >> > > To me, it's a sales trick until they demonstrate numerical stability
> >> > > to the level Ross and I did with SPFP and SPDP. Have they? But even
> >> > > if it's not that stable, at least customers can make an informed
> >> > > choice with such data, no? Also, how often are they rebuilding the
> >> > > neighbor list? Is it a fixed interval like GROMACS or is there a
> >> > > skin test?
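> >> > >
> >> > > (For the record, by a skin test I mean the usual Verlet-skin
> >> > > criterion. A generic sketch, not anyone's production code: the list
> >> > > is built with cutoff + skin, and rebuilt only when some atom has
> >> > > drifted more than skin / 2 since the last build.)
> >> > >
> >> > > __global__ void skinTest(const float4* crd, const float4* crdAtLastBuild,
> >> > >                          int nAtoms, float halfSkin, int* rebuildFlag)
> >> > > {
> >> > >     int i = blockIdx.x * blockDim.x + threadIdx.x;
> >> > >     if (i < nAtoms) {
> >> > >         float dx = crd[i].x - crdAtLastBuild[i].x;
> >> > >         float dy = crd[i].y - crdAtLastBuild[i].y;
> >> > >         float dz = crd[i].z - crdAtLastBuild[i].z;
> >> > >         if (dx * dx + dy * dy + dz * dz > halfSkin * halfSkin)
> >> > >             atomicExch(rebuildFlag, 1);  // any single violation forces a rebuild
> >> > >     }
> >> > > }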
> >> > >
> >> > > I am rethinking all this currently, and I have friends who think
> >> > > neighbor lists are obsolete if we move to higher timesteps and
> >> > > larger nonbond cutoffs, but that brings us to the question of how we
> >> > > handle exclusions, and that's a rabbit hole. But... coincidentally,
> >> > > SPFP's perfect force conservation can let you add and subtract them
> >> > > if you cap their magnitudes or use some variant of softcore to
> >> > > control the dynamic range. But are they doing anything like this?
> >> > > Details are everything!
> >> > >
> >> > > On Thu, May 13, 2021 at 1:39 PM Michael R Shirts <
> >> > > Michael.Shirts.colorado.edu> wrote:
> >> > >
> >> > > > > and they skipped calculating the Ewald Sum every other iteration
> >> > > > > (thanks Adrian!).
> >> > > >
> >> > > > In their semi-defense, IIRC, their default on all DESMOND
> >> > > > simulations for a while has been to do multiple timestepping of
> >> > > > forces, including the Ewald sum every other timestep. It's not
> >> > > > entirely clear to me if this is sufficiently accurate, and they
> >> > > > should definitely make it clearer that they are doing something
> >> > > > different, but it's a valid approach (one that more people should
> >> > > > be investigating!) and it's not just a sales trick. Not that there
> >> > > > aren't also sales tricks out there.
> >> > > >
> >> > > > Best,
> >> > > > ~~~~~~~~~~~~~~~~
> >> > > > Michael Shirts
> >> > > > Associate Professor
> >> > > > michael.shirts.colorado.edu
> >> > > > http://www.colorado.edu/lab/shirtsgroup/
> >> > > > Phone: (303) 735-7860
> >> > > > Office: JSCBB C123
> >> > > > Department of Chemical and Biological Engineering
> >> > > > University of Colorado Boulder
> >> > > >
> >> > > >
> >> > > > On 5/13/21, 1:27 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
> >> > > >
> >> > > > So, we're all getting our knickers in a bunch over an
> >> > > > apples-to-oranges Desmond-to-AMBER performance comparison.
> >> > > >
> >> > > > Please don't...
> >> > > >
> >> > > > They cheated, because that's what they do to keep their investors
> >> > > > happy. They used a 32^3 grid, and they skipped calculating the
> >> > > > Ewald sum every other iteration (thanks Adrian!). Rather than get
> >> > > > upset here, point and laugh at DE Shaw et al. for being afraid to
> >> > > > go head to head with AMBER, and if they do (and they won't,
> >> > > > because they're chicken, bawk bawk bawk), we have the people to
> >> > > > address that as well.
> >> > > >
> >> > > > At our end, there's a ~50% or so performance deficit in AMBER 20
> >> > > > we need to fix. I've already fixed 2/3 of that in building PMEMD
> >> > > > 2.0 (770 ns/day DHFR at a 2 fs timestep already). Let them prance
> >> > > > about with their greasy kid stuff of desperate approximations and
> >> > > > cheats; SPFP remains performance and accuracy without compromise,
> >> > > > and if they want to pick a fight with SPFP, make them do the work
> >> > > > to demonstrate equivalent numerical stability (spoilers: they
> >> > > > won't, because they can't, but oh the bellyaching and handwaving
> >> > > > they are willing to do, just watch).
> >> > > >
> >> > > > Scott
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri May 14 2021 - 05:30:02 PDT