Re: [AMBER-Developers] Desmond SUX

From: David Cerutti <dscerutti.gmail.com>
Date: Fri, 14 May 2021 13:10:15 -0400

So, what worked with that "honeycomb" code was:
1.) If you had a small enough system, the pairlist build spread quickly
over the card and occupied all the SMs; in the JAC / DHFR case it was
quicker than the current pmemd build. But once pmemd's Hilbert-space
pairlist build gets up to speed, at 100k atoms and beyond it was much more
efficient than the compute-tons-of-pairwise-distances method I was doing.
2.) Separate cutoffs--the shorter electrostatic cutoff and the much lower
density of vdW interactions--came together for a 10-15% win in essentially
any circumstances (see the sketch after this list).
3.) A water model where the electrostatic and vdW sites are indeed separate
particles, e.g. TIP4P-Ew, was also good for a 10-15% win, independent of
and additive to the cutoff savings.
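
To make point 2 concrete, the dual-cutoff test amounts to something like
this in the inner loop (a minimal sketch with placeholder names and a bare
Coulomb term, not the honeycomb kernel's actual arithmetic, which would use
the erfc()-switched form that pairs with PME):

  // Dual-cutoff pair term: electrostatics tested against a shorter cutoff
  // than van der Waals.  Sketch only.
  __device__ float3 dualCutoffPair(float3 dr, float qiqj, float lja,
                                   float ljb, float cutElecSq, float cutVdwSq)
  {
    float r2 = dr.x*dr.x + dr.y*dr.y + dr.z*dr.z;
    float invr2 = 1.0f / r2;
    float fmag = 0.0f;
    if (r2 < cutElecSq) {
      fmag += qiqj * invr2 * sqrtf(invr2);
    }
    if (r2 < cutVdwSq && (lja != 0.0f || ljb != 0.0f)) {
      // Most sites in a TIP3P or TIP4P-Ew water box carry no vdW
      // parameters at all and never take this branch.
      float invr6 = invr2 * invr2 * invr2;
      fmag += (12.0f*lja*invr6 - 6.0f*ljb) * invr6 * invr2;
    }
    return make_float3(fmag*dr.x, fmag*dr.y, fmag*dr.z);
  }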

What didn't work was the following:
1.) I was way too perfectionist in building the pairlist, making 4x8 tiles
plus singleton, 1:1-listed interactions. Having to do so many atomicAdd
ops to __shared__ integers is not a good strategy (see the sketch after
this list).
2.) Way too fine a dissection of the unit cell--I was cutting the thing
into I think 1/4 (cutoff + buffer) thickness pencils. The atom lists were
linear in their z positions along each pencil, which meant they advanced
far too quickly in z to give me good groups of mutually interacting atoms.
It made the tiles too small, and often I couldn't even make the 4x8
tiles--I had to roll with singletons, which meant horrible amounts of GMEM
reading just to know what to calculate.
3.) No exploitation of accept masks.
4.) Tiles too small.
5.) Trying to eliminate the 1:4 interactions from the non-bonded exclusion
list--I was trying to roll with only 1:2 and 1:3 exclusions, and doing the
1:4 exclusions as part of the bonded work units cost me both time and
precision.
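
For what it's worth, the standard way around that kind of atomicAdd
contention is a warp-aggregated append: ballot the accepting threads, do
one atomicAdd per warp, and place each entry with a prefix popcount. A
rough sketch (generic CUDA, not the honeycomb builder itself; every name
here is made up, and it assumes blocks launched as whole warps with the
accepted pairs fitting in MAX_TILE_PAIRS):

  #include <cuda_runtime.h>

  #define MAX_TILE_PAIRS 4096

  // Append accepted candidate pairs to a __shared__ list with one
  // atomicAdd per warp instead of one per accepting thread.
  __global__ void kBuildList(const int2 *candidates, const float *r2,
                             float cutSq, int nCand, int2 *gmemList,
                             int *gmemCount)
  {
    __shared__ int nPairs;
    __shared__ int blockBase;
    __shared__ int2 pairList[MAX_TILE_PAIRS];
    if (threadIdx.x == 0) {
      nPairs = 0;
    }
    __syncthreads();

    const int lane = threadIdx.x & 31;
    const int nIter = (nCand + blockDim.x - 1) / blockDim.x;
    for (int it = 0; it < nIter; it++) {
      const int idx = it*blockDim.x + threadIdx.x;
      const bool keep = (idx < nCand) && (r2[idx] < cutSq);

      // The painful version has every accepting thread do its own
      //   int slot = atomicAdd(&nPairs, 1);
      // Aggregated version: one ballot, one atomicAdd by a leader lane.
      const unsigned int vote = __ballot_sync(0xffffffffu, keep);
      if (vote != 0u) {
        const int leader = __ffs(vote) - 1;
        int base = 0;
        if (lane == leader) {
          base = atomicAdd(&nPairs, __popc(vote));
        }
        base = __shfl_sync(0xffffffffu, base, leader);
        if (keep) {
          pairList[base + __popc(vote & ((1u << lane) - 1u))] = candidates[idx];
        }
      }
    }
    __syncthreads();

    // Flush the block's list to global memory in one contiguous chunk.
    if (threadIdx.x == 0) {
      blockBase = atomicAdd(gmemCount, nPairs);
    }
    __syncthreads();
    for (int i = threadIdx.x; i < nPairs; i += blockDim.x) {
      gmemList[blockBase + i] = pairList[i];
    }
  }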

How to fix:
I really think that a 1/2 (cutoff + buffer) thickness on the pencils
coupled with a move to an 8x8 tile strategy will recover most of the
benefits of any tile approach and mitigate the memory bandwidth problems.
Introducing an accept mask strategy could also help more quickly sort
through the pencils to pick out all the relevant interactions. You'd end
up with the following record: [ starting atom for a string of 8
consecutive "x" atoms ] [ 64-bit exclusion mask ] [ 64-bit transposed
exclusion mask ] [ "y" atom 0 ] [ "y" atom 1 ] ... [ "y" atom 7 ]. That's
a total of 13 32-bit unsigned ints. Two of those back-to-back is 26, which
is two 8x8 tiles and precisely what I think a warp should be doing. The
warp reads its marching orders and gets 26 relevant integers out of 32,
which is pretty good use of a memory transaction and, by my
back-of-the-envelope calculations, rivals what the current pmemd non-bond
list is doing. That read should actually be padded such that the "y" atoms
are read by the threads that will actually use them, so three blanks after
each pair of exclusion masks rather than at the very end (see the sketch
after this list). Then this happens:
- Threads 0-7 read tile I's "x" atoms and __shfl to get tile I's
un-transposed exclusion mask
- Threads 8-15 read tile I's "y" atoms and __shfl to get tile I's
transposed exclusion mask
- Threads 16-23 read tile II's "x" atoms and __shfl to get its
un-transposed exclusion mask
- Threads 24-31 read tile II's "y" atoms and __shfl to get its transposed
exclusion mask
- Threads 0-7 __shfl from threads 8-15 to get their "y" atoms while threads
8-15 __shfl to get their "x" atoms from threads 0-7. Same with threads
16-23 and threads 24-31.
- All threads compute interactions.
- Each thread realizes that some other thread was __shfl-reading from it,
and calculates which thread that was. All threads then go to their
partners, who have force information relevant to their own "x" or "y"
atoms. (Note that this strategy can also work for a single 16x16 tile,
which I use in mdgx GB simulations. The benefit is to improve granularity
of tiles in general, and conserve registers by not having any one thread
ever need to store information on more than a single atom.)
- After four cycles, the 32 threads have done all interactions in both 8x8
tiles. Convert FINAL TILE RESULTS to int4 format, storing results in
roughly 38 bits of fixed precision (each force component gets 32 bits plus
ten bits of overflow from that INT4.w int... it's not feasible to get 42
bits of actual precision this way but 37-38 absolutely and that's all you
need).
- Write results to __shared__, or if the import just had too many atoms
have a bailout mechanism for these writes to go back to GMEM (that will be
extremely rare if the rest of the code is written correctly, even for nutty
users who want to use TIP5P and 14A cutoffs... there is enough __shared__
per 3 blocks to store forces on 1300 atoms, plenty for the NT method I have
outlined).
- Finish all (pairs of) 8x8 tiles for a given collection of NT regions... I
expect the typical DHFR benchmark to have 504 collections based on a 62A
cube and a 9A cutoff with a 1.333A buffer.
- Write results back to GMEM. Note that, even for a straight-from-topology
ordering of the atoms, the 16-byte nature of each atom's forces and the
tendency of 3-4 atoms to always be near each other would imply decent
coalescence even in these GMEM transactions, so from outside the nonbonded
calculation any other developer would see the original ordering of atoms at
all times.
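
For concreteness, here is one way to lay out that padded record and the
warp-level read that goes with it (a sketch only; the field names and the
byte-per-atom mask convention are mine):

  // One 8x8 tile record, padded to 16 uints so that two records make a
  // single 128-byte warp read and the eight "y" atom indices land on the
  // lanes that will use them (8-15 for tile I, 24-31 for tile II).
  struct TileRecord {
    unsigned int xStart;     // first of 8 consecutive "x" atoms
    unsigned int excl[2];    // 64-bit exclusion mask
    unsigned int exclT[2];   // the same mask, transposed
    unsigned int pad[3];     // three blanks after the mask pair
    unsigned int yAtom[8];   // the eight "y" atom indices
  };

  // Warp-level read of a pair of records: one coalesced load, then
  // __shfl_sync to hand each lane its atom index and its byte of the
  // (transposed) exclusion mask.
  __device__ void readTilePair(const unsigned int *tileData, int pairIdx,
                               unsigned int &atomIdx, unsigned int &exclByte)
  {
    const int lane = threadIdx.x & 31;
    const int half = lane & 16;          // 0 = tile I, 16 = tile II
    const int sub  = lane & 15;          // word slot within one record
    const unsigned int word = tileData[32*pairIdx + lane];

    // Every lane participates in the shuffles; only the source differs.
    const unsigned int xStart = __shfl_sync(0xffffffffu, word, half);
    const int mSrc = (sub < 8) ? half + 1 + (sub >> 2)          // excl, words 1-2
                               : half + 3 + ((sub - 8) >> 2);   // exclT, words 3-4
    const unsigned int mWord = __shfl_sync(0xffffffffu, word, mSrc);

    // Lanes 0-7 of each half are "x" atoms (consecutive from xStart);
    // lanes 8-15 already hold their "y" atom index in their own word.
    atomIdx  = (sub < 8) ? xStart + sub : word;
    exclByte = (mWord >> (8 * (sub & 3))) & 0xffu;  // this atom's 8 exclusion bits
  }

After this read, each lane carries one atom and the eight exclusion bits it
needs for the four-cycle loop above.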

For a typical 9A cutoff in a typical TIP3P system, eight atoms will tend to
occupy a pencil prism (1/2) (cutoff + buffer) thick and (1/3) (cutoff +
buffer) tall as measured from the base, so that's also a very nice,
spherical-as-possible volume of space to start from. For vdW-only
interactions eight atoms would occupy, on average, a pencil prism of the
same width but (cutoff + buffer) in height. Still not too bad.
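
As a rough check on that eight-atom figure (assuming hexagonal pencils
measured across the flats, in keeping with the honeycomb decomposition, and
ambient TIP3P water at about 0.100 atoms per cubic Angstrom):

  width  = (1/2)(9 + 1.333) A        ~  5.17 A
  area   = (sqrt(3)/2)(5.17 A)^2     ~ 23.1 A^2
  height = (1/3)(9 + 1.333) A        ~  3.44 A
  volume ~ 23.1 x 3.44               ~ 80 A^3
  atoms  ~ 80 x 0.100                ~  8

so ambient water density does put roughly eight atoms in each such prism.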

That's the best I can give you... pounce, Simba!

Dave


On Fri, May 14, 2021 at 12:15 PM Scott Le Grand <varelse2005.gmail.com>
wrote:

> Well exactly, and I'm happy to be the James Randi of Molecular Dynamics
> after spending a post-doc in the 90s finding out how (seemingly
> intentionally) irreproducible so much computational science research turns
> out to be unless they give you source code. And while that's water under
> the bridge, the wisdom and the scars still stand for me. It served me well
> with respect to the AI boom too, because it was the embodiment of SSND.
>
> AMBER has unique numerical stability and bitwise reproducible execution.
> But it's not entirely invalid to state that accuracy errors in the science
> swamp any errors in the precision, so put a thermostat on it and cross
> your fingers. It is, however, craptastic engineering to give up on
> deterministic, reproducible execution, if only because doing so hides race
> conditions, uninitialized variables, and a flurry of other bugs. I don't
> have access to Desmond's source today because I'm in industry, and likely
> never will because I'm not a professor. But I would love a CUDA profile of
> its execution to understand the nature of the shortcuts they take. We can
> always add them to AMBER's SPXP mode, but a lot these days is more
> memory-bound than math-bound on SM 7 and beyond, so anything that reduces
> memory traffic is immediately going to speed things up.
>
> Scott
>
> On Fri, May 14, 2021 at 7:17 AM Gustavo Seabra <gustavo.seabra.gmail.com>
> wrote:
>
> > Thanks Dave for posting the link. I've been following the discussions
> > trying to figure out what they were talking about ;-)
> >
> > I find the discussion very interesting, though, and I see the benchmarks
> > really are impressive in terms of speed, and I do not want to enter into
> > motive considerations here.
> >
> > However, I do have one question that does not seem to be addressed in
> > the benchmarks. How precise are the calculations? With such speed,
> > should it not be possible to calculate a number of physical quantities
> > and compare to experimental values? IMHO, that is the only way to assess
> > if the "tricks" used are really good science. Do they break anything, or
> > lead to the wrong behavior of the system?
> >
> > Ultimately, I'd rather wait longer to get correct results. But of course,
> > that is also valid for Amber benchmarks.
> >
> > All the best,
> > --
> > Gustavo Seabra.
> >
> >
> > On Thu, May 13, 2021 at 6:03 PM David Cerutti <dscerutti.gmail.com>
> > wrote:
> >
> > > For reference, here are the benchmarks that I think people are talking
> > > about:
> > > Desmond-GPU Performance as of April 2021.pdf (deshawresearch.com)
> > > <https://www.deshawresearch.com/publications/Desmond-GPU%20Performance%20as%20of%20April%202021.pdf>
> > >
> > > Desmond uses a different long-ranged summation, the "U-series," which
> > > was a bit of a dance of the seven veils and then turned out to be very
> > > similar to other P3M techniques, SPME included. The U-series was the
> > > way they got to 8 fs between updates to the long-ranged component of
> > > the electrostatics. Regardless of what it is, though, I'll say that my
> > > own experiences in multiple time stepping (see mdgx GB) don't leave
> > > much room to go higher than 5 fs in any component of the force. Long
> > > ago, circa 2015, their DHFR benchmark was much faster than Amber (the
> > > 50% Scott is alluding to), which seems to be a ratio they have
> > > maintained over the years, but it's now more in line with the rest of
> > > the benchmarks--one can compute the number of atoms moved by the code
> > > in a given time and see that the STMV case is, indeed, moving
> > > substantially more than DHFR. It's pretty impressive that they can do
> > > ten million atoms, but of course that's more of a stunt (I would have
> > > been more inclined to do eight virions in a cube). That said, the
> > > Desmond folks do some pretty wild orchestration of how many fmafs and
> > > other arithmetic ops they can pull out of each cycle, so while their
> > > numbers may be tweaked according to any given standard, my feeling is
> > > that "sales" are not a big incentive for them to cook the books.
> > >
> > > You can surely get more performance out of pmemd on the smaller
> > > systems if you multiply the systems it simulates at one time. 2300 ns
> > > per day with DHFR on one of the top-end Ampere cards shouldn't be out
> > > of the question. This should be one of the highest priorities in any
> > > renovations to the engine, as most pharma outfits study problems of
> > > 25-50k atoms, must run many windows before getting a single answer,
> > > and always have more compounds to test than GPUs to do it. What I
> > > would also point out is that anything happening to pmemd's CUDA
> > > component is stuck behind some very old Fortran code, with Pieces of a
> > > System flying around in a manner that's almost as depressing as the
> > > film with Vanessa Kirby. Rebuild the 100k lines of Fortran in C++ with
> > > accessible, well-engineered structs that are hard to break.
> > > Topologies, coordinates, and simulation protocols can all be structs
> > > passed around and created or destroyed as needed by a protocol. Give
> > > them each pointer structs that can be copied to the GPU in a manner
> > > analogous to cSim today, or preferably as multiple, focused pointer
> > > structs that become kernel arguments when the actual kernel is
> > > launched (the long-ranged electrostatic kernel doesn't need to know
> > > about the bonded parameter constants, for example--a Prmtop struct can
> > > have multiple pointer substructures tailored for different parts of
> > > the force calculation). Make the kernels for producing work units
> > > operate on arrays of such structs, so that a force kernel will
> > > seamlessly stride from one system to the next as it plays its part in
> > > any given time step. You should const as much as possible, but const
> > > auto may be something to use sparingly, so that new developers will
> > > become better immersed in the actual nuts and bolts of the code by
> > > seeing the actual data types. That will give upcoming graduate
> > > students more to work with and help them to understand the CUDA code
> > > as something much more C / C++ -like.
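> > >
> > > A minimal sketch of what one of those focused kits could look like
> > > (the names and layout are illustrative, not an actual pmemd
> > > interface):
> > >
> > >   // Illustrative only: a bonded-term "kit" holding just the pointers
> > >   // that the bond kernel needs, passed by value as a kernel argument.
> > >   struct BondedKit {
> > >     int nbond;                 // bond terms across all batched systems
> > >     const int2 *bondIDs;       // atom indices i, j for each bond
> > >     const float2 *bondParms;   // { keq, req } for each bond
> > >     const double *crd;         // coordinates (x0, y0, z0, x1, ...)
> > >     unsigned long long *frc;   // fixed-precision force accumulators
> > >   };
> > >
> > >   __global__ void kBondForces(BondedKit kit)
> > >   {
> > >     const double FPSCALE = 4294967296.0;   // 2^32, illustrative scaling
> > >     for (int pos = blockIdx.x*blockDim.x + threadIdx.x; pos < kit.nbond;
> > >          pos += gridDim.x*blockDim.x) {
> > >       const int2 ij = kit.bondIDs[pos];
> > >       const float2 kr = kit.bondParms[pos];
> > >       const double dx = kit.crd[3*ij.y    ] - kit.crd[3*ij.x    ];
> > >       const double dy = kit.crd[3*ij.y + 1] - kit.crd[3*ij.x + 1];
> > >       const double dz = kit.crd[3*ij.y + 2] - kit.crd[3*ij.x + 2];
> > >       const double r  = sqrt(dx*dx + dy*dy + dz*dz);
> > >       const double fmag = -2.0 * kr.x * (r - kr.y) / r;
> > >       // Two's-complement wraparound lets signed contributions
> > >       // accumulate correctly in unsigned 64-bit accumulators.
> > >       const long long fx = __double2ll_rn(fmag * dx * FPSCALE);
> > >       const long long fy = __double2ll_rn(fmag * dy * FPSCALE);
> > >       const long long fz = __double2ll_rn(fmag * dz * FPSCALE);
> > >       atomicAdd(&kit.frc[3*ij.y    ], (unsigned long long)(fx));
> > >       atomicAdd(&kit.frc[3*ij.y + 1], (unsigned long long)(fy));
> > >       atomicAdd(&kit.frc[3*ij.y + 2], (unsigned long long)(fz));
> > >       atomicAdd(&kit.frc[3*ij.x    ], (unsigned long long)(-fx));
> > >       atomicAdd(&kit.frc[3*ij.x + 1], (unsigned long long)(-fy));
> > >       atomicAdd(&kit.frc[3*ij.x + 2], (unsigned long long)(-fz));
> > >     }
> > >   }
> > >
> > > Because the bond lists for a whole batch of systems can be
> > > concatenated into one array, the same grid strides across every system
> > > in a single launch, and the reciprocal-space kernel would take its
> > > own, equally small kit.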
> > >
> > > Don't gnash your teeth over what DE Shaw's guys have achieved. The
> > > things that drive sales are utility and unique capabilities, two
> > > things that Amber has done pretty well with despite being the product
> > > of a handful of research groups who mostly prefer to see everyone
> > > staying in their respective lanes. Standardize what a "topology" is
> > > and make a clean, efficient, extensible tool for creating systems.
> > > That should be the first stop for anyone thinking of adding new
> > > components to the force field or a free energy protocol. Document the
> > > hell out of everything. Stop relying on one Bob, or Scott, or me, or
> > > Taisung, or Scott again to MakeABetterEngine.cu. That needs to be a
> > > community activity, and it will improve the employment prospects of
> > > your students to have them involved in professional python / C++ /
> > > CUDA programming. Be honest about your benchmarks and make a new
> > > section of the website as an exposition of Amber's free energy
> > > capabilities. It shouldn't take five years for advertising that
> > > doesn't support the group interest to be taken off the website, or for
> > > a researcher with unique ideas and much stronger associations to the
> > > consortium to finally get priority over an undergraduate who left the
> > > group years earlier. Even an academic organization with $350,000
> > > annual revenue shouldn't continue to rely on a former member to donate
> > > his time and money just to keep their CI up and running, regardless of
> > > his generosity in doing so. The DE Shaw Group is a professional
> > > organization of extremely talented, probably overworked individuals
> > > united by their goals of advancing molecular simulations. Stop
> > > comparing the benchmarks unless you want to start comparing the
> > > organizations.
> > >
> > > Dave
> > >
> > >
> > > On Thu, May 13, 2021 at 4:48 PM Scott Le Grand <varelse2005.gmail.com>
> > > wrote:
> > >
> > > > To me, it's a sales trick until they demonstrate numerical stability
> > > > to the level Ross and I did with SPFP and SPDP. Have they? But even
> > > > if it's not that stable, at least customers can make an informed
> > > > choice with such data, no? Also, how often are they rebuilding the
> > > > neighbor list? Is it a fixed interval like GROMACS or is there a
> > > > skin test?
> > > >
> > > > I am rethinking all this currently, and I have friends who think
> > > > neighbor lists are obsolete if we move to higher timesteps and
> > > > larger nonbond cutoffs, but that brings us to how we handle
> > > > exclusions, and that's a rabbit hole. But... coincidentally, SPFP's
> > > > perfect force conservation can let you add and subtract them if you
> > > > cap their magnitudes or use some variant of softcore to control
> > > > dynamic range. But are they doing anything like this? Details are
> > > > everything!
> > > >
> > > > On Thu, May 13, 2021 at 1:39 PM Michael R Shirts <
> > > > Michael.Shirts.colorado.edu> wrote:
> > > >
> > > > > > and they skipped calculating the Ewald Sum every other iteration
> > > > > > (thanks Adrian!).
> > > > >
> > > > > In their semi-defense, IIRC, their default on all DESMOND
> > > > > simulations for a while has been to do multiple timestepping of
> > > > > forces, including the Ewald sum every other timestep. It's not
> > > > > entirely clear to me if this is sufficiently accurate, and they
> > > > > definitely should make it clearer that they are doing something
> > > > > different, but it's a valid approach (that more people should be
> > > > > investigating!) and it's not just a sales trick. Not that there
> > > > > aren't also sales tricks out there.
> > > > >
> > > > > Best,
> > > > > ~~~~~~~~~~~~~~~~
> > > > > Michael Shirts
> > > > > Associate Professor
> > > > > michael.shirts.colorado.edu
> > > > > http://www.colorado.edu/lab/shirtsgroup/
> > > > > Phone: (303) 735-7860
> > > > > Office: JSCBB C123
> > > > > Department of Chemical and Biological Engineering
> > > > > University of Colorado Boulder
> > > > >
> > > > >
> > > > > On 5/13/21, 1:27 PM, "Scott Le Grand" <varelse2005.gmail.com>
> > > > > wrote:
> > > > >
> > > > > So, we're all getting our knickers in a bunch over an
> > > > > apples-to-oranges Desmond-to-AMBER performance comparison.
> > > > >
> > > > > Please don't...
> > > > >
> > > > > They cheated, because that's what they do to keep their investors
> > > > > happy. They used a 32^3 grid, and they skipped calculating the
> > > > > Ewald Sum every other iteration (thanks Adrian!). Rather than get
> > > > > upset here, point and laugh at DE Shaw et al. because they are
> > > > > afraid to go head to head with AMBER, and if they do (and they
> > > > > won't, because they're chicken, bawk bawk bawk), we have the
> > > > > people to address that as well.
> > > > >
> > > > > At our end, there's a ~50% or so performance deficit in AMBER 20
> > > > > we need to fix. I've already fixed 2/3 of that building PMEMD 2.0
> > > > > (770 ns/day DHFR at 2 fs already). Let them prance about with
> > > > > their greasy-kid-stuff desperate approximations and cheats; SPFP
> > > > > remains performance and accuracy without compromise, and if they
> > > > > want to pick a fight with SPFP, make them do the work to
> > > > > demonstrate equivalent numerical stability (spoilers: they won't,
> > > > > because they can't, but oh, the bellyaching and handwaving they
> > > > > are willing to do, just watch).
> > > > >
> > > > > Scott
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers