Re: [AMBER-Developers] CMake in Amber

From: Scott Le Grand <varelse2005.gmail.com>
Date: Sun, 4 Apr 2021 13:30:57 -0700

1. I'm all for a better JAC NVE benchmark of about the same size, but as it
stands JAC is a great strawman for multi-GPU. Also, STMV rebuilds the NL every
other iteration because it ends up with a skinnb of ~1.05 Å as opposed to
JAC's 2.37 Å. Halving the nonbond cells will help with that a bit, but
ironically, just naively using a 2 Å skin with STMV gets about the same perf.
2. My assumption is that I am like Charlton Heston in The Omega Man and I am
on my own here. If there's help, there's so much more we can get done. But my
goal is to turn AMBER into something built on external libraries, with its
integrator accessible from AI frameworks and with plug-ins for new science, so
we don't reach the spaghetti-code singularity as quickly the next time around.
Running multiple simulations within the same process is on the list. I just
wish the guys at Dart hadn't been so evasive with me about needing it, or it
could have been added a long time ago.
3. I try to keep kernels simple. That makes them easier to refactor for
future HW. You guys hit peak perf on SM 7, but a lot of what you did seems to
get hammered on SM 8. But hey, one of the Autodock guys called me a bully for
raising that point about GPU architectures, so I think we're good here. You
really have to recalculate the kernel blocking once per HW generation because
the limits go all over the place (a sketch of that kind of per-device tuning
follows below):
https://en.wikipedia.org/wiki/CUDA

And even as a distinguished engineer, I cannot predict the future. So, as a
good friend used to say: "future proofing is best done in the future."
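
To make point 3 concrete, here is a minimal sketch of picking the launch shape
per device at runtime instead of hard-coding the blocking; the kernel and its
arguments are hypothetical stand-ins, not actual pmemd code:

#include <cuda_runtime.h>

// Hypothetical stand-in for a pmemd-style force kernel.
__global__ void ForceKernelStub(const float4* crd, float4* frc, int atoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < atoms)
        frc[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);  // real work goes here
}

// Ask the occupancy API for a block size at startup so the launch shape tracks
// the SM generation actually present (register, shared memory, and thread
// limits all move from one architecture to the next).
void LaunchForceKernel(const float4* crd, float4* frc, int atoms)
{
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       ForceKernelStub, 0, 0);
    int gridSize = (atoms + blockSize - 1) / blockSize;
    ForceKernelStub<<<gridSize, blockSize>>>(crd, frc, atoms);
}

It won't beat hand-tuned blocking for a heavily optimized kernel, but it is a
sane default the day a new generation lands.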


On Sun, Apr 4, 2021 at 1:18 PM David Cerutti <dscerutti.gmail.com> wrote:

> My suggestion on the JAC 2fs benchmark would be to look at the temperature
> of that system: it's up around 400K, which is going to be breaking the
> pairlist much more frequently and isn't really safe for 2fs NVE in the
> first place. I'm with you on the idea of looking at small systems to find
> better ways to distribute over many cores and more SMs (one metric might
> even be the overall amount of L1 that's coming with those expanding arrays
> of multiprocessors). But make sure to have a good handle on the science
> and what people want to do with these calculations before committing to raw
> speed on an ordinary, small system.
>
> If you're building pmemd2.0, is it just the CUDA part or will the
> underlying Fortran code get refactored and renovated? My feeling is that a
> professional programmer could recode pmemd as C++ with added features to
> read any number of topologies and run multiple systems at once, coupled only
> as much as the problem at hand wants them to be (or not). Four
> copies of JAC running in one executable would probably run at 2x the
> original throughput, especially if you can batch the FFTs, and even smaller
> systems like a 10-15k atom hydration free energy calculation could run 4-8x
> batched calculations at more than 4x the original throughput. The free
> energy calculations and most other things people are really hoping to do
> with MD require completing multiple related trajectories, not one flat-out
> equilibrium run.
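
As a sketch of the FFT batching idea, assuming several identical replicas
sharing one PME grid size (the plan parameters and names here are
illustrative, not pmemd's):

#include <cufft.h>

// Fold the 3D real-to-complex PME charge-grid transforms of several identical
// replicas into a single batched cuFFT plan so they launch as one call.
cufftHandle MakeBatchedPmePlan(int nx, int ny, int nz, int nReplicas)
{
    cufftHandle plan;
    int dims[3] = { nx, ny, nz };
    // NULL inembed/onembed tells cuFFT the transforms are tightly packed; the
    // stride/dist values shown describe that packed layout explicitly.
    cufftPlanMany(&plan, 3, dims,
                  NULL, 1, nx * ny * nz,            // packed real input
                  NULL, 1, nx * ny * (nz / 2 + 1),  // packed complex output
                  CUFFT_R2C, nReplicas);
    return plan;
}

A single cufftExecR2C(plan, realGrids, complexGrids) then transforms all of
the replicas' grids, stored back to back in device memory, in one launch.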
>
> But, if you can keep up the 64-bit accumulation and maintain the perf,
> that's some very impressive numerical acrobatics. Will be interesting to
> see the results, as there are classes of calculations that require double
> precision in many more places than standard MD. Also I'm cheering for you
> if you can find a use for these fancy new 32-bit tensor cores. Look at the
> mdgx GB code--that uses one warp to do a 16x16 tile, eight rounds of 32
> pair calculations at a time reading the x atoms into the first 16 lanes and
> the y atoms into the second 16 lanes. May help you in the PME refactor, or
> get you to that magic number 16 for feeding into whatever tensor core
> operations.
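
A stripped-down sketch of the warp-per-16x16-tile pattern described above; the
pair interaction is a bare Coulomb placeholder, and the lane/offset bookkeeping
is illustrative rather than the actual mdgx GB indexing:

// One warp owns a 16x16 tile: lanes 0-15 hold the x atoms, lanes 16-31 hold
// the y atoms. Eight rounds of shuffles cover all 256 pairs, 32 per round:
// the x half walks pair offsets 0-7 while the y half walks offsets 8-15, so
// no pair is computed twice.
__device__ float WarpTile16x16Energy(float3 myAtom, float myCharge)
{
    const unsigned mask = 0xffffffffu;
    int lane = threadIdx.x & 31;
    bool isX = (lane < 16);
    float energy = 0.0f;
    for (int r = 0; r < 8; r++) {
        int partner = isX ? 16 + ((lane + r) & 15)         // x_i meets y_(i+r)
                          : (((lane - 16) + 8 - r) & 15);  // y_j meets x_(j+8-r)
        float px = __shfl_sync(mask, myAtom.x, partner);
        float py = __shfl_sync(mask, myAtom.y, partner);
        float pz = __shfl_sync(mask, myAtom.z, partner);
        float pq = __shfl_sync(mask, myCharge, partner);
        float dx = px - myAtom.x, dy = py - myAtom.y, dz = pz - myAtom.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        energy += myCharge * pq * rsqrtf(r2);  // placeholder pair interaction
    }
    return energy;  // caller reduces across the warp; forces omitted for brevity
}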
>
> Keep up the great work,
> Dave
>
>
> On Sun, Apr 4, 2021 at 4:02 PM Scott Le Grand <varelse2005.gmail.com>
> wrote:
>
> > PPPS There should be no explicit limit on TI atoms whatsoever. AMBER should
> > handle that under the hood to let the scientists science the &h!+ out of
> > things, fight me on that.
> >
> > On Sun, Apr 4, 2021 at 12:58 PM Scott Le Grand <varelse2005.gmail.com>
> > wrote:
> >
> > > PPS the future* would appear to be more cores of approximately the same
> > > computational power as Ampere, not the same number of cores but beefier.
> > > As such, we need to figure out how to distribute the same basis set of
> > > calculations across more cores going forward. Doubly so now that we have
> > > NVLINK, an interconnect that makes multi-GPU not suck.
> > >
> > > *A prediction pulled entirely from my Easter Bonnet(tm) based on the
> > > progression from SM 5 to SM 8, and which should not at all be construed
> > > as insider information because it's not.
> > >
> > > On Sun, Apr 4, 2021 at 12:50 PM Scott Le Grand <varelse2005.gmail.com>
> > > wrote:
> > >
> > >> PS I'm killing off both TI paths and writing the path I wanted to write
> > >> in the first place, one that exploits both the original Uber-kernels and
> > >> Taisung's multi-streaming variant, incorporating Darren's and Taisung's
> > >> improvements in the science whilst doing so. After those six impossible
> > >> things or so, breakfast...
> > >>
> > >> On Sun, Apr 4, 2021 at 12:46 PM Scott Le Grand <varelse2005.gmail.com>
> > >> wrote:
> > >>
> > >>> 1) Because that's the benchmark we've used since day one. Apples to
> > >>> apples and all that. It's a relatively small system for single GPU, which
> > >>> is the perfect stand-in for large system multi-GPU efficiency. My goal is
> > >>> 4x scaling on 8 GPUs with a positive scaling experience beyond that,
> > >>> relaxing system size limits up to 1B atoms in the process. If JAC gets
> > >>> faster, we can scale farther.
> > >>> 2) Because the path to AMBER 20 broke multiple implicit assumptions in
> > >>> my design* for AMBER, so I went back in time to change the future. All
> > >>> relevant functionality will be restored over time, but I spent 6 months
> > >>> of 2020 trying to do exactly that before throwing my arms up in utter
> > >>> frustration. The alternative is walking away from all the code and
> > >>> starting a new framework.
> > >>> 3) RTX3090**
> > >>> 4) Remember, this is PMEMD 2.0 we're building here. It's been almost 12
> > >>> years; it's time to rewrite.
> > >>>
> > >>> But... Your local force code still shows an acceleration over the
> > >>> original local force code even at full 64-bit accumulation. So that's
> > >>> getting refactored along the way. Everything else so far is a perf
> > >>> regression without the precision model changes, alas. But... you and I
> > >>> have accidentally created a working variant of SPXP. Your stuff will live
> > >>> again in its revival, and you get first authorship IMO because while it's
> > >>> great work, it's *not* SPFP with those precision changes in place (18-bit
> > >>> mantissa? C'mon man(tm)...)
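
For context on the "full 64-bit accumulation" above: the SPFP-style trick is to
compute forces in FP32 but accumulate them in 64-bit fixed point, which makes
the sums order-independent and deterministic. A minimal sketch of the idea
follows; the scale factor and names are illustrative, not the actual pmemd
constants:

// Scale an FP32 force contribution into 64-bit fixed point and accumulate it
// with an integer atomic; wrap-around in two's complement handles the sign.
__device__ void AccumulateForce(unsigned long long* frcAcc, int atom,
                                float fx, float fy, float fz)
{
    const float scale = (float)(1ll << 40);  // illustrative, not the SPFP value
    atomicAdd(&frcAcc[3 * atom + 0], (unsigned long long)llrintf(fx * scale));
    atomicAdd(&frcAcc[3 * atom + 1], (unsigned long long)llrintf(fy * scale));
    atomicAdd(&frcAcc[3 * atom + 2], (unsigned long long)llrintf(fz * scale));
}

// At the end of the step each component converts back with
// (double)(long long)frcAcc[i] / scale before the integrator uses it.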
> > >>>
> > >>> *Should have spelled them out, but even I couldn't predict a priori the
> > >>> end of the CUDA Fellow program, which ended any support for further work.
> > >>> But now my bosses have let me work on it again as my day job.
> > >>> **
> > >>> https://www.exxactcorp.com/blog/Molecular-Dynamics/rtx3090-benchmarks-for-hpc-amber-a100-vs-rtx3080-vs-2080ti-vs-rtx6000
> > >>>
> > >>> On Sun, Apr 4, 2021 at 12:03 PM David Cerutti <dscerutti.gmail.com>
> > >>> wrote:
> > >>>
> > >>>> "Meanwhile, AMBER16 refactored to SM 7 and beyond is already hitting
> > 730
> > >>>> ns/day on JAC NVE 2 fs. AMBER20 with the grid interpolation and
> local
> > >>>> force
> > >>>> precision sub FP32 force hacks removed hits 572 ns/day (down from
> 632
> > if
> > >>>> left in as we shipped it). That puts me nearly 1/3 to my goal of
> > >>>> doubling
> > >>>> overall AMBER performance which is what is important to me and where
> > I'm
> > >>>> going to focus my efforts..."
> > >>>>
> > >>>> Please explain here.
> > >>>> 1.) Why are we back to using the old JAC NVE 2fs benchmark? The new
> > >>>> benchmarks were redesigned several years ago to make more uniform tests
> > >>>> and take settings that standard practitioners are now using.
> > >>>> 2.) Why is Amber16 being refactored rather than Amber20?
> > >>>> 3.) What does it mean to be hitting 730 ns/day? What card is being
> > >>>> compared here--the Amber20 benchmarks look like they could be a V100,
> > >>>> Titan-V, or perhaps an RTX-2080Ti.
> > >>>>
> > >>>>
> > >>>> On Sun, Apr 4, 2021 at 12:11 PM Scott Le Grand <varelse2005.gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>> > But getting back on topic, CUDA 7.5 is a 2015 toolkit and SM 5.x and
> > >>>> > below are deprecated now. SM 6 is a huge jump over SM 5, enabling true
> > >>>> > virtual memory, and I suggest deprecating support for SM 5 across the
> > >>>> > board. SM 7 and beyond alas mostly complicated warp programming and
> > >>>> > introduced tensor cores, which currently seem useless for straight MD
> > >>>> > but perfect for running AI models inline with MD.
> > >>>> >
> > >>>> > CUDA 8 is a 2017 toolkit. That's way too soon to deprecate IMO, and if
> > >>>> > cmake has ish with it, that's a reason not to use cmake, not a reason to
> > >>>> > deprecate CUDA 8.
> > >>>> >
> > >>>> >
> > >>>> > On Sun, Apr 4, 2021 at 8:55 AM Scott Le Grand <varelse2005.gmail.com>
> > >>>> > wrote:
> > >>>> >
> > >>>> > > Ross sent me two screenshots of cmake losing its mind with an 11.x
> > >>>> > > toolkit. I'll file an issue, but no, I'm not going to fix cmake issues
> > >>>> > > myself at all. I'm open to someone convincing me cmake is better than
> > >>>> > > the configure script, but no one has made that argument yet beyond
> > >>>> > > "because cmake" and until that happens, that just doesn't work for me.
> > >>>> > > Happy to continue helping with the build script that worked, until
> > >>>> > > convinced otherwise. Related: I still use nvprof, fight me.
> > >>>> > >
> > >>>> > > Meanwhile, AMBER16 refactored to SM 7 and beyond is already hitting
> > >>>> > > 730 ns/day on JAC NVE 2 fs. AMBER20 with the grid interpolation and
> > >>>> > > local force precision sub FP32 force hacks removed hits 572 ns/day
> > >>>> > > (down from 632 if left in as we shipped it). That puts me nearly 1/3
> > >>>> > > to my goal of doubling overall AMBER performance, which is what is
> > >>>> > > important to me and where I'm going to focus my efforts, as opposed
> > >>>> > > to the new shiny build system. That system is getting better (and I
> > >>>> > > *hate* cmake for cmake's sake), but we rushed it to production IMO,
> > >>>> > > like America reopened before the end of the pandemic.
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > > On Sun, Apr 4, 2021 at 5:51 AM David A Case <david.case.rutgers.edu>
> > >>>> > > wrote:
> > >>>> > >
> > >>>> > >> On Sat, Apr 03, 2021, Scott Le Grand wrote:
> > >>>> > >>
> > >>>> > >> >cmake is still not quite ready for prime time disruption of
> > >>>> > >> >configure. It's getting there though.
> > >>>> > >>
> > >>>> > >> If there are problems with cmake, please create an issue on gitlab,
> > >>>> > >> and mention .multiplemonomials to get Jamie's attention. Please try
> > >>>> > >> to avoid the syndrome of saying "I can get this to work with
> > >>>> > >> configure, and I'm too busy right now to do anything else."
> > >>>> > >>
> > >>>> > >> I have removed the documentation for the configure process in the
> > >>>> > >> Amber21 Reference Manual, although the files are still present. We
> > >>>> > >> can't continue to support and test two separate build systems, each
> > >>>> > >> with their own bugs.
> > >>>> > >>
> > >>>> > >> ...thx...dac
> > >>>> > >>
> > >>>> > >>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sun Apr 04 2021 - 14:00:02 PDT