Re: [AMBER-Developers] CMake in Amber

From: David Cerutti <dscerutti.gmail.com>
Date: Sun, 4 Apr 2021 16:17:36 -0400

My suggestion on the JAC 2fs benchmark would be to look at the temperature
of that system: it's up around 400 K, which will break the pairlist much
more frequently and isn't really safe for 2fs NVE in the first place. I'm
with you on the idea of looking at small systems to find better ways to
distribute work over many cores and more SMs (one metric might even be the
overall amount of L1 cache that comes with those expanding arrays of
multiprocessors). But make sure to have a good handle on the science and
what people want to do with these calculations before committing to raw
speed on an ordinary, small system.
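
(For concreteness, the pairlist connection is just the standard skin test:
the list stays valid only while no atom has moved more than half the skin
since the last build, and hotter atoms trip that limit sooner. A minimal
sketch of such a check follows; it is generic, not pmemd's actual kernel,
and the names and the single rebuild flag are illustrative.)

  // Generic neighbor-list skin test: flag a rebuild as soon as any atom
  // has drifted more than half the skin since the list was last built.
  // Hotter systems drift faster, so this fires more often.
  __global__ void checkPairlistValid(const float4 *crd,
                                     const float4 *crdAtListBuild,
                                     int nAtoms, float halfSkin,
                                     int *rebuildFlag)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;
    float dx = crd[i].x - crdAtListBuild[i].x;
    float dy = crd[i].y - crdAtListBuild[i].y;
    float dz = crd[i].z - crdAtListBuild[i].z;
    if (dx*dx + dy*dy + dz*dz > halfSkin * halfSkin)
      atomicOr(rebuildFlag, 1);   // any single violation forces a rebuild
  }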

If you're building pmemd 2.0, is it just the CUDA part, or will the
underlying Fortran code get refactored and renovated as well? My feeling is
that a professional programmer could recode pmemd in C++ with added
features to read any number of topologies and run multiple systems at once,
coupled only as much as the problem at hand wants them to be (or not). Four
copies of JAC running in one executable would probably run at 2x the
original throughput, especially if you can batch the FFTs, and even smaller
systems like a 10-15k atom hydration free energy calculation could run 4-8
batched calculations at more than 4x the original throughput. The free
energy calculations and most other things people really hope to do with MD
require completing multiple related trajectories, not one flat-out
equilibrium run.
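
To make the FFT batching concrete, cuFFT can already transform several
same-sized charge grids in one call through its batched plan interface.
Here is a minimal sketch, with a made-up grid size and batch count and
plain C2C transforms; it is an illustration of the idea, not anything
resembling pmemd's actual PME code:

  #include <cuda_runtime.h>
  #include <cufft.h>

  int main()
  {
    const int nx = 64, ny = 64, nz = 64;   // PME grid per system (illustrative)
    const int batch = 4;                   // e.g. four copies of JAC
    const int gridPts = nx * ny * nz;

    cufftComplex *grids;                   // all four grids, back to back
    cudaMalloc((void**)&grids, sizeof(cufftComplex) * gridPts * batch);

    // One plan covers every system, so the transforms launch together
    // instead of as four separate 64^3 FFTs that each underfill the GPU.
    cufftHandle plan;
    int n[3] = {nx, ny, nz};
    cufftPlanMany(&plan, 3, n,
                  NULL, 1, gridPts,        // input layout: contiguous grids
                  NULL, 1, gridPts,        // output layout: the same
                  CUFFT_C2C, batch);

    cufftExecC2C(plan, grids, grids, CUFFT_FORWARD);   // all systems at once

    cufftDestroy(plan);
    cudaFree(grids);
    return 0;
  }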

But if you can keep up the 64-bit accumulation and maintain the perf,
that's some very impressive numerical acrobatics. It will be interesting to
see the results, as there are classes of calculations that require double
precision in many more places than standard MD. Also, I'm cheering for you
if you can find a use for these fancy new 32-bit tensor cores. Look at the
mdgx GB code--that uses one warp to do a 16x16 tile, eight rounds of 32
pair calculations at a time, reading the x atoms into the first 16 lanes
and the y atoms into the second 16 lanes. It may help you in the PME
refactor, or get you to that magic number 16 for feeding into whatever
tensor core operations.
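
If it helps, the lane bookkeeping in that warp-per-tile pattern looks
roughly like the sketch below. This only illustrates the layout--lanes 0-15
holding the x atoms, lanes 16-31 the y atoms, eight shuffle rounds covering
all 256 pairs--and the pair term is a placeholder, not the actual mdgx GB
math:

  __device__ float tilePairSum(float3 myAtom)
  {
    const int lane = threadIdx.x & 31;
    const bool isX = (lane < 16);          // lanes 0-15 cache the x atoms
    const int idxInHalf = isX ? lane : lane - 16;
    float acc = 0.0f;

    for (int round = 0; round < 8; ++round) {
      // The x half sweeps partner offsets 0-7 and the y half sweeps 8-15,
      // so every (x, y) pair in the 16x16 tile is evaluated exactly once:
      // 8 rounds x 32 lanes = 256 pair calculations.
      int offset      = isX ? round : round + 8;
      int partnerIdx  = isX ? (idxInHalf + offset) & 15
                            : (idxInHalf - offset + 16) & 15;
      int partnerLane = isX ? partnerIdx + 16 : partnerIdx;

      float px = __shfl_sync(0xffffffffu, myAtom.x, partnerLane);
      float py = __shfl_sync(0xffffffffu, myAtom.y, partnerLane);
      float pz = __shfl_sync(0xffffffffu, myAtom.z, partnerLane);

      float dx = myAtom.x - px;
      float dy = myAtom.y - py;
      float dz = myAtom.z - pz;
      acc += rsqrtf(dx*dx + dy*dy + dz*dz);  // placeholder pairwise term
    }
    return acc;                              // partial sum for this lane's atom
  }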

Keep up the great work,
Dave


On Sun, Apr 4, 2021 at 4:02 PM Scott Le Grand <varelse2005.gmail.com> wrote:

> PPPS There should be no explicit limit on TI atoms whatsoever. AMBER should
> handle that under the hood to let the scientists science the &h!+ out of
> things, fight me on that.
>
> On Sun, Apr 4, 2021 at 12:58 PM Scott Le Grand <varelse2005.gmail.com>
> wrote:
>
> > PPS The future* would appear to be more cores of approximately the same
> > computational power as Ampere, not the same number of cores but beefier.
> > As such, we need to figure out how to distribute the same basis set of
> > calculations across more cores going forward. Doubly so now that we have
> > NVLINK, an interconnect that makes multi-GPU not suck.
> >
> > *A prediction pulled entirely from my Easter Bonnet(tm), based on the
> > progression from SM 5 to SM 8, and which should not at all be construed
> > as insider information because it's not.
> >
> > On Sun, Apr 4, 2021 at 12:50 PM Scott Le Grand <varelse2005.gmail.com>
> > wrote:
> >
> >> PS I'm killing off both TI paths and writing the path I wanted to
> >> write in the first place, one that exploits both the original
> >> Uber-kernels and Taisung's multi-streaming variant, incorporating
> >> Darren's and Taisung's improvements in the science whilst doing so.
> >> After those six impossible things or so, breakfast...
> >>
> >> On Sun, Apr 4, 2021 at 12:46 PM Scott Le Grand <varelse2005.gmail.com>
> >> wrote:
> >>
> >>> 1) Because that's the benchmark we've used since day one. Apples to
> >>> apples and all that. It's a relatively small system for a single GPU,
> >>> which is the perfect stand-in for large-system multi-GPU efficiency.
> >>> My goal is 4x scaling on 8 GPUs, with a positive scaling experience
> >>> beyond that, relaxing system size limits up to 1B atoms in the
> >>> process. If JAC gets faster, we can scale farther.
> >>> 2) Because the path to AMBER 20 broke multiple implicit assumptions in
> >>> my design* for AMBER, so I went back in time to change the future. All
> >>> relevant functionality will be restored over time, but I spent 6
> >>> months of 2020 trying to do exactly that before throwing my arms up in
> >>> utter frustration. The alternative is walking away from all the code
> >>> and starting a new framework.
> >>> 3) RTX3090**
> >>> 4) Remember, this is PMEMD 2.0 we're building here. It's been almost
> >>> 12 years; it's time to rewrite.
> >>>
> >>> But... Your local force code still shows an acceleration over the
> >>> original local force code even at full 64-bit accumulation, so that's
> >>> getting refactored along the way. Everything else so far is a perf
> >>> regression without the precision model changes, alas. But... you and I
> >>> have accidentally created a working variant of SPXP. Your stuff will
> >>> live again in its revival, and you get first authorship IMO because,
> >>> while it's great work, it's *not* SPFP with those precision changes in
> >>> place (an 18-bit mantissa? C'mon man(tm)...).
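
(For context on the precision models being compared here: SPFP-style
64-bit accumulation comes down to a fixed-point trick, sketched below with
an illustrative scale factor and hypothetical names rather than pmemd's
actual code. Forces computed in FP32 are scaled, rounded to 64-bit
integers, and summed with integer atomics, so the total is exact and
independent of the order in which contributions arrive.)

  #define FORCE_SCALE (1ll << 40)   // illustrative fixed-point scale factor

  __device__ void accumulateForceX(unsigned long long *frcAccX, int atom,
                                   float fx)
  {
    // Round to the nearest fixed-point value; casting to unsigned lets
    // two's-complement wraparound stand in for signed atomic addition.
    long long fixed = llrintf(fx * (float)FORCE_SCALE);
    atomicAdd(&frcAccX[atom], (unsigned long long)fixed);
  }

  __host__ __device__ double forceFromFixed(unsigned long long acc)
  {
    // Reinterpret the accumulated bits as signed and undo the scaling.
    return (double)((long long)acc) / (double)FORCE_SCALE;
  }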
> >>>
> >>> *I should have spelled them out, but even I couldn't predict a priori
> >>> the end of the CUDA Fellow program, which ended any support for
> >>> further work; but now my bosses have let me work on it again as my day
> >>> job.
> >>> **
> >>> https://www.exxactcorp.com/blog/Molecular-Dynamics/rtx3090-benchmarks-for-hpc-amber-a100-vs-rtx3080-vs-2080ti-vs-rtx6000
> >>>
> >>> On Sun, Apr 4, 2021 at 12:03 PM David Cerutti <dscerutti.gmail.com>
> >>> wrote:
> >>>
> >>>> "Meanwhile, AMBER16 refactored to SM 7 and beyond is already hitting
> 730
> >>>> ns/day on JAC NVE 2 fs. AMBER20 with the grid interpolation and local
> >>>> force
> >>>> precision sub FP32 force hacks removed hits 572 ns/day (down from 632
> if
> >>>> left in as we shipped it). That puts me nearly 1/3 to my goal of
> >>>> doubling
> >>>> overall AMBER performance which is what is important to me and where
> I'm
> >>>> going to focus my efforts..."
> >>>>
> >>>> Please explain here.
> >>>> 1.) Why are we back to using the old JAC NVE 2fs benchmark? The new
> >>>> benchmarks were redesigned several years ago to make more uniform
> >>>> tests and take settings that standard practitioners are now using.
> >>>> 2.) Why is Amber16 being refactored rather than Amber20?
> >>>> 3.) What does it mean to be hitting 730 ns/day? What card is being
> >>>> compared here--the Amber20 benchmarks look like they could be a V100,
> >>>> Titan-V, or perhaps an RTX-2080Ti.
> >>>>
> >>>>
> >>>> On Sun, Apr 4, 2021 at 12:11 PM Scott Le Grand <varelse2005.gmail.com>
> >>>> wrote:
> >>>>
> >>>> > But getting back on topic, CUDA 7.5 is a 2015 toolkit and SM 5.x
> >>>> > and below are deprecated now. SM 6 is a huge jump over SM 5
> >>>> > enabling true virtual memory and I suggest deprecating support for
> >>>> > SM 5 across the board. SM 7 and beyond alas mostly complicated warp
> >>>> > programming and introduced tensor cores which currently seem
> >>>> > useless for straight MD, but perfect for running AI models inline
> >>>> > with MD.
> >>>> >
> >>>> > CUDA 8 is a 2017 toolkit. That's way too soon to deprecate IMO and
> >>>> > if cmake has ish with it, that's a reason not to use cmake, not a
> >>>> > reason to deprecate CUDA 8.
> >>>> >
> >>>> >
> >>>> > On Sun, Apr 4, 2021 at 8:55 AM Scott Le Grand <varelse2005.gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> > > Ross sent me two screenshots of cmake losing its mind with an
> >>>> > > 11.x toolkit. I'll file an issue, but no, I'm not going to fix
> >>>> > > cmake issues myself at all. I'm open to someone convincing me
> >>>> > > cmake is better than the configure script, but no one has made
> >>>> > > that argument yet beyond "because cmake" and until that happens,
> >>>> > > that just doesn't work for me. Happy to continue helping with the
> >>>> > > build script that worked until convinced otherwise. Related: I
> >>>> > > still use nvprof, fight me.
> >>>> > >
> >>>> > > Meanwhile, AMBER16 refactored to SM 7 and beyond is already
> >>>> > > hitting 730 ns/day on JAC NVE 2 fs. AMBER20 with the grid
> >>>> > > interpolation and local force precision sub FP32 force hacks
> >>>> > > removed hits 572 ns/day (down from 632 if left in as we shipped
> >>>> > > it). That puts me nearly 1/3 to my goal of doubling overall AMBER
> >>>> > > performance which is what is important to me and where I'm going
> >>>> > > to focus my efforts as opposed to the new shiny build system that
> >>>> > > is getting better (and I *hate* cmake for cmake's sake), but we
> >>>> > > rushed it to production IMO like America reopened before the end
> >>>> > > of the pandemic.
> >>>> > >
> >>>> > > On Sun, Apr 4, 2021 at 5:51 AM David A Case <david.case.rutgers.edu>
> >>>> > > wrote:
> >>>> > >
> >>>> > >> On Sat, Apr 03, 2021, Scott Le Grand wrote:
> >>>> > >>
> >>>> > >> >cmake is still not quite ready for prime time disruption of
> >>>> > >> >configure. It's getting there though.
> >>>> > >>
> >>>> > >> If there are problems with cmake, please create an issue on
> >>>> > >> gitlab, and mention .multiplemonomials to get Jamie's attention.
> >>>> > >> Please try to avoid the syndrome of saying "I can get this to
> >>>> > >> work with configure, and I'm too busy right now to do anything
> >>>> > >> else."
> >>>> > >>
> >>>> > >> I have removed the documentation for the configure process in
> >>>> > >> the Amber21 Reference Manual, although the files are still
> >>>> > >> present. We can't continue to support and test two separate
> >>>> > >> build systems, each with their own bugs.
> >>>> > >>
> >>>> > >> ...thx...dac
> >>>> > >>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sun Apr 04 2021 - 13:30:02 PDT