Re: [AMBER-Developers] weird behavior for pmemd.cuda on Volta cards

From: David Cerutti <dscerutti.gmail.com>
Date: Sat, 9 Dec 2017 15:53:06 -0500

One thing to add here, for anyone benchmarking on Volta: we've found an
issue with my latest optimizations. Something in the non-bonded kernel is
vomiting registers, in Scott's lingo, and the spills are going out to
global memory. As a result, the new non-bonded kernel, which is 15-20%
faster on architectures up through GP100, runs 20% slower on Volta. Ke Li
at NVIDIA is kindly looking into the problem. If he and I can fix it, the
non-bonded kernel will carry Volta to well over 1000 ns/day on JAC and
deliver the improvement margins that other codes have been seeing.
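
For anyone who wants to look for the same symptom in their own kernels,
the spills show up at compile time in the ptxas statistics: building with
nvcc -Xptxas -v prints a "bytes spill stores / spill loads" line per
kernel. The toy below is NOT the pmemd.cuda kernel, just a minimal sketch
of how __launch_bounds__ caps registers per thread and can force spills:

    /* spillcheck.cu -- toy kernel, not the real non-bonded kernel.
       Build: nvcc -arch=sm_70 -Xptxas -v spillcheck.cu
       Nonzero "spill stores/loads" in the ptxas output mean registers
       overflowed into local memory, which physically resides in device
       (global) memory. */
    #include <cstdio>

    // 256 threads/block at >= 4 resident blocks/SM: Volta has 64K
    // registers per SM, so each thread is capped at 64 registers and
    // a kernel that wants more will spill.
    __global__ void __launch_bounds__(256, 4)
    toyNonbonded(float4 *frc, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      float4 f = frc[i];
      // ... a register-hungry inner loop would go here ...
      frc[i] = f;
    }

    int main()
    {
      // At run time, spills also surface as extra local-memory traffic
      // and often as reduced occupancy; this prints the blocks/SM that
      // the launch bounds actually bought us.
      int blocksPerSM;
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                    toyNonbonded, 256, 0);
      printf("resident blocks/SM at 256 threads: %d\n", blocksPerSM);
      return 0;
    }

Nonzero spill counts in the ptxas report are the smoking gun; lower
achieved occupancy and extra local-memory traffic are the run-time tells.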

Dave


On Sat, Dec 9, 2017 at 1:30 PM, Scott Le Grand <varelse2005.gmail.com>
wrote:

> Also, might I add that I think deep learning is the worst thing that has
> ever happened to GPUs. We are now wasting transistors on otherwise
> useless math modes like fp16, getting hounded by NVIDIA for using
> consumer GPUs for science, and the bang per buck for GPUs now decreases
> with every generation. Help us, Obi-Wan anyone-but-NVIDIA, you're our
> only hope.
>
> On Dec 9, 2017 9:33 AM, "Ross Walker" <ross.rosswalker.co.uk> wrote:
>
> > Hi Dave
> >
> > The published benchmarks for GTX cards are with base clocks. I suspect
> > Adrian's GTX-1080Tis are overclocked models, which would explain it. I
> > always publish benchmarks based on base clocks, since I think that
> > gives a fairer representation of the performance our end users should
> > expect. That is, the benchmark numbers should represent a floor on
> > performance. Our users are thus happy when they see better performance
> > than what is published. :-)
> >
> > BTW Dave, here's something you will likely rediscover in your work
> > that Scott and I originally found around 2013. The temperature and
> > power caps (they are independent) are a real pain in the butt for
> > development. Numerous times we put in optimizations that boosted
> > throughput, only to find when running extended benchmarks that the
> > code was actually 'too efficient' for the GPU: it drew more power or
> > ran hotter, which ultimately caused the card to clock down and, in
> > many cases, actually run slower than the 'less efficient' code. You
> > pretty much need to keep a box running flat out and at production
> > temperature while doing development work, swapping your benchmarks
> > for the optimized code in and out, and always running at least 5
> > minutes to get a reliable performance number. Trust me, you will
> > pull out lots of hair chasing the damn boost clock - it is super
> > annoying.
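> >
> > A minimal side-channel monitor makes that clock sag visible in a log.
> > Below is just a sketch (hypothetical program, but the NVML calls are
> > real) that polls the SM clock, power draw, and temperature once a
> > second while the benchmark runs; build it with
> > nvcc clockwatch.cu -o clockwatch -lnvidia-ml:
> >
> >     /* clockwatch.cu -- sketch of a benchmark-time GPU monitor. */
> >     #include <cstdio>
> >     #include <unistd.h>
> >     #include <nvml.h>
> >
> >     int main()
> >     {
> >       nvmlDevice_t dev;
> >       unsigned int mhz, mw, degc;
> >
> >       if (nvmlInit() != NVML_SUCCESS) {
> >         fprintf(stderr, "NVML init failed\n");
> >         return 1;
> >       }
> >       nvmlDeviceGetHandleByIndex(0, &dev);   // GPU 0; adjust to taste
> >       for (;;) {                             // run until killed
> >         nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &mhz);
> >         nvmlDeviceGetPowerUsage(dev, &mw);   // reported in milliwatts
> >         nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &degc);
> >         printf("%4u MHz  %6.1f W  %3u C\n", mhz, mw / 1000.0, degc);
> >         fflush(stdout);
> >         sleep(1);
> >       }
> >       return 0;
> >     }
> >
> > Leave it running in a second terminal for the whole benchmark window;
> > if the MHz column droops while the wattage pins at the cap, you are
> > looking at the throttle, not at your optimization.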
> >
> > Then count yourself lucky that you are running on one GPU. Try to
> > optimize code that runs across multiple GPUs on multiple nodes, where
> > they are all in different states of boost clock, and you will discover
> > what insanity feels like.
> >
> > All the best
> > Ross
> >
> > > On Dec 8, 2017, at 15:23, David Cerutti <dscerutti.gmail.com> wrote:
> > >
> > > One thing that's struck me about Adrian's benchmarks is that they
> > > consistently seem to beat the published values. Back in August,
> > > Adrian's crew was getting 698 ns/day on JAC 4fs NVE with Amber16 on
> > > a GTX-1080Ti, where the website published 624. He got 765 with my
> > > improvements as of that date, so I think he should now be pushing
> > > 800 on that card. We'll see. Once we figure out what's going on
> > > with our Volta, I think our next steps on the code will become
> > > clear.
> > >
> > > On Fri, Dec 8, 2017 at 3:00 PM, David A Case <david.case.rutgers.edu>
> > > wrote:
> > >
> > >> On Fri, Dec 08, 2017, Dan Roe wrote:
> > >>>
> > >>> One simple way to test this: when you notice the slowdown, run
> > >>> 'nvidia-smi' and note the performance state. If it's anything but
> > >>> P0, you're being throttled. I believe this can also happen due to
> > >>> too much power consumption.
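> > >>
> > >> The same check is scriptable through NVML. Here is a rough probe
> > >> (hypothetical file name; the NVML calls are real) that prints the
> > >> P-state and, on recent drivers, why the clocks are being held down.
> > >> Build with nvcc pstate.cu -o pstate -lnvidia-ml:
> > >>
> > >>     /* pstate.cu -- mirrors the 'nvidia-smi' perf-state check. */
> > >>     #include <cstdio>
> > >>     #include <nvml.h>
> > >>
> > >>     int main()
> > >>     {
> > >>       nvmlDevice_t dev;
> > >>       nvmlPstates_t ps;
> > >>       unsigned long long why;
> > >>
> > >>       nvmlInit();
> > >>       nvmlDeviceGetHandleByIndex(0, &dev);
> > >>       nvmlDeviceGetPerformanceState(dev, &ps);  // P0 = full speed
> > >>       printf("perf state: P%d\n", (int)ps);
> > >>       if (nvmlDeviceGetCurrentClocksThrottleReasons(dev, &why)
> > >>           == NVML_SUCCESS) {
> > >>         if (why & nvmlClocksThrottleReasonSwPowerCap)
> > >>           printf("throttled: software power cap\n");
> > >>         if (why & nvmlClocksThrottleReasonHwSlowdown)
> > >>           printf("throttled: hardware slowdown (heat or power)\n");
> > >>       }
> > >>       nvmlShutdown();
> > >>       return 0;
> > >>     }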
> > >>
> > >> The Perf value stays at P0, but I notice that the power usage declines
> > >> from 139W to 103W, then levels out at about 120W. Cap is at 250W.
> > >> Temperature never changes from 82C.
> > >>
> > >> It's still not clear to me that we actually have an identical software
> > >> environment to what Adrian has. But I admit it looks more like
> > >> hardware....
> > >>
> > >> ....dac
> > >>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Sat Dec 09 2017 - 13:00:02 PST