Re: [AMBER-Developers] pmemd.cuda SpeedBoost branch

From: David Cerutti <dscerutti.gmail.com>
Date: Mon, 29 May 2017 22:21:15 -0400

I have updated the branch with another small improvement in the speed but
what I think will be a larger improvement in other respects: shared memory
force accumulators in standard PME have been removed in favor of keeping
all force accumulators in registers, and a great deal of the branching in
the inner loop has been removed to support the change. This means that
anyone with something to do in the direct space sum will have a lot more
__shared__ to work with should they need it, and we can up the thread
counts per block if on new hardware the number of cores grows faster than
the on-chip memory. There is no additional change to the precision model
or the order of accumulation, but the code is somewhat simplified by the
move.
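
Schematically, the change looks like the fragment below.  This is an
uncompiled CUDA sketch with my own variable names, not an excerpt from
pmemd.cuda; it only shows the shape of the before and after.

```cuda
// Before (sketch): per-thread force accumulators lived in __shared__
// arrays, competing with everything else for on-chip memory.
__shared__ PMEAccumulator sForceX[THREADS_PER_BLOCK];  // hypothetical name

// After (sketch): each thread keeps its accumulators in registers, so
// __shared__ is freed up for whatever else the direct space sum needs.
__global__ void kDirectSpaceSum(/* ... */)
{
  float fx = 0.0f, fy = 0.0f, fz = 0.0f;  // register accumulators
  // ... inner loop adds pair forces into fx, fy, fz ...
}
```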

Support for architectures below Kepler has also been dropped: anything
found inside a pre-processor directive like __CUDA_ARCH__ < 300 has been
deleted, and other pre-processor directives designed to ensure
__CUDA_ARCH__ >= 300 have been removed, as that condition is now standard.
This has removed lots of pre-processor branches and several thousand lines
of code. I will probably go after macros that were written out of a need
to support the early CUDA architectures and the modern stuff simultaneously
in order to make it clear to newcomers what the inner loops are doing.
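
The kind of branch that disappears looks like this; it is a generic
example of the pattern, not a literal excerpt from the code.

```cuda
#if (__CUDA_ARCH__ >= 300)
  // Kepler and later: warp shuffle, no scratch memory needed.
  val += __shfl(val, threadIdx.x + 8);
#else
  // Pre-Kepler fallback, now deleted: exchange through __shared__.
  sExchange[threadIdx.x] = val;
  val += sExchange[threadIdx.x ^ 8];
#endif
```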

Volta, more precisely it seems CUDA 9.0, is shaping up to be a wild ride.
My present understanding is that CUDA 9.0 will be deployable all the way
back to Kepler, and that it will remove implicit warp synchronization, a
guarantee we learned to treat as bedrock and build code on.  Volta will
require CUDA 9.0 to compile and run code, so losing warp synchronization,
in exchange for the option of explicitly locking groups of 2, 4, 8, 16,
or 32 threads together, will break our current code and provide
huge opportunities to improve it once we rebuild. It looks like the
__shfl() functions I've added will transition decently to the new built-in
__shfl_sync() as part of the effort to get the code to run safely, but it's
conceivable we should rethink the algorithm.

Dave


On Sat, May 20, 2017 at 9:57 AM, David Cerutti <dscerutti.gmail.com> wrote:

> Dear Amber Devs,
>
>
> I’ve checked in a new branch, “SpeedBoost,” which contains new tweaks to
> pmemd.cuda intended to help us stay on top of the academic world and get
> back on pace with D. E. Shaw’s GPU code. Something the Shaw group did in
> 2015 helped their code to run about 30% faster than ours, even after
> counting the multiple timesteps they do in the reciprocal space part. I
> think I see ways to get that much more out of pmemd.cuda, and the tweaks
> I’ve made so far give a boost of 10-14%.
>
>
> The first tweak I’ve made is to use an adaptively indexed spline to
> compute the electrostatic direct space derivatives. Because I can
> calculate one number directly, rather than relying on a series of further
> adds, mults, and exp() computations, I can tweak the last significant bits
> of the spline coefficients to get better results out of 32-bit floats. I’ve
> been able to exceed the current precision in the electrostatic derivatives
> by a factor of 5 or 6, so the splines are both faster and more accurate.
>
>
> The other major thing I’ve done so far is to change the way force, energy,
> and virial accumulations are handled in the SPFP direct space routines. Rather
> than immediately up-casting each floating point force or energy
> contribution to a long long int, I accumulate up to 16 force contributions
> from a single thread into 32 bit floats, then upcast the sum before sending
> it to whatever accumulator is needed. This removes a lot of float to int
> conversions and saves a little more run time. The determinism remains,
> because the order in which each number is accumulated on its own warp is
> preserved, and integer conversions still take place before pooling the
> results of multiple warps. There is no danger of breaking a 32-bit
> integer format with this approach: it goes directly from 32-bit float to
> 64-bit int. However, if there is a very big force, it will wipe out the
> last few significant bits of 15 other forces, not just its own contribution.
> I’ve estimated the loss of precision involved and it is beneath the level
> of error I referenced in an earlier message about having our coordinates in
> 32-bit precision.
>
>
> By itself, the change in force accumulation probably wouldn’t be worth any
> added uncertainty, but it sets up for more powerful changes to the way
> warps process the pair interactions. I am confident that the changes I
> have made are safe, but they need to be thoroughly tested. Please check
> out the branch and give it a try; you should see a decent speedup now, and
> I hope to check in further improvements shortly.
>
>
> Dave
>
>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Mon May 29 2017 - 19:30:03 PDT