Re: [AMBER-Developers] Code review of pmemd.cuda

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 17 Apr 2017 18:57:37 -0400

> Future plans:
> - Implement and test a lookup table for erfc(r)/r based on the base 2
> logarithm of r2... it would work very much like pmemd CPU at that point,
> but with a slightly different indexing system. One of those things that
> just has to be tried to see whether it makes things faster but if lookups
> from texture memory aren't terribly expensive it would maintain the same
> accuracy and save exp() and exp2f() in the inner loop.

I'll be interested to see how this compares on both performance and precision. Long ago we played with this, and explicitly calculating erf and erfc was far faster on the GPU than using a table, although things may have changed. The problem was having enough fast memory for the table. Texture memory in the Fermi days was flaky, so it couldn't be used - the NODPTEXTURE entries in the code were a hack to work around texture fetches that would occasionally return garbage. If texture memory works reliably these days, storing the table there might end up being faster. There isn't enough shared memory, unfortunately; at the time we had to fall back to global memory for the table, and the performance was well below explicit computation. Things may have changed with Pascal, though, so it might be worth evaluating there and on Volta. It will need protection so it doesn't hobble Kepler and Maxwell GPUs.
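For anyone who wants to prototype the precision side before touching the CUDA code, here's a quick host-side sketch (Python) of a log2(r^2)-indexed erfc(r)/r table with linear interpolation. The Ewald coefficient, table range, and bin count are made-up illustrative values, not pmemd's actual parameters - and on the GPU the log2 would of course come from the float exponent bits or a texture fetch rather than a library call:

```python
import math

# Table of f(r) = erfc(beta*r)/r sampled uniformly in log2(r^2).
# All parameters below are illustrative, not pmemd's actual values.
BETA = 0.349          # assumed Ewald coefficient, 1/Angstrom
LOG2_R2_MIN = -4.0    # table covers r^2 in [2^-4, 2^7], i.e. r in [0.25, ~11.3]
LOG2_R2_MAX = 7.0
N_BINS = 4096

def erfc_over_r(r):
    """Direct evaluation, the quantity the table approximates."""
    return math.erfc(BETA * r) / r

STEP = (LOG2_R2_MAX - LOG2_R2_MIN) / N_BINS
TABLE = [erfc_over_r(math.sqrt(2.0 ** (LOG2_R2_MIN + i * STEP)))
         for i in range(N_BINS + 1)]

def erfc_over_r_table(r2):
    """Look up erfc(beta*r)/r from r^2 via log2 indexing + linear interpolation."""
    x = (math.log2(r2) - LOG2_R2_MIN) / STEP
    i = int(x)
    frac = x - i
    return TABLE[i] * (1.0 - frac) + TABLE[i + 1] * frac
```

With 4096 bins the relative error stays comfortably below 1e-4 across the table range, so the question really is whether the texture fetch beats the explicit erfc, not whether the accuracy holds up.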

Remember that anything you add must also work on GeForce - we mustn't add anything that locks us to Tesla.

> - Implement an alternate pipeline for doing the dynamics calculation based
> on the gemstones domain decomposition. A sorting method that keeps atoms
> in both linear and local neighborhood formats, updated at every step, will
> be central to this approach. Developers will then be able to choose
> whether to work with the linear list of atoms or the domain decomposition
> to implement new methods. The code will include numerous stencils for
> importing groups of atoms into __shared__ memory which developers can use
> to implement new energy functions or methods. Once things are in
> __shared__ it's much easier for the uninitiated to write efficient GPU
> code, so the trick is to make an apparatus for getting there.

This could be useful 'if' we have enough shared memory. It's 96K or so on Pascal, and smaller on earlier GPUs. No idea yet what it will be on Volta - I'd recommend getting NVIDIA to confirm that before programming for it specifically (it might take a lot of arm twisting!), and to confirm what it will be on the GeForce silicon as well. Note that I'm pretty sure we are already almost entirely out of shared memory, so I'm not sure how much more you'll be able to fit there. Approaches that reduce the shared memory footprint without hurting performance would be very useful for adding more features etc.
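As a back-of-envelope check on what an atom-import stencil could actually hold, here's the arithmetic (every figure is an assumption for illustration - the 96K is the Pascal-class number mentioned above, and the reservation and occupancy targets are guesses, not measured from the current kernels):

```python
# Back-of-envelope shared memory budget for an atom-import stencil.
# All figures are assumptions for illustration, not measured values.
SHARED_BYTES = 96 * 1024     # per-SM shared memory on a Pascal-class part (assumed)
BLOCKS_PER_SM = 2            # desired resident blocks per SM (assumed)
RESERVED_BYTES = 16 * 1024   # shared memory already claimed by existing code (assumed)

def atoms_per_block(bytes_per_atom):
    """How many imported atoms fit in the remaining shared memory per block."""
    budget = (SHARED_BYTES - RESERVED_BYTES) // BLOCKS_PER_SM
    return budget // bytes_per_atom

# e.g. x, y, z, q as 32-bit floats plus a 32-bit type index = 20 bytes/atom
print(atoms_per_block(20))  # -> 2048
```

So a few thousand lightweight atoms per block at best, and that shrinks fast once forces or extra per-atom data are staged too - which is why anything that trims the existing footprint matters.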

All the best
Ross
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Mon Apr 17 2017 - 16:00:04 PDT