Re: [AMBER-Developers] [AMBER] manual prmtop file editing for free energy calculation from Scott Le Grand on 2012-05-30 (Amber Developers Archive May 2012)

From: Scott Le Grand <varelse2005.gmail.com>
Date: Wed, 30 May 2012 21:56:55 -0700

Yes it's a cache issue. Cache space (also known as shared memory) is
precious and scarce...

Tell NVIDIA to raise it from 48K where it's been since 2009 to 128K and
we're back in business (dropping SM 2.0 and SM 3.0 support in the process
but alas)...

Scott

On Wed, May 30, 2012 at 9:53 PM, Jason Swails <jason.swails.gmail.com> wrote:

> On Wed, May 30, 2012 at 11:37 PM, case <case.biomaps.rutgers.edu> wrote:
>
> > On Wed, May 30, 2012, Scott Le Grand wrote:
> > >
> > > However, pmemd.cuda attains its performance boost by running entirely
> > > out of cache. With the old approach, 8 * O(m^2) + 4 bytes per atom in the
> > > cache is consumed as opposed to 8 bytes per atom for storing per atom.
> >
> > I'm still lost. Don't you need both cn1 and cn2 (or r*, eps)
> > for each atom (giving 16 bytes per atom), vs. 4 bytes per atom for the
> > table lookup scheme (plus 8 * O(m^2) for the table itself)?
> >
>
> Scott and I traded a number of emails when I learned how GPUs did the VDW
> calculations, and here is my understanding from that exchange. (And if
> Scott corrects me here, I get to learn a little more.)
>
> To do a simple table look-up like the CPU code does, you would need to
> store the CN1 and CN2 tables in shared memory (presumably so every
> block/thread could access those values when the kernel executes), but there
> is little to no shared memory left (limiting us to ~20-30 types if we want to
> continue supporting C1060s). If we strive to keep CN1/CN2 out of shared
> memory, then each atom would need to store the full CN1/CN2 tables, as well
> as its integer index, which is significantly more than just sigma and
> epsilon, or we suffer severe performance penalties.
>
> Therefore, it's not the sheer amount of memory, but rather a particular
> flavor of memory that is saved here (namely, shared memory, which is quite
> limited on GPUs as I understand it).
>
> This is the only way it makes sense to me since the table lookup scheme
> takes considerably less memory given that NTYPES tends to max out at ~20-30
> types.
>
> All the best,
> Jason
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Candidate
> 352-392-4032
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
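
For anyone who wants to put numbers on the byte counts discussed above, here is a rough back-of-the-envelope sketch. It is not pmemd.cuda code. It assumes single-precision storage (CN1 + CN2 = 8 bytes per type pair, sigma + epsilon = 8 bytes per atom), a 4-byte type index per atom, and a full m x m table; all function names are hypothetical. The 48K budget is the figure quoted at the top of this message, and the 4K "leftover" budget is purely illustrative, not a measured value.

```python
import math

# Assumed sizes (single precision); not taken from the pmemd.cuda source.
BYTES_PER_PAIR = 8         # CN1 + CN2, two 4-byte floats per type pair
BYTES_PER_INDEX = 4        # one 4-byte integer type index per atom
BYTES_PER_ATOM_PARAMS = 8  # sigma + epsilon, two 4-byte floats per atom


def table_bytes(n_types):
    """Size of a full n_types x n_types CN1/CN2 table."""
    return BYTES_PER_PAIR * n_types * n_types


def lookup_scheme_bytes(n_atoms, n_types):
    """Table lookup scheme: shared table plus one index per atom."""
    return table_bytes(n_types) + BYTES_PER_INDEX * n_atoms


def per_atom_scheme_bytes(n_atoms):
    """Per-atom scheme: sigma and epsilon stored with every atom."""
    return BYTES_PER_ATOM_PARAMS * n_atoms


def max_types_in_budget(budget_bytes):
    """Largest m such that 8 * m^2 bytes fits in the given budget."""
    return math.isqrt(budget_bytes // BYTES_PER_PAIR)


if __name__ == "__main__":
    n_atoms, n_types = 25000, 30  # illustrative system size
    print("table alone:      ", table_bytes(n_types), "bytes")
    print("lookup scheme:    ", lookup_scheme_bytes(n_atoms, n_types), "bytes")
    print("per-atom scheme:  ", per_atom_scheme_bytes(n_atoms), "bytes")
    print("types in 48K:     ", max_types_in_budget(48 * 1024))
    print("types in 4K spare:", max_types_in_budget(4 * 1024))
```

Under these assumptions the lookup scheme uses less memory in total, as Jason says, but the table is the part that has to live in shared memory, and its size grows with the square of the number of types.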
Received on Wed May 30 2012 - 22:00:04 PDT