On Wed, May 30, 2012 at 11:37 PM, case <case.biomaps.rutgers.edu> wrote:
> On Wed, May 30, 2012, Scott Le Grand wrote:
> >
> > However, pmemd.cuda attains its performance boost by running entirely out
> > of cache. With the old approach, 8 * O(m^2) + 4 bytes per atom in the
> > cache is consumed, as opposed to 8 bytes per atom for storing sigma
> > and epsilon.
>
> I'm still lost. Don't you need both cn1 and cn2 (or r*, eps)
> for each atom (giving 16 bytes per atom), vs. 4 bytes per atom for the
> table lookup scheme (plus 8 * O(m^2) for the table itself)?
>
Scott and I traded a number of emails when I was learning how the GPU code
does the VDW calculations, and here is my understanding from that exchange.
(And if Scott corrects me here, I get to learn a little more.)
To do a simple table look-up like the CPU code does, you would need to
store the CN1 and CN2 tables in shared memory (presumably so every
block/thread can access those values while the kernel executes), but there
is little to no shared memory to spare (which limits us to ~20-30 atom
types if we want to keep supporting C1060s). If we instead try to keep
CN1/CN2 out of shared memory, then each atom would need to carry the full
CN1/CN2 tables along with its integer type index (significantly more than
just sigma and epsilon), or we suffer severe performance penalties fetching
the tables from slower memory.
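Purely to illustrate what I mean, here is a minimal sketch of the table
look-up variant; the names (MAX_TYPES, lj_table_kernel, etc.) are my own
inventions and this is not pmemd.cuda source, but it shows why the tables
scale as ntypes^2 and why ~30 types already eats most of a C1060's 16 KB of
shared memory per block:

#define MAX_TYPES 30   /* 2 tables * 30*30 * 8 bytes = 14.4 KB of the
                          16 KB of shared memory per block on a C1060,
                          assuming double-precision tables */

__global__ void lj_table_kernel(const double *cn1, const double *cn2,
                                const int *atom_type, const double *r2inv,
                                int ntypes, int npairs, const int2 *pairs,
                                double *evdw)
{
    __shared__ double s_cn1[MAX_TYPES * MAX_TYPES];
    __shared__ double s_cn2[MAX_TYPES * MAX_TYPES];

    /* Cooperatively stage the full CN1/CN2 tables in shared memory. */
    for (int k = threadIdx.x; k < ntypes * ntypes; k += blockDim.x) {
        s_cn1[k] = cn1[k];
        s_cn2[k] = cn2[k];
    }
    __syncthreads();

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < npairs) {
        int i = pairs[p].x;
        int j = pairs[p].y;
        /* Each atom carries only a 4-byte type index; the 2D look-up
           itself hits the shared-memory copies of the tables. */
        int idx = atom_type[i] * ntypes + atom_type[j];
        double r6inv = r2inv[p] * r2inv[p] * r2inv[p];
        evdw[p] = s_cn1[idx] * r6inv * r6inv - s_cn2[idx] * r6inv;
    }
}

(I am handing the kernel a precomputed pair list and 1/r^2 values just to
keep the sketch short; the real kernels obviously do far more than this.)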
Therefore, it's not the sheer amount of memory, but rather a particular
flavor of memory that is saved here (namely, shared memory, which is quite
limited on GPUs as I understand it).
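The alternative, as I understand Scott's description, keeps only two
parameters per atom (8 bytes total) and rebuilds the pair coefficients on
the fly, so there is no ntypes^2 table and no shared memory is needed at
all. A sketch of that idea, again with made-up names, written in the
generic sigma/epsilon form with Lorentz-Berthelot combining (the real code
presumably works with r*/eps, but the memory picture is the same):

__global__ void lj_sigeps_kernel(const float *sigma, const float *epsilon,
                                 const float *r2inv, int npairs,
                                 const int2 *pairs, float *evdw)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < npairs) {
        int i = pairs[p].x;
        int j = pairs[p].y;
        /* Combine the per-atom parameters instead of looking up CN1/CN2. */
        float sig  = 0.5f * (sigma[i] + sigma[j]);   /* arithmetic mean */
        float eps  = sqrtf(epsilon[i] * epsilon[j]); /* geometric mean  */
        float sig2 = sig * sig * r2inv[p];           /* (sigma/r)^2     */
        float sig6 = sig2 * sig2 * sig2;
        evdw[p] = 4.0f * eps * (sig6 * sig6 - sig6);
    }
}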
This is the only way it makes sense to me, since the table lookup scheme
actually takes considerably less total memory, given that NTYPES tends to
max out at ~20-30 types.
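To put rough numbers on that (my own back-of-envelope arithmetic, assuming
double-precision tables and two single-precision parameters per atom): with
30 types the CN1/CN2 tables take 2 * 30 * 30 * 8 = 14,400 bytes, plus a
4-byte type index per atom, so a 100,000-atom system needs roughly 0.4 MB
in total, while storing sigma and epsilon per atom at 8 bytes apiece needs
about 0.8 MB. The per-atom scheme costs around twice the total memory, but
none of it has to squeeze into the ~16 KB of shared memory per block.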
All the best,
Jason
--
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers