On Wed, Mar 26, 2014 at 1:23 PM, B. Lachele Foley <lfoley.ccrc.uga.edu> wrote:
> Some of your energy/NAN issues might be code-related.  We've seen that
> sort of behavior, and it isn't the GPU (best we can tell).  There is
> extensive discussion of the issue in a bug report.  The problem seems to
> preferentially occur when simulating smaller systems.  But, I think the
> precise cause(s) has/have not been determined (Scott?).  So, you might
> consider not using results like that to determine whether a GPU is bad.
>
Since the OP said that the NaNs were seen while running the GPU validation
suite, I don't think they are code-related at all; the problems are most
likely with the hardware. I think the real issue is how to detect when your
hardware is failing. You should certainly validate your GPUs when they are
first installed, but I don't know of a good way to detect GPUs that start
failing later on, besides periodically re-running the validations (and/or
noticing the occasional crazy result). Also, I think certain GPUs have
proven to be more reliable than others over time (GPU gurus may want to
expand on this). As Ray said, providing adequate cooling for the GPUs is
probably a must when it comes to extending their life.
The GPU code is well-written, incredibly fast, and powerful, but the
results require careful validation just like any other results.
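
As a rough illustration of what "periodically re-running the validations"
could look like in practice, here is a minimal sketch. It assumes you have
already collected the final energies from several repeats of the same short
run (same inputs, same random seed) on one GPU, and that those repeats are
expected to reproduce each other. The function name, the tolerance parameter,
and the example numbers are all hypothetical; this is not part of any
validation suite.

```python
import math


def check_repeat_energies(energies, tol=0.0):
    """Report problems in final energies from repeated, nominally identical runs.

    energies: list of floats, one per repeat (hypothetical input format).
    tol: allowed absolute difference from the first finite energy
         (assumption: 0.0 means the repeats must match exactly).
    """
    problems = []

    # Flag NaN or infinite energies outright.
    for i, e in enumerate(energies):
        if not math.isfinite(e):
            problems.append(f"run {i}: non-finite energy {e}")

    # Compare every finite energy against the first finite one.
    finite = [e for e in energies if math.isfinite(e)]
    if finite:
        ref = finite[0]
        for i, e in enumerate(energies):
            if math.isfinite(e) and abs(e - ref) > tol:
                problems.append(f"run {i}: energy {e} differs from reference {ref}")

    return problems


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    for line in check_repeat_energies([-58229.3134, float("nan"), -58229.3134]):
        print(line)
```

An empty list means the repeats agreed and nothing was non-finite. Anything
else is a reason to look harder at that card (cooling, clocks, etc.) before
trusting production results from it.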
-Dan
-- 
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 201
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Mar 26 2014 - 13:00:02 PDT