Re: [AMBER-Developers] Reliability of GPU calculations

From: B. Lachele Foley <lfoley.ccrc.uga.edu>
Date: Wed, 26 Mar 2014 20:16:01 +0000

> Since the OP said that the NaNs were seen while running the GPU validation suite, I don't think they are code-related

Sure... I was responding to "extensive" and "how long is enough?" We do have an unresolved issue with similar behavior in the GPU code. That said, if they are running exactly the same, completely deterministic job and getting different behavior on different GPUs, it probably is hardware.

:-) Lachele

Dr. B. Lachele Foley
Complex Carbohydrate Research Center
The University of Georgia
Athens, GA USA
lfoley.uga.edu
http://glycam.ccrc.uga.edu

________________________________________
From: Daniel Roe <daniel.r.roe.gmail.com>
Sent: Wednesday, March 26, 2014 3:38 PM
To: AMBER Developers Mailing List
Subject: Re: [AMBER-Developers] Reliability of GPU calculations

On Wed, Mar 26, 2014 at 1:23 PM, B. Lachele Foley <lfoley.ccrc.uga.edu> wrote:

> Some of your energy/NAN issues might be code-related. We've seen that
> sort of behavior, and it isn't the GPU (best we can tell). There is
> extensive discussion of the issue in a bug report. The problem seems to
> preferentially occur when simulating smaller systems. But, I think the
> precise cause(s) has/have not been determined (Scott?). So, you might
> consider not using results like that to determine whether a GPU is bad.
>

Since the OP said that the NaNs were seen while running the GPU validation
suite, I don't think they are code-related at all; the problems are most
likely with the hardware. I think the real issue is how to detect when your
hardware is failing. You should certainly validate your GPUs when they are
first installed, but over time I don't know if there is a good way to
detect failing GPUs besides periodically re-running the validations (and/or
occasionally obtaining crazy results). Also, I think certain GPUs have
proven to be more reliable than others over time (GPU gurus may want to
expand on this). As Ray said, providing adequate cooling for the GPUs is
probably a must when it comes to extending their life.
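
As a rough illustration of the periodic re-validation idea above, here is a
minimal Python sketch. It assumes you have already collected the final energy
from several repeats of the same deterministic job on each GPU; the function
name, the GPU labels, and the input layout are hypothetical and not part of
AMBER or its validation suite. It also assumes the GPUs are the same model, so
that a correct card should reproduce the same number exactly.

```python
import math
from collections import Counter


def flag_suspect_gpus(energies_by_gpu):
    """Return {gpu_label: reason} for GPUs whose repeated runs look unreliable.

    energies_by_gpu maps a GPU label to the final energies reported by
    repeated runs of the same deterministic job on that GPU (hypothetical
    layout; the values are assumed to be parsed at a fixed precision).
    """
    suspects = {}
    consistent = {}
    for gpu, energies in energies_by_gpu.items():
        if not energies:
            suspects[gpu] = "no data"
        elif any(math.isnan(e) or math.isinf(e) for e in energies):
            suspects[gpu] = "non-finite energy"
        elif len(set(energies)) > 1:
            # A deterministic job should give the same answer every time.
            suspects[gpu] = "repeats disagree"
        else:
            consistent[gpu] = energies[0]

    # Among self-consistent GPUs, take the most common value as the reference.
    if consistent:
        reference, _ = Counter(consistent.values()).most_common(1)[0]
        for gpu, value in consistent.items():
            if value != reference:
                suspects[gpu] = "differs from majority"
    return suspects


if __name__ == "__main__":
    runs = {
        "gpu0": [-5821.3412, -5821.3412],
        "gpu1": [-5821.3412, float("nan")],
    }
    print(flag_suspect_gpus(runs))
```

A flagged card is only a candidate for closer inspection; the majority
comparison cannot tell which side is wrong when only two cards are compared
and they disagree.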

The GPU code is well-written, incredibly fast, and powerful, but the
results require careful validation just like any other results.

-Dan

--
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 201
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Mar 26 2014 - 13:30:03 PDT