Re: [AMBER-Developers] Reliability of GPU calculations

From: B. Lachele Foley <lfoley.ccrc.uga.edu>
Date: Wed, 26 Mar 2014 22:11:43 +0000

What's "FPRE"? (fixed precision/point something...?) Just curious.

:-) Lachele

Dr. B. Lachele Foley
Complex Carbohydrate Research Center
The University of Georgia
Athens, GA USA
lfoley.uga.edu
http://glycam.ccrc.uga.edu

________________________________________
From: Scott Le Grand <varelse2005.gmail.com>
Sent: Wednesday, March 26, 2014 5:39 PM
To: AMBER Developers Mailing List
Subject: Re: [AMBER-Developers] Reliability of GPU calculations

The unique aspect of the problem you observed, Lachele, is that one step's
worth of FPRE is sufficient to destroy its occurrence. And that has made a
single-step repro elude me so far.

This problem sounds like misbehaving GPUs.
On Mar 26, 2014 1:16 PM, "B. Lachele Foley" <lfoley.ccrc.uga.edu> wrote:

> > Since the OP said that the NaNs were seen while running the GPU
> > validation suite, I don't think they are code-related
>
> Sure... I was responding to "extensive" and "how long is enough?" We do
> have an unresolved issue with similar behavior in the GPU code. That said,
> if they are running exactly the same, completely deterministic job and
> getting different behavior on different GPUs, it probably is hardware.
>
> :-) Lachele
>
> Dr. B. Lachele Foley
> Complex Carbohydrate Research Center
> The University of Georgia
> Athens, GA USA
> lfoley.uga.edu
> http://glycam.ccrc.uga.edu
>
> ________________________________________
> From: Daniel Roe <daniel.r.roe.gmail.com>
> Sent: Wednesday, March 26, 2014 3:38 PM
> To: AMBER Developers Mailing List
> Subject: Re: [AMBER-Developers] Reliability of GPU calculations
>
> On Wed, Mar 26, 2014 at 1:23 PM, B. Lachele Foley <lfoley.ccrc.uga.edu> wrote:
>
> > Some of your energy/NaN issues might be code-related. We've seen that
> > sort of behavior, and it isn't the GPU (best we can tell). There is
> > extensive discussion of the issue in a bug report. The problem seems to
> > preferentially occur when simulating smaller systems. But, I think the
> > precise cause(s) has/have not been determined (Scott?). So, you might
> > consider not using results like that to determine whether a GPU is bad.
> >
>
> Since the OP said that the NaNs were seen while running the GPU validation
> suite, I don't think they are code-related at all; the problems are most
> likely with the hardware. I think the real issue is how to detect when your
> hardware is failing. You should certainly validate your GPUs when they are
> first installed, but over time I don't know if there is a good way to
> detect failing GPUs besides periodically re-running the validations (and/or
> occasionally obtaining crazy results). Also, I think certain GPUs have
> proven to be more reliable than others over time (GPU gurus may want to
> expand on this). As Ray said, providing adequate cooling for the GPUs is
> probably a must when it comes to extending their life.
>
> The GPU code is well-written, incredibly fast, and powerful, but the
> results require careful validation just like any other results.
>
> -Dan
>
> --
> -------------------------
> Daniel R. Roe, PhD
> Department of Medicinal Chemistry
> University of Utah
> 30 South 2000 East, Room 201
> Salt Lake City, UT 84112-5820
> http://home.chpc.utah.edu/~cheatham/
> (801) 587-9652
> (801) 585-6208 (Fax)
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
Received on Wed Mar 26 2014 - 15:30:02 PDT