Re: [AMBER-Developers] Reliability of GPU calculations

From: Ray Luo, Ph.D. <ray.luo.uci.edu> Date: Wed, 26 Mar 2014 12:15:59 -0700

--
Ray Luo, Ph.D.
Professor,
Biochemistry, Molecular Biophysics, and
Biomedical Engineering
University of California, Irvine, CA 92697-3900
On Wed, Mar 26, 2014 at 11:13 AM, Robert Konecny <rok.ucsd.edu> wrote:
> Hi all,
>
> We have just discovered (thanks to the diligence of one of our users) that
> 7 of the 24 GPU cards on our local GPU cluster have gone bad. This is an
> older (1.5-year-old) cluster with 24 GTX580s. It is not surprising that these
> cards fail after some time (after all, they are consumer-grade gaming cards),
> but the disconcerting thing is the way they fail.
>
> This user is starting extensive Amber GPU calculations, and he ran the Amber
> GPU validation suite on all the cards twenty times. The errors showed up
> either as wrong energies or as NaNs. However, these errors did not occur
> consistently, but only in some of the twenty trials. Since most of our
> users do not run the same simulation multiple times, it is very hard
> to detect a failing card in time. The inconsistency of the errors is the
> troublesome issue, and it is very different from the behavior of CPU-bound
> jobs.
>
> So my question is: how do we prevent this? Should users be running
> validation tests before and after each simulation? How many times? How long
> is long enough? Is there a better mechanism to detect GPU hardware errors?
>
> What are the recommendations from the developers on this issue?
>
> Thanks,
>
> Robert
>
>
> PS. This is a summary of the errors:
>
>    - compute-0-0, GPU ID 1.  Incorrect energies in six out of twenty trials.
>    - compute-0-0, GPU ID 3.  Incorrect energies, NaNs, and ******s in twenty out of twenty trials.
>    - compute-0-1, GPU ID 3.  Incorrect energies, NaNs, and ******s in five out of twenty trials.
>    - compute-0-2, GPU ID 2.  MD froze at 445,000 steps in one of the twenty trials.
>    - compute-0-3, GPU ID 0.  Incorrect energies in one out of twenty trials.
>    - compute-0-4, GPU ID 2.  Incorrect energies in one out of twenty trials.
>    - compute-0-5, GPU ID 0.  Incorrect energies, NaNs, and ******s in fifteen out of twenty trials.
>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
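
On the "How many times?" question, the failure counts in the summary above allow a rough back-of-the-envelope estimate. The sketch below is only an illustration and not an Amber tool: it assumes each validation run on a given card fails independently with a fixed probability, and it takes the observed failure fraction from the twenty trials as that probability (a noisy estimate at best). The function names and the 95% target are hypothetical choices made for this example.

```python
import math


def detection_probability(p_fail, n_trials):
    """Chance that at least one of n_trials runs shows an error,
    assuming independent runs with per-run failure probability p_fail."""
    return 1.0 - (1.0 - p_fail) ** n_trials


def trials_needed(p_fail, confidence=0.95):
    """Smallest number of runs giving at least `confidence` chance of
    seeing one or more failures (same independence assumption)."""
    if not 0.0 < p_fail <= 1.0:
        raise ValueError("p_fail must be in (0, 1]")
    if p_fail == 1.0:
        return 1  # a card that always fails is caught by the first run
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# Failures out of twenty trials, copied from the summary in the quoted message.
# The card that froze once (compute-0-2, GPU 2) is counted as one failure.
observed_failures = {
    "compute-0-0 GPU 1": 6,
    "compute-0-0 GPU 3": 20,
    "compute-0-1 GPU 3": 5,
    "compute-0-2 GPU 2": 1,
    "compute-0-3 GPU 0": 1,
    "compute-0-4 GPU 2": 1,
    "compute-0-5 GPU 0": 15,
}

if __name__ == "__main__":
    for card, failures in observed_failures.items():
        p = failures / 20.0
        print(f"{card}: p~{p:.2f}, "
              f"P(detect in 1 run)={detection_probability(p, 1):.2f}, "
              f"runs for 95% detection={trials_needed(p)}")
```

Under these assumptions, a card that fails in one out of twenty trials would need roughly 59 validation runs for a 95% chance of being caught, while a single before-and-after check would miss it most of the time. If failures are not independent (for example, if they depend on temperature or load), the real numbers could differ considerably.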