[AMBER-Developers] Reliability of GPU calculations

From: Robert Konecny <rok.ucsd.edu>
Date: Wed, 26 Mar 2014 11:13:29 -0700

Hi all,

we have just discovered (thanks to the diligence of one of our users) that
on our local GPU cluster 7 out of 24 GPU cards went bad. This is an older
(1.5-year-old) GPU cluster with 24 GTX 580s. It is not surprising that these
cards fail after some time (after all, these are consumer-grade gaming
cards), but the disconcerting thing is the way they fail.

This user is starting extensive Amber GPU calculations, so he ran the Amber
GPU validation suite on all the cards twenty times. The errors showed up
either as wrong energies or as NaNs. However, these errors did not occur
consistently but only in some of the twenty trials. Since most of our
users do not run the same simulation multiple times, it is very hard
to detect a failing card in time. The inconsistency of the errors is the
troublesome issue, and this is very different from the behavior of the
CPU-bound calculations.

So my question is - how do we prevent this? Should the users be running
validation tests before and after each simulation? How many times? How long
is long enough? Is there a better mechanism to detect GPU hardware errors?

What are the recommendations from the developers on this issue?
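One automated check I have been considering is to parse the final energy from
each trial's output and flag trials that are NaN or deviate from the consensus
of the other trials. A minimal sketch in Python (the tolerance, the
median-as-reference choice, and the function name are my own assumptions, not
part of the validation suite):

```python
import math

def flag_bad_trials(final_energies, reference=None, tol=1e-4):
    """Return indices of trials whose final energy is NaN/inf or deviates
    from the reference energy by more than a relative tolerance.

    By default the reference is the median of the finite energies, on the
    assumption that most trials on a given card are correct."""
    finite = sorted(e for e in final_energies if math.isfinite(e))
    if reference is None:
        if not finite:
            # every trial produced NaN/inf: flag them all
            return list(range(len(final_energies)))
        reference = finite[len(finite) // 2]  # median of finite values
    bad = []
    for i, e in enumerate(final_energies):
        if not math.isfinite(e) or abs(e - reference) > tol * abs(reference):
            bad.append(i)
    return bad

# Example: twenty-trial run reduced to five energies for brevity;
# trial 2 produced a NaN and trial 3 a wrong energy.
energies = [-58230.1, -58230.1, float("nan"), -51000.0, -58230.1]
print(flag_bad_trials(energies))  # -> [2, 3]
```

The extraction of the final energies from the mdout files would still have to
be scripted per site; this only covers the comparison step.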



PS. This is a summary of the errors:

   - compute-0-0, GPU ID 1. Incorrect energies in six out of twenty trials.
   - compute-0-0, GPU ID 3. Incorrect energies, NaNs, and ******s in twenty out of twenty trials.
   - compute-0-1, GPU ID 3. Incorrect energies, NaNs, and ******s in five out of twenty trials.
   - compute-0-2, GPU ID 2. MD froze at 445,000 steps in one of the twenty trials.
   - compute-0-3, GPU ID 0. Incorrect energies in one out of twenty trials.
   - compute-0-4, GPU ID 2. Incorrect energies in one out of twenty trials.
   - compute-0-5, GPU ID 0. Incorrect energies, NaNs, and ******s in fifteen out of twenty trials.

AMBER-Developers mailing list
Received on Wed Mar 26 2014 - 11:30:03 PDT