Robert,
Just an observation from building my own GPU cluster: the cards sit far too
close together when three or four are installed side by side. Given that each
of them runs at a working temperature of ~80 C (below the spec limit, but
still hot), heat builds up very quickly, and premature failure of electronic
components on the cards or on the motherboard would not be surprising.

The difference between us and gamers is that we run MD jobs 24/7, nonstop;
you rarely see a gamer stress their cards that hard. The cards would be doing
well to last a year under MD workloads. Of course, if you can add some
heavy-duty liquid coolers or extra fans to bring the working temperature
down, you might be able to overcome this stability issue.
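
If it helps, even a trivial watchdog that polls nvidia-smi can tell you when
a card is cooking. Below is a minimal sketch only, assuming nvidia-smi
reports temperatures for these cards; the 75 C alert threshold is an
arbitrary example, not a spec number:

  #!/usr/bin/env python
  # Minimal sketch: poll nvidia-smi once a minute and warn when any GPU
  # runs hotter than an (arbitrary) threshold.
  import subprocess, time

  while True:
      out = subprocess.check_output(
          ["nvidia-smi", "--query-gpu=index,temperature.gpu",
           "--format=csv,noheader,nounits"])
      for line in out.decode().strip().splitlines():
          idx, temp = [field.strip() for field in line.split(",")]
          if int(temp) > 75:
              print("WARNING: GPU %s at %s C" % (idx, temp))
      time.sleep(60)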
Ray
--
Ray Luo, Ph.D.
Professor,
Biochemistry, Molecular Biophysics, and
Biomedical Engineering
University of California, Irvine, CA 92697-3900
On Wed, Mar 26, 2014 at 11:13 AM, Robert Konecny <rok.ucsd.edu> wrote:
> Hi all,
>
> we have just discovered (thanks to the diligence of one of our users) that
> 7 of the 24 GPU cards in our local GPU cluster have gone bad. This is an
> older (1.5-year-old) cluster with 24 GTX 580s. It is not surprising that
> these cards fail after some time (after all, they are consumer-grade gaming
> cards), but the disconcerting thing is the way they fail.
>
> This user is starting extensive Amber GPU calculations, so he ran the Amber
> GPU validation suite on all the cards twenty times. The errors showed up
> either as wrong energies or as NaNs; however, they did not occur
> consistently, only in some of the twenty trials. Since most of our users do
> not run the same simulation multiple times, it is very hard to detect a
> failing card in time. The inconsistency of the errors is the troublesome
> part, and it is very different from the behavior of CPU-bound jobs.
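>
> In case it is useful, the repeat-and-compare check he did can be scripted
> along these lines. This is only a sketch under assumed names: the
> "run_validation.sh" wrapper, the "mdout" output file, the energy regex, and
> the tolerance are placeholders, not the actual suite.
>
>   #!/usr/bin/env python
>   # Sketch: run a short validation MD N times and flag NaNs, overflow
>   # asterisks, or total energies that differ from the first trial.
>   import re, subprocess
>
>   N_TRIALS = 20
>   CMD = ["./run_validation.sh"]   # hypothetical wrapper, not the real suite
>   ENERGY_RE = re.compile(r"Etot\s*=\s*(\S+)")
>
>   reference = None
>   for trial in range(N_TRIALS):
>       subprocess.check_call(CMD)
>       fields = ENERGY_RE.findall(open("mdout").read())  # placeholder file
>       if any("nan" in f.lower() or "*" in f for f in fields):
>           print("trial %d: NaN/***** in output" % trial)
>           continue
>       energies = [float(f) for f in fields]
>       if reference is None:
>           reference = energies
>       elif any(abs(a - b) > 1e-4 for a, b in zip(energies, reference)):
>           print("trial %d: energies differ from first trial" % trial)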
>
> So my question is: how do we prevent this? Should users run validation
> tests before and after each simulation? How many times? How long is long
> enough? Is there a better mechanism for detecting GPU hardware errors?
>
> What are the recommendations from the developers on this issue?
>
> Thanks,
>
> Robert
>
>
> PS. This is a summary of the errors:
>
> - compute-0-0, GPU ID 1. Incorrect energies in six out of twenty trials.
> - compute-0-0, GPU ID 3. Incorrect energies, NaNs, and ******s in twenty out of twenty trials.
> - compute-0-1, GPU ID 3. Incorrect energies, NaNs, and ******s in five out of twenty trials.
> - compute-0-2, GPU ID 2. MD froze at 445,000 steps in one of the twenty trials.
> - compute-0-3, GPU ID 0. Incorrect energies in one out of twenty trials.
> - compute-0-4, GPU ID 2. Incorrect energies in one out of twenty trials.
> - compute-0-5, GPU ID 0. Incorrect energies, NaNs, and ******s in fifteen out of twenty trials.
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers