Re: [AMBER-Developers] Reliability of GPU calculations

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 26 Mar 2014 15:18:46 -0700

Hi Robert,

Yes, those are definitely faulty GPUs. I'd check whether they are still
under warranty. If the GPUs weren't tested at the time of installation,
it's possible some of them were marginal to begin with and have gotten
worse over time. My recommendation is to test the GPUs for approximately
48 hours at installation, repeat that every 2 or 3 months or so (the 20
runs you describe is probably sufficient here), and also run a test on a
GPU whenever you see behavior that might indicate a problem, such as
random launch errors or NaNs on simulations you know to be good.
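
To give an idea of what such a periodic check could look like, here is a
minimal sketch in Python. It just drives pmemd.cuda repeatedly on one GPU
(pinned via CUDA_VISIBLE_DEVICES) and scans each mdout for NaNs or
overflowed fields. The mdin/prmtop/inpcrd names are placeholders for
whatever known-good test case you use - the proper validation suite
inputs are the better choice when available.

#!/usr/bin/env python
"""Rough GPU burn-in sketch: run a known-good AMBER input repeatedly on one
GPU and flag launch failures, NaNs, or overflowed (******) fields in mdout.
The mdin/prmtop/inpcrd filenames below are placeholders."""
import os
import re
import subprocess
import sys

gpu_id = sys.argv[1] if len(sys.argv) > 1 else "0"
n_runs = 20  # roughly the "20 runs" mentioned above
env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)

suspect = 0
for i in range(n_runs):
    mdout = "mdout.gpu%s.run%d" % (gpu_id, i)
    try:
        # Pin the run to one GPU and overwrite any previous outputs (-O).
        subprocess.check_call(
            ["pmemd.cuda", "-O", "-i", "mdin", "-p", "prmtop", "-c", "inpcrd",
             "-o", mdout, "-r", "restrt.gpu%s.run%d" % (gpu_id, i)],
            env=env)
    except subprocess.CalledProcessError:
        print("GPU %s: run %d failed to launch or complete" % (gpu_id, i))
        suspect += 1
        continue
    text = open(mdout).read()
    # Bad hardware tends to show up as NaN energies or as Fortran field
    # overflow, which prints as a run of asterisks.
    if "NaN" in text or re.search(r"\*{8,}", text):
        print("GPU %s: NaN or overflowed field in run %d" % (gpu_id, i))
        suspect += 1

print("GPU %s: %d suspect run(s) out of %d" % (gpu_id, suspect, n_runs))

Scripting something like this per node every couple of months, and again
whenever a job looks suspicious, is cheap compared to chasing silently
wrong trajectories.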

I am not overly surprised about the 580s. Both the 480 and 580 Fermi
chips run exceedingly hot, and some C2050s have failed as well - the
annoying thing is that ECC doesn't pick it up most of the time either.

What cases do you have them in? Do they have ducted fans? With 4 GPUs in
a box you really should have some kind of ducted, high-volume cooling
system. Having them in a machine room at 95F is also probably not helping
their longevity, but that's a whole other battle.

I am working on an easy-to-run, standalone AMBER-based validation suite
for GPU systems, since most of the test cases out there aren't aggressive
enough to expose these kinds of problems.
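
In the meantime, to give a flavor of the kind of check such a suite has
to make, here is a minimal comparison sketch (not the suite itself): it
diffs the Etot/EKtot/EPtot records of a fresh mdout against a saved,
known-good reference mdout. The filenames and the tolerance here are
placeholders, not anything shipped with AMBER.

#!/usr/bin/env python
"""Sketch of a validation-style check: compare the energy records of a new
mdout against a known-good reference mdout. The tolerance is an assumption."""
import re
import sys

ENERGY_RE = re.compile(r"\b(Etot|EKtot|EPtot)\s+=\s+(-?[\d.]+|\*+|NaN)")

def energy_records(path):
    """Return the (label, value-string) energy entries found in an mdout."""
    with open(path) as fh:
        return ENERGY_RE.findall(fh.read())

def compare(reference, candidate, tol=1.0e-4):
    ref, cand = energy_records(reference), energy_records(candidate)
    if len(ref) != len(cand):
        return "record count differs (%d vs %d)" % (len(ref), len(cand))
    for (label, r), (_, c) in zip(ref, cand):
        if c == "NaN" or c.startswith("*"):
            return "%s is %s - suspect GPU" % (label, c)
        # The reference is assumed clean, so r is always numeric here.
        if abs(float(c) - float(r)) > max(abs(float(r)), 1.0) * tol:
            return "%s: %s vs reference %s" % (label, c, r)
    return None

if __name__ == "__main__":
    problem = compare(sys.argv[1], sys.argv[2])
    print(problem or "energies match the reference within tolerance")

Run it as "python compare_mdout.py reference.mdout new.mdout" (names are
illustrative); anything it flags is worth re-running on a different card.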

All the best
Ross



On 3/26/14, 11:13 AM, "Robert Konecny" <rok.ucsd.edu> wrote:

>Hi all,
>
>we have just discovered (thanks to the diligence of one of our users)
>that on our local GPU cluster 7 out of 24 GPU cards have gone bad. This
>is an older (1.5-year-old) GPU cluster with 24 GTX 580s. It is not
>surprising that these cards fail after some time (after all, they are
>consumer-grade gaming cards), but the disconcerting thing is how they
>fail.
>
>This user is starting extensive Amber GPU calculations, and he ran the
>Amber GPU validation suite on all the cards twenty times. The errors
>showed up either as wrong energies or as NaNs. However, they did not
>occur consistently, only in some of the twenty trials. Since most of our
>users do not run the same simulation multiple times, it is very hard to
>detect a failing card in time. The inconsistency of the errors is the
>troublesome issue, and this is very different from the behavior of
>CPU-bound jobs.
>
>So my question is: how do we prevent this? Should users be running
>validation tests before and after each simulation? How many times? How
>long is long enough? Is there a better mechanism for detecting GPU
>hardware errors?
>
>What are the recommendations from the developers on this issue?
>
>Thanks,
>
>Robert
>
>
>PS. This is a summary of the errors:
>
> - compute-0-0, GPU ID 1. Incorrect energies in six out of twenty trials.
> - compute-0-0, GPU ID 3. Incorrect energies, NaNs, and ******s in twenty
>   out of twenty trials.
> - compute-0-1, GPU ID 3. Incorrect energies, NaNs, and ******s in five
>   out of twenty trials.
> - compute-0-2, GPU ID 2. MD froze at 445,000 steps in one of the twenty
>   trials.
> - compute-0-3, GPU ID 0. Incorrect energies in one out of twenty trials.
> - compute-0-4, GPU ID 2. Incorrect energies in one out of twenty trials.
> - compute-0-5, GPU ID 0. Incorrect energies, NaNs, and ******s in
>   fifteen out of twenty trials.
>



_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Mar 26 2014 - 15:30:03 PDT