Re: [AMBER-Developers] Reliability of GPU calculations

From: Robert Konecny <>
Date: Wed, 26 Mar 2014 15:50:04 -0700

Hi Ross,

this is very useful - thanks. I'm not too worried about the failed cards;
the vendor will replace them for us even though they are 1.5 years old. But
we will have to incorporate some kind of regular hardware validation
protocol on our clusters, as you suggest. We will schedule it automatically
every month - that should at least give us a narrower window in which to
detect a failure. We will also advise our users to do the same - validate
the hardware from time to time, not just on our local clusters but on any
GPU hardware they have access to. But I suspect educating our users about
this issue will be an uphill battle ...
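As a sketch of what that monthly check could look like (illustrative only -
the file names and grep patterns here are my own, not part of Amber; the real
validation suite compares energies against reference outputs):

```shell
#!/bin/sh
# Sketch of a periodic GPU sanity check (assumptions: mdout-style output
# files, and that a failing card shows up as NaNs or overflowed "****"
# energy fields, as in the error summary below). A real check would also
# diff the energies against known-good reference outputs.

check_mdout() {
    f="$1"
    # NaNs or overflowed fields ("******") in the output indicate failure.
    if grep -Eq 'NaN|\*{4,}' "$f"; then
        echo "FAIL: $f shows NaNs or overflowed energies"
        return 1
    fi
    echo "PASS: $f"
}

# Hypothetical usage: repeat the run twenty times, since a marginal card
# may fail in only a few of the trials.
# for i in $(seq 1 20); do
#     pmemd.cuda -O -i mdin -o mdout.$i && check_mdout mdout.$i
# done
```

Scheduling something like this from cron once a month would at least bound
how long a marginal card can go unnoticed.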

Having a standalone comprehensive validation suite would be great!



On Wed, Mar 26, 2014 at 03:18:46PM -0700, Ross Walker wrote:
> Hi Robert,
> Yes, those are definitely faulty GPUs. I'd see if they are still under
> warranty. If the GPUs weren't tested at the time of installation, it's
> possible some of them were marginal to begin with and have gotten worse
> over time. My recommendation is to test the GPUs for approximately 48
> hours at the time of installation, repeat that once every 2 or 3 months
> (just the 20 runs is probably good here), and also test a GPU whenever
> you see behavior that might indicate an issue - such as random launch
> errors or NaNs in simulations you know to be good.
> I am not overly surprised about the 580s. Both the 480 and 580 Fermi
> chips run exceedingly hot (some C2050s have failed as well - the
> annoying thing is that ECC doesn't pick it up most of the time either).
> What cases do you have them in? Do they have ducted fans? With 4 GPUs in
> a box you really should have some kind of ducted, high-volume cooling
> system. Having them in a machine room at 95F is also probably not helping
> their longevity, but that's a whole other battle.
> I am working on an easy-to-run standalone AMBER-based validation suite
> for GPU systems, since most of the test cases out there aren't
> aggressive enough to expose these types of problems.
> All the best
> Ross
> On 3/26/14, 11:13 AM, "Robert Konecny" <> wrote:
> >Hi all,
> >
> >we have just discovered (thanks to the diligence of one of our users)
> >that on our local GPU cluster 7 out of 24 GPU cards went bad. This is
> >an older (1.5 years old) GPU cluster with 24 GTX580s. It is not
> >surprising that these cards fail after some time (after all, these are
> >consumer-grade gaming cards), but the disconcerting thing is how they
> >fail.
> >
> >This user is starting extensive Amber GPU calculations, and he ran the
> >Amber GPU validation suite on all the cards twenty times. The errors
> >showed up either as wrong energies or as NaNs. However, they did not
> >occur consistently, only in some of the twenty trials. Since most of
> >our users do not run the same simulation multiple times, it is very
> >hard to detect a failing card in time. This inconsistency is the
> >troublesome issue, and it is very different from the behavior of
> >CPU-bound jobs.
> >
> >So my question is - how do we prevent this? Should users run validation
> >tests before and after each simulation? How many times? How long is
> >long enough? Is there a better mechanism for detecting GPU hardware
> >errors?
> >
> >What are the recommendations from the developers on this issue?
> >
> >Thanks,
> >
> >Robert
> >
> >
> >PS. This is a summary of the errors:
> >
> > - compute-0-0, GPU ID 1: incorrect energies in six out of twenty
> >   trials.
> > - compute-0-0, GPU ID 3: incorrect energies, NaNs, and ******s in
> >   twenty out of twenty trials.
> > - compute-0-1, GPU ID 3: incorrect energies, NaNs, and ******s in
> >   five out of twenty trials.
> > - compute-0-2, GPU ID 2: MD froze at 445,000 steps in one of the
> >   twenty trials.
> > - compute-0-3, GPU ID 0: incorrect energies in one out of twenty
> >   trials.
> > - compute-0-4, GPU ID 2: incorrect energies in one out of twenty
> >   trials.
> > - compute-0-5, GPU ID 0: incorrect energies, NaNs, and ******s in
> >   fifteen out of twenty trials.
> >
> >_______________________________________________
> >AMBER-Developers mailing list
> >
> >

Received on Wed Mar 26 2014 - 16:00:03 PDT