Re: [AMBER-Developers] Reliability of GPU calculations

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 26 Mar 2014 15:59:41 -0700

Hi Robert,


Note I wouldn't worry too much about the odd error causing major issues
with MD. MD is ultimately stochastic, so I'd be surprised if the
problematic GPUs invalidate results in any way - that is, lead to an
incorrect scientific conclusion. Mostly it just becomes annoying when
runs crash frequently. The main thing is to make sure the GPUs are good
from the outset, since that catches a lot of faulty GeForce cards
(NVIDIA only burns them in for 15 minutes).

Note I've seen CPUs do the same thing here - the difference is that it's
harder to detect because we don't have reproducibility on the CPU, so it
just manifests as a node crashing occasionally. In serial the run is
deterministic, but running in serial doesn't load a CPU enough to push
its thermal limit.
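
If you want to script the kind of periodic check I describe in my earlier
mail below, a rough sketch in Python of what I have in mind looks
something like the following. The mdin/prmtop/inpcrd names are
placeholders for a short, known-good input set, and the Etot parsing
assumes the standard mdout energy print format - treat it as a starting
point rather than anything official:

import os
import re
import subprocess

GPUS = [0, 1, 2, 3]      # GPU indices present in the node
TRIALS = 20              # roughly 20 repeats per periodic check
ENERGY_RE = re.compile(r"Etot\s*=\s*(\S+)")

def final_etot(mdout_path):
    """Return the last Etot field printed in an mdout file, or None."""
    with open(mdout_path) as f:
        hits = ENERGY_RE.findall(f.read())
    return hits[-1] if hits else None

for gpu in GPUS:
    reference = None
    for trial in range(TRIALS):
        out = "mdout.gpu%d.trial%d" % (gpu, trial)
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        # Same short job every time; on the GPU code identical inputs
        # should reproduce identical energies, so any drift is suspect.
        subprocess.call(["pmemd.cuda", "-O", "-i", "mdin", "-p", "prmtop",
                         "-c", "inpcrd", "-o", out,
                         "-r", "restrt.gpu%d.%d" % (gpu, trial)],
                        env=env)
        etot = final_etot(out)
        if etot is None or "*" in etot or "nan" in etot.lower():
            print("GPU %d trial %d: bad output (%r)" % (gpu, trial, etot))
        elif reference is None:
            reference = etot
        elif etot != reference:
            print("GPU %d trial %d: Etot %s differs from first trial %s"
                  % (gpu, trial, etot, reference))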

All the best
Ross

On 3/26/14, 3:50 PM, "Robert Konecny" <rok.ucsd.edu> wrote:

>Hi Ross,
>
>this is very useful - thanks. I'm not too worried about the failed cards;
>the vendor will be replacing them for us even though they are 1.5 years
>old. But we will have to incorporate some kind of regular hardware
>validation protocol on our clusters, as you are suggesting. We will
>schedule this automatically every month - that should at least narrow the
>window in which a failure can go undetected. We will also be advising our
>users to do the same - validate the hardware from time to time, not just
>on our local clusters but on any GPU hardware they have access to. But I
>suspect educating our users about this issue will be an uphill battle ...
>
>Having a standalone comprehensive validation suite would be great!
>
>Thanks!
>
>Robert
>
>On Wed, Mar 26, 2014 at 03:18:46PM -0700, Ross Walker wrote:
>> Hi Robert,
>>
>> Yeap, those are definitely faulty GPUs. I'd see if they are still under
>> warranty. If the GPUs weren't tested at the time of installation, it's
>> possible some of them were marginal to begin with and have got worse
>> over time. My recommendation is to test the GPUs for approximately 48
>> hours at installation time, then repeat that every 2 or 3 months or so
>> (just the 20 runs is probably good here), and also to run a test on a
>> GPU whenever you see behavior that might indicate an issue - such as
>> random launch errors or NaNs on simulations you know to be good.
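>>
>> As far as what to flag goes, a simple pass over the mdout catches the
>> usual signatures - NaNs, '*****' overflow fields, or a final Etot that
>> doesn't match a reference value from a GPU you trust. A rough sketch
>> only (the Etot regex assumes the standard mdout energy format, and the
>> reference value and tolerance are placeholders you'd fill in from a
>> trusted run):
>>
>> import re
>> import sys
>>
>> ENERGY_RE = re.compile(r"Etot\s*=\s*(\S+)")
>>
>> def check_mdout(path, ref_etot, tol=1.0e-4):
>>     """Return a list of problems found in a single mdout file."""
>>     with open(path) as f:
>>         text = f.read()
>>     problems = []
>>     if re.search(r"\bnan\b", text, re.IGNORECASE):
>>         problems.append("NaN in output")
>>     etots = ENERGY_RE.findall(text)
>>     if not etots:
>>         return problems + ["no Etot lines found"]
>>     if any("*" in e for e in etots):
>>         problems.append("'*****' overflow in energy fields")
>>     try:
>>         if abs(float(etots[-1]) - ref_etot) > tol:
>>             problems.append("final Etot %s deviates from reference %g"
>>                             % (etots[-1], ref_etot))
>>     except ValueError:
>>         problems.append("unparseable final Etot: %r" % etots[-1])
>>     return problems
>>
>> if __name__ == "__main__":
>>     # usage: python check_mdout.py <mdout> <reference_etot>
>>     issues = check_mdout(sys.argv[1], float(sys.argv[2]))
>>     print("\n".join(issues) if issues else "OK")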
>>
>> I am not overly surprised about the 580s. Both the 480 and 580 Fermi
>> chips run exceedingly hot (there have also been C2050s that have
>> failed - the annoying thing is that ECC doesn't pick it up most of the
>> time either).
>>
>> What cases do you have them in? Do they have ducted fans? With 4 GPUs
>> in a box you really should have some kind of ducted, high-volume
>> cooling system. Having them in a machine room at 95F is also probably
>> not helping their longevity, but that's a whole other battle.
>>
>> I am working on an easy-to-run, standalone AMBER-based validation
>> suite for GPU systems, since most of the test cases out there aren't
>> aggressive enough to expose these types of problems.
>>
>> All the best
>> Ross
>>
>>
>>
>> On 3/26/14, 11:13 AM, "Robert Konecny" <rok.ucsd.edu> wrote:
>>
>> >Hi all,
>> >
>> >we have just discovered (thanks to the diligence of one of our users)
>> >that 7 out of 24 GPU cards on our local GPU cluster have gone bad.
>> >This is an older (1.5-year-old) cluster with 24 GTX 580s. It is not
>> >surprising that these cards fail after some time (after all, they are
>> >consumer-grade gaming cards), but the disconcerting thing is the way
>> >they fail.
>> >
>> >This user is starting extensive Amber GPU calculations, and he ran
>> >the Amber GPU validation suite on all the cards twenty times. The
>> >errors showed up either as wrong energies or as NaNs. However, they
>> >did not occur consistently, only in some of the twenty trials. Since
>> >most of our users do not run the same simulation multiple times, it
>> >is very hard to detect a failing card in time. The inconsistency of
>> >the errors is the troublesome part, and it is very different from the
>> >behavior of CPU-bound jobs.
>> >
>> >So my question is - how do we prevent this? Should users run
>> >validation tests before and after each simulation? How many times?
>> >How long is long enough? Is there a better mechanism for detecting
>> >GPU hardware errors?
>> >
>> >What are the recommendations from the developers on this issue?
>> >
>> >Thanks,
>> >
>> >Robert
>> >
>> >
>> >PS. This is a summary of the errors:
>> >
>> > - compute-0-0, GPU ID 1. Incorrect energies in six out of twenty trials.
>> > - compute-0-0, GPU ID 3. Incorrect energies, NaNs, and ******s in
>> >   twenty out of twenty trials.
>> > - compute-0-1, GPU ID 3. Incorrect energies, NaNs, and ******s in
>> >   five out of twenty trials.
>> > - compute-0-2, GPU ID 2. MD froze at 445,000 steps in one of the
>> >   twenty trials.
>> > - compute-0-3, GPU ID 0. Incorrect energies in one out of twenty trials.
>> > - compute-0-4, GPU ID 2. Incorrect energies in one out of twenty trials.
>> > - compute-0-5, GPU ID 0. Incorrect energies, NaNs, and ******s in
>> >   fifteen out of twenty trials.
>> >



_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Mar 26 2014 - 16:00:04 PDT