Re: [AMBER-Developers] Parallel Test failures with CUDA 5.5

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Thu, 6 Feb 2014 10:42:30 -0700

Hi,

Here are some runs I just did on Stampede GPU nodes (Tesla K20m), Intel
13.0.2, mvapich2/1.9a2.

***Cuda 5.0 - Serial***
98 file comparisons passed
18 file comparisons failed (minor diffs)
6 tests experienced errors (see below)

The errors (really just 2 distinct failures, each counted 3 times) come from the following:

cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar SPFP
/home1/00301/tg455746/GIT/amber/include/netcdf.mod
ERROR: Calculation halted.  Periodic box dimensions have changed too much from their initial values.

This happens for 'cd tip4pew/ && ./Run.tip4pew_oct_npt_mcbar' as well.
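
For what it's worth, my reading of that message (an assumption, not taken from the code): the MC barostat moves the box during NPT, and the run is halted once the box strays too far from the dimensions everything was sized for at step 0. A hypothetical sketch of that kind of check, with an invented tolerance, and not pmemd source:

/* Hypothetical illustration only -- not pmemd source.  Shows the kind of
 * sanity check that would produce "Periodic box dimensions have changed
 * too much from their initial values". */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* 'tol' is an invented relative tolerance, purely for illustration. */
static void check_box(const double init[3], const double cur[3], double tol)
{
    for (int i = 0; i < 3; ++i) {
        double rel = fabs(cur[i] - init[i]) / init[i];
        if (rel > tol) {
            fprintf(stderr, "ERROR: Calculation halted.  Periodic box "
                            "dimensions have changed too much from their "
                            "initial values.\n");
            exit(1);
        }
    }
}

int main(void)
{
    double init[3] = {30.0, 30.0, 30.0};   /* box at step 0 (Angstrom)      */
    double cur[3]  = {27.0, 30.5, 30.2};   /* box after MC volume moves     */
    check_box(init, cur, 0.05);            /* 5% tolerance, made up here    */
    return 0;
}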

***Cuda 5.0 - Parallel***
47 file comparisons passed
30 file comparisons failed
14 tests experienced errors

Here the diffs are major again (some absolute errors on the order of
10^2-10^5!), similar to what I was seeing on BW with Cuda 5.5. I ran it
twice just to be sure and got exactly the same diffs both times. The errors
come from 3 tests:
'cd tip4pew/ && ./Run.tip4pew_box_npt' (| ERROR: max pairlist cutoff must be less than unit cell max sphere radius!); see the sketch below this list for the geometric condition that message refers to

'cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar' (unspecified launch failure, probably related to what happens in serial)

'cd tip4pew/ && ./Run.tip4pew_oct_npt_mcbar' (same as serial)
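
For reference, the pairlist-cutoff message in the first test above comes down to a simple geometric condition: the nonbonded cutoff plus the pairlist skin has to fit inside the largest sphere the unit cell can contain (half the shortest box width for an orthorhombic cell), and an NPT box that shrinks too far violates it. A minimal sketch of that condition, with made-up numbers, and not pmemd source:

/* Illustration only (orthorhombic case) -- not pmemd source.
 * The pairlist cutoff (cut + skin) must fit inside the largest sphere
 * the unit cell can hold, i.e. half the shortest box edge here; for a
 * triclinic cell it would be half the smallest perpendicular width. */
#include <stdio.h>

int main(void)
{
    double box[3] = {19.0, 30.0, 30.0};  /* cell edges after NPT shrink (A) */
    double cut    = 8.0;                 /* nonbonded cutoff (invented)     */
    double skin   = 2.0;                 /* pairlist skin (invented)        */

    double max_sphere = box[0] / 2.0;
    for (int i = 1; i < 3; ++i)
        if (box[i] / 2.0 < max_sphere)
            max_sphere = box[i] / 2.0;

    if (cut + skin >= max_sphere)
        printf("ERROR: max pairlist cutoff must be less than unit cell "
               "max sphere radius!\n");
    else
        printf("cutoff %.1f + skin %.1f fits inside radius %.1f\n",
               cut, skin, max_sphere);
    return 0;
}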

***Cuda 5.5 - Parallel***
55 file comparisons passed
30 file comparisons failed
0 tests experienced errors

All diffs appear minor. So now I have the case where the Cuda 5.5-compiled
parallel code is behaving well, which is a bit maddening, but it supports
your idea that the bug is not Cuda version-specific. Note too that the
tip4pew-related errors are gone. I ran these tests twice as well, and the
diffs didn't quite match; I think that may be innocuous though:

< >  Etot   =         0.1629  EKtot   =        54.6589  EPtot      =        54.6898
< ### Maximum absolute error in matching lines = 1.00e-04 at line 206 field 3
< ### Maximum relative error in matching lines = 6.14e-04 at line 206 field 3
---
> >  Etot   =         0.1628  EKtot   =        54.6589  EPtot      =        54.6898
> ### Maximum absolute error in matching lines = 2.00e-04 at line 206 field 3
> ### Maximum relative error in matching lines = 1.23e-03 at line 206 field 3
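
As for why back-to-back parallel runs don't diff identically: my guess (an assumption, not a diagnosis) is simply that the order in which per-rank/per-GPU partial energies get accumulated can vary from run to run, and floating-point addition is not associative, so the totals can wobble in the last printed digit. A toy example in plain C, nothing to do with the AMBER code:

/* Toy illustration, not AMBER code: summing the same single-precision
 * partial energies in a different order can change the total in the
 * 3rd-4th decimal place -- roughly the size of the run-to-run drift
 * shown in the diff above. */
#include <stdio.h>

int main(void)
{
    float e[4] = {54.68981f, -12345.679f, 12345.679f, 0.00004321f};

    /* reduction order A */
    float sumA = ((e[0] + e[1]) + e[2]) + e[3];
    /* reduction order B: same addends, different association */
    float sumB = ((e[1] + e[2]) + e[3]) + e[0];

    printf("order A: %.4f\n", sumA);
    printf("order B: %.4f\n", sumB);
    return 0;
}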

***Cuda 5.5 - Serial***
95 file comparisons passed
28 file comparisons failed
0 tests experienced errors

Again, the serial code behaves well (only minor diffs), and the tip4pew
errors are not present.

So it seems like whatever issue causes the tip4pew tests to fail in serial
may be related to the problems in parallel; i.e., failure of those tests in
serial is a predictor of more massive failures in parallel. I will continue
testing; in particular, I'm going to try to reproduce my 'good' tests on BW
with Cuda 5.0 after a recompile. If you can think of any other tests I
should run, let me know.
Take care,
-Dan
On Thu, Feb 6, 2014 at 9:38 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Dan,
>
> Note there is almost certainly an underlying overflow of an array
> somewhere in the parallel code. I would not be surprised if it is hidden
> when the 2 GPUs are in the same box, and only shows up on just 2 GPUs
> when pinned memory and multiple nodes are involved. So it's probably not
> a BlueWaters-specific thing (although I would not be surprised if it
> were).
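
A hypothetical sketch of the kind of bug described above (invented names, sizes, and layout; not pmemd source): an off-by-one write past the end of a per-GPU buffer is invisible when the adjacent memory happens to be padding, but silently corrupts whatever the allocator placed next to it under a different layout, e.g. pinned staging buffers that only exist on the multi-node MPI path.

/* Hypothetical illustration (invented names) of how an out-of-bounds
 * write can stay hidden under one memory layout and corrupt data under
 * another.  The off-by-one loop writes one element past the end of
 * recv_buf; what that clobbers depends on what sits next to it. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n_atoms = 1024;

    /* Per-GPU receive buffer.  If it were padded (rounded up for
     * alignment, or simply over-allocated), the overrun below would land
     * in unused bytes and never be noticed. */
    double *recv_buf = malloc(n_atoms * sizeof(double));

    /* An adjacent allocation -- think of a pinned staging buffer that is
     * only allocated for the multi-node MPI path. */
    double *halo_buf = malloc(n_atoms * sizeof(double));
    halo_buf[0] = 42.0;

    /* The bug: off-by-one loop bound. */
    for (int i = 0; i <= n_atoms; ++i)      /* should be i < n_atoms */
        recv_buf[i] = 0.0;

    /* If halo_buf happens to sit immediately after recv_buf in memory,
     * its first element has just been silently zeroed; downstream that
     * shows up as garbage energies/NaNs rather than an immediate crash. */
    printf("halo_buf[0] = %f  (42.0 expected)\n", halo_buf[0]);

    free(halo_buf);
    free(recv_buf);
    return 0;
}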
>
> I'll add this to the bug report and we'll figure out what is going wrong.
> Note we pretty much don't test on any systems where the GPUs are in
> different nodes (except REMD) these days, since peer-to-peer is so good
> and the interconnects suck in comparison. So BlueWaters is probably the
> only place where such runs will get tested. So please try every
> combination you can in the run-up to release.
>
> All the best
> Ross
>
>
>
> On 2/5/14, 8:46 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:
>
> >Hi,
> >
> >On Wed, Feb 5, 2014 at 7:08 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
> >
> >> So it looks like the problem is NOT cuda5.5 related but rather a bug in
> >> the parallel GPU code that may be showing up on 2 GPUs elsewhere or
> >> differently with different MPIs.
> >>
> >> Dan, what are your specs for the problems you are seeing?
> >>
> >
> >This is running on Blue Waters, 2 XK nodes (Tesla K20X cards). It could
> >just be something weird with their installed 5.5 libraries (wouldn't be
> >the first time I've had issues with their libs). I will try to test this
> >on some of our local GPUs tomorrow; I would do it now, but my internet
> >connection has been going in and out at my house tonight and it's tough
> >to write scripts when the terminal keeps disconnecting...
> >
> >One question: are all the GPUs you are testing in the same box? If so,
> >maybe it's something to do with actually having to go across a network
> >device?
> >
> >I'll let you know what I find tomorrow. Take care,
> >
> >-Dan
> >
> >
> >>
> >> All the best
> >> Ross
> >>
> >>
> >> On 2/5/14, 2:33 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:
> >>
> >> >Hi All,
> >> >
> >> >Has anyone seen really egregious test failures using
> >> >pmemd.cuda.MPI/cuda5.5 compiled from the GIT tree (updated today)? I'm
> >> >getting some insane differences and '***' in energy fields (see below
> >> >for an example, full test diffs attached). I do not see this problem
> >> >with pmemd.cuda/cuda5.5 or pmemd.cuda.MPI/cuda5.0 (those diffs are
> >> >attached as well and seem OK). This was compiled using GNU 4.8.2
> >> >compilers.
> >> >
> >> >Not sure if this means anything, but most of the failures seem to be
> >> >with PME; the only GB stuff that fails is AMD-related.
> >> >
> >> >Any ideas?
> >> >
> >> >-Dan
> >> >
> >> >---------------------------------------
> >> >possible FAILURE:  check mdout.tip4pew_box_npt.dif
> >> >/mnt/b/projects/sciteam/jn6/GIT/amber-gnu/test/cuda/tip4pew
> >> >96c96
> >> ><  NSTEP =        1   TIME(PS) =       0.002  TEMP(K) =   122.92  PRESS =    42.6
> >> >>  NSTEP =        1   TIME(PS) =       0.002  TEMP(K) =   128.19  PRESS =    43.5
> >> ><snip>
> >> >426c426
> >> ><  NSTEP =       40   TIME(PS) =       0.080  TEMP(K) =    38.69  PRESS =   659.4
> >> >>  NSTEP =       40   TIME(PS) =       0.080  TEMP(K) =      NaN  PRESS =     NaN
> >> >427c427
> >> ><  Etot   =        18.6535  EKtot   =       231.6979  EPtot      =       240.1483
> >> >>  Etot   =            NaN  EKtot   =            NaN  EPtot      =            NaN
> >> >428c428
> >> ><  BOND   =         0.6316  ANGLE   =         1.2182  DIHED      =         0.3663
> >> >>  BOND   = **************  ANGLE   =       361.5186  DIHED      =         5.4026
> >> >429c429
> >> ><  1-4 NB =         0.8032  1-4 EEL =         1.3688  VDWAALS    =       100.3454
> >> >>  1-4 NB = **************  1-4 EEL = **************  VDWAALS    =            NaN
> >> >430c430
> >> ><  EELEC  =       222.4484  EHBOND  =         0.  RESTRAINT  =         0.
> >> >>  EELEC  =            NaN  EHBOND  =         0.  RESTRAINT  =         0.
> >> >431c431
> >> ><  EKCMT  =       131.0089  VIRIAL  =       699.4621  VOLUME     =       192.3578
> >> >>  EKCMT  =      1278.0524  VIRIAL  =            NaN  VOLUME     =            NaN
> >> >432c432
> >> ><                                                     Density    =         0.0030
> >> >>                                                     Density    =            NaN
> >> >### Maximum absolute error in matching lines = 2.38e+04 at line 385 field 3
> >> >### Maximum relative error in matching lines = 1.55e+01 at line 257 field 3
> >> >
> >> >--
> >> >-------------------------
> >> >Daniel R. Roe, PhD
> >> >Department of Medicinal Chemistry
> >> >University of Utah
> >> >30 South 2000 East, Room 201
> >> >Salt Lake City, UT 84112-5820
> >> >http://home.chpc.utah.edu/~cheatham/
> >> >(801) 587-9652
> >> >(801) 585-6208 (Fax)
> >>
> >>
> >>
> >>
> >
> >
> >
> >--
> >-------------------------
> >Daniel R. Roe, PhD
> >Department of Medicinal Chemistry
> >University of Utah
> >30 South 2000 East, Room 201
> >Salt Lake City, UT 84112-5820
> >http://home.chpc.utah.edu/~cheatham/
> >(801) 587-9652
> >(801) 585-6208 (Fax)
>
>
>
>
-- 
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 201
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers