Hi,
Here are some runs I just did on Stampede GPU nodes (Tesla K20m cards), with
Intel 13.0.2 and mvapich2/1.9a2.
***Cuda 5.0 - Serial***
98 file comparisons passed
18 file comparisons failed (minor diffs)
6 tests experienced errors (see below)
The errors (really just 2 distinct failures, each counted 3 times) come from the following:
cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar SPFP
/home1/00301/tg455746/GIT/amber/include/netcdf.mod
ERROR: Calculation halted. Periodic box dimensions have changed too much
from their initial values.
This happens for 'cd tip4pew/ && ./Run.tip4pew_oct_npt_mcbar' as well.
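In case anyone wants to poke at these two directly, here is a minimal sketch of
re-running them by hand (the AMBERHOME path is just a placeholder for your own
checkout, and I'm assuming the Run scripts pick the serial binary when
DO_PARALLEL is unset):

  # Sketch only: re-run the two failing serial MC barostat tests by hand.
  export AMBERHOME=$HOME/GIT/amber   # placeholder -- point this at your checkout
  unset DO_PARALLEL                  # assumption: unset => the Run scripts go serial
  cd $AMBERHOME/test/cuda/tip4pew
  ./Run.tip4pew_box_npt_mcbar SPFP
  ./Run.tip4pew_oct_npt_mcbar SPFP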
***Cuda 5.0 - Parallel***
47 file comparisons passed
30 file comparisons failed
14 tests experienced errors
Here the diffs are major again (some absolute errors on the order of
10^2-10^5!), similar to what I was seeing on BW with Cuda 5.5. I ran it
twice just to be sure and got exactly the same diffs both times. The errors
come from 3 tests (see the sketch after this list):
'cd tip4pew/ && ./Run.tip4pew_box_npt' (| ERROR: max pairlist cutoff must
be less than unit cell max sphere radius!)
'cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar' (unspecified launch failure,
probably related to what happens in serial)
'cd tip4pew/ && ./Run.tip4pew_oct_npt_mcbar' (same as serial)
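For reference, the parallel runs can be re-launched the same way from the test
tree; a rough sketch follows (the mpirun line is a placeholder -- on Stampede
the real runs go through the batch system -- and I'm assuming the Run scripts
simply prepend DO_PARALLEL):

  # Sketch only: 2-task parallel re-run of the three failing tests.
  # The launcher is a placeholder; on Stampede the real runs go through the
  # batch system rather than a bare mpirun.
  export DO_PARALLEL="mpirun -np 2"  # assumption: the Run scripts prepend this
  cd $AMBERHOME/test/cuda/tip4pew
  ./Run.tip4pew_box_npt SPFP
  ./Run.tip4pew_box_npt_mcbar SPFP
  ./Run.tip4pew_oct_npt_mcbar SPFP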
***Cuda 5.5 - Parallel***
55 file comparisons passed
30 file comparisons failed
0 tests experienced errors
All diffs appear minor. So now I have the case where the Cuda 5.5-compiled
parallel code is behaving well, which is a bit maddening, but it supports your
idea that the bug is not Cuda version-specific. Note too that the
tip4pew-related errors are gone. I ran these tests twice as well and the
diffs didn't quite match between runs; I think the discrepancy is innocuous
though:
< > Etot = 0.1629  EKtot = 54.6589  EPtot = 54.6898
< ### Maximum absolute error in matching lines = 1.00e-04 at line 206 field 3
< ### Maximum relative error in matching lines = 6.14e-04 at line 206 field 3
---
> > Etot = 0.1628  EKtot = 54.6589  EPtot = 54.6898
> ### Maximum absolute error in matching lines = 2.00e-04 at line 206 field 3
> ### Maximum relative error in matching lines = 1.23e-03 at line 206 field 3
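If anyone wants to repeat the run-to-run comparison, something like the sketch
below works (run1/ and run2/ are placeholder directories, and the glob assumes
the .dif files sit in the per-test directories as in the example further down):

  # Sketch only: stash the .dif files from two full passes, then line up the
  # max-error summaries. Directory names and the glob are placeholders.
  mkdir -p run1 run2
  # after each full pass, copy the diffs in, e.g.: cp $AMBERHOME/test/cuda/*/*.dif run1/
  for d in run1 run2; do
    grep "Maximum absolute error" "$d"/*.dif | sed "s|^$d/||" | sort > "${d}_maxerr.txt"
  done
  diff run1_maxerr.txt run2_maxerr.txt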
***Cuda 5.5 - Serial***
95 file comparisons passed
28 file comparisons failed
0 tests experienced errors
Again the serial code behaves well: the diffs are minor and the tip4pew
errors are not present.
So it seems like whatever issue causes the tip4pew tests to fail in serial
may be related to the problems in parallel; i.e., failure of those tests in
serial is a predictor of much larger failures in parallel. I will continue
testing; in particular, I'm going to try to reproduce my 'good' tests on BW
with Cuda 5.0 after a re-compile (rough plan sketched below). If you can
think of any other tests I should run, let me know.
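Roughly the plan for the BW reproduction, cycling the two mcbar tests through
serial and a 2-node parallel layout (sketch only; the module version string
and the aprun flags are placeholders from memory):

  # Sketch only: serial vs. 2-node / 2-GPU parallel sweep on BW.
  # The module version string and the aprun flags are placeholders from memory.
  module swap cudatoolkit cudatoolkit/5.0    # placeholder version
  for par in "" "aprun -n 2 -N 1"; do        # serial pass, then 1 task per XK node
    if [ -z "$par" ]; then unset DO_PARALLEL; else export DO_PARALLEL="$par"; fi
    cd $AMBERHOME/test/cuda/tip4pew
    ./Run.tip4pew_box_npt_mcbar SPFP
    ./Run.tip4pew_oct_npt_mcbar SPFP
  done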
Take care,
-Dan
On Thu, Feb 6, 2014 at 9:38 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Dan,
>
> Note there is almost certainly an underlying overflow of an array
> somewhere in the parallel code. I would not be surprised if it is hidden on
> 2 GPUs in the same box and only shows up on 2 GPUs when pinned memory and
> multiple nodes are involved. So it's probably not a BlueWaters-specific
> thing (although I would not be surprised if it were).
>
> I'll add this to the bug report and we'll figure out what is going wrong.
> Note that we pretty much don't test on any systems where the GPUs are in
> different nodes (except for REMD) these days, since peer-to-peer is so good
> and the interconnects suck in comparison. So BlueWaters is probably the only
> place where such runs will get tested. Please try every combination you can
> in the run-up to release.
>
> All the best
> Ross
>
>
>
> On 2/5/14, 8:46 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:
>
> >Hi,
> >
> >On Wed, Feb 5, 2014 at 7:08 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
> >
> >> So it looks like the problem is NOT cuda5.5 related but rather a bug in
> >> the parallel GPU code that may be showing up on 2 GPUs elsewhere or
> >> differently with different MPIs.
> >>
> >> Dan, what are your specs for the problems you are seeing?
> >>
> >
> >This is running on Blue Waters, 2 XK nodes (Tesla K20X cards). It could
> >just be something weird with their installed 5.5 libraries (wouldn't be
> >the first time I've had issues with their libs). I will try to test this
> >on some of our local GPUs tomorrow; I would do it now, but my internet
> >connection has been going in and out at my house tonight and it's tough to
> >write scripts when the terminal keeps disconnecting...
> >
> >One question: are all the GPUs you are testing in the same box? If so,
> >maybe it's something to do with actually having to go across a network
> >device?
> >
> >I'll let you know what I find tomorrow. Take care,
> >
> >-Dan
> >
> >
> >>
> >> All the best
> >> Ross
> >>
> >>
> >> On 2/5/14, 2:33 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:
> >>
> >> >Hi All,
> >> >
> >> >Has anyone seen really egregious test failures using
> >> >pmemd.cuda.MPI/cuda5.5
> >> >compiled from the GIT tree (updated today)? I'm getting some insane
> >> >differences and '***' in energy fields (see below for an example, full
> >> >test
> >> >diffs attached). I do not see this problem with pmemd.cuda/cuda5.5 or
> >> >pmemd.cuda.MPI/cuda5.0 (those diffs are attached as well and seem OK).
> >> >This
> >> >was compiled using GNU 4.8.2 compilers.
> >> >
> >> >Not sure if this means anything, but most of the failures seem to be
> >>with
> >> >PME; the only GB stuff that fails is AMD-related.
> >> >
> >> >Any ideas?
> >> >
> >> >-Dan
> >> >
> >> >---------------------------------------
> >> >possible FAILURE: check mdout.tip4pew_box_npt.dif
> >> >/mnt/b/projects/sciteam/jn6/GIT/amber-gnu/test/cuda/tip4pew
> >> >96c96
> >> >< NSTEP = 1  TIME(PS) = 0.002  TEMP(K) = 122.92  PRESS = 42.6
> >> >> NSTEP = 1  TIME(PS) = 0.002  TEMP(K) = 128.19  PRESS = 43.5
> >> ><snip>
> >> >426c426
> >> >< NSTEP = 40  TIME(PS) = 0.080  TEMP(K) = 38.69  PRESS = 659.4
> >> >> NSTEP = 40  TIME(PS) = 0.080  TEMP(K) = NaN  PRESS = NaN
> >> >427c427
> >> >< Etot = 18.6535  EKtot = 231.6979  EPtot = 240.1483
> >> >> Etot = NaN  EKtot = NaN  EPtot = NaN
> >> >428c428
> >> >< BOND = 0.6316  ANGLE = 1.2182  DIHED = 0.3663
> >> >> BOND = **************  ANGLE = 361.5186  DIHED = 5.4026
> >> >429c429
> >> >< 1-4 NB = 0.8032  1-4 EEL = 1.3688  VDWAALS = 100.3454
> >> >> 1-4 NB = **************  1-4 EEL = **************  VDWAALS = NaN
> >> >430c430
> >> >< EELEC = 222.4484  EHBOND = 0.  RESTRAINT = 0.
> >> >> EELEC = NaN  EHBOND = 0.  RESTRAINT = 0.
> >> >431c431
> >> >< EKCMT = 131.0089  VIRIAL = 699.4621  VOLUME = 192.3578
> >> >> EKCMT = 1278.0524  VIRIAL = NaN  VOLUME = NaN
> >> >432c432
> >> >< Density = 0.0030
> >> >> Density = NaN
> >> >### Maximum absolute error in matching lines = 2.38e+04 at line 385 field 3
> >> >### Maximum relative error in matching lines = 1.55e+01 at line 257 field 3
> >> >
--
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 201
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)