Re: [AMBER-Developers] Parallel Test failures with CUDA 5.5

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 06 Feb 2014 08:38:16 -0800

Hi Dan,

Note there is almost certainly an underlying array overflow somewhere in
the parallel code. I would not be surprised if it is hidden when the 2 GPUs
are in the same box and only shows up on 2 GPUs when pinned memory and
multiple nodes are involved. So it's probably not a BlueWaters-specific
thing (although I would not be surprised if it was).
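
If it helps anyone narrow this down, below is a quick standalone check
(just a sketch against the CUDA runtime API, not anything from pmemd) that
reports whether the first two visible GPUs can talk peer-to-peer. When they
can't - e.g. when the ranks sit on different nodes - the transfers have to
be staged through pinned host memory, which is exactly the path I suspect
exposes the overflow.

/* Hypothetical standalone check (not part of pmemd): report whether the
 * first two GPUs visible to this process can use peer-to-peer access.
 * With P2P the buffers move directly between devices; without it they are
 * staged through (pinned) host memory. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) {
        printf("fewer than 2 GPUs visible; P2P not applicable\n");
        return 0;
    }
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  /* device 0 -> device 1 */
    cudaDeviceCanAccessPeer(&can10, 1, 0);  /* device 1 -> device 0 */
    printf("P2P 0->1: %s, 1->0: %s\n",
           can01 ? "yes" : "no", can10 ? "yes" : "no");
    return 0;
}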

I'll add this to the bug report and we'll figure out what is going wrong.
Note that we pretty much don't test on any systems where the GPUs are in
different nodes (except for REMD) these days, since peer-to-peer is so good
and the interconnects suck in comparison. So BlueWaters is probably the
only place where such runs will get tested, so please try every combination
you can in the run-up to release.
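
For that kind of multi-node testing, something like the toy 2-rank MPI+CUDA
program below can be a quick sanity check of the device -> pinned host ->
interconnect path independent of the AMBER test suite (again just a sketch,
not AMBER code; the buffer size and fill pattern are arbitrary). Each rank
stages a device buffer through pinned host memory, swaps it with the other
rank, and checks that nothing came back as garbage. Run it with one rank
per node.

/* Hypothetical 2-rank smoke test (not AMBER code): each rank fills a device
 * buffer, copies it into pinned host memory, and exchanges it with the
 * other rank via MPI_Sendrecv - the style of transfer used when
 * peer-to-peer is unavailable, e.g. across nodes. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)  /* 1M doubles per rank (arbitrary size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    double *d_buf, *h_send, *h_recv;
    cudaMalloc((void **)&d_buf, N * sizeof(double));
    cudaHostAlloc((void **)&h_send, N * sizeof(double), cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_recv, N * sizeof(double), cudaHostAllocDefault);

    /* Put a rank-dependent pattern on the device. */
    for (size_t i = 0; i < N; i++) h_send[i] = (double)(rank + 1);
    cudaMemcpy(d_buf, h_send, N * sizeof(double), cudaMemcpyHostToDevice);

    /* Stage device data through pinned memory and swap it with the peer. */
    cudaMemcpy(h_send, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, N, MPI_DOUBLE, 1 - rank, 0,
                 h_recv, N, MPI_DOUBLE, 1 - rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* We should see the other rank's pattern, with no NaNs or garbage. */
    int bad = 0;
    for (size_t i = 0; i < N; i++)
        if (h_recv[i] != (double)(2 - rank)) bad++;
    printf("rank %d: %d bad elements\n", rank, bad);

    cudaFree(d_buf);
    cudaFreeHost(h_send);
    cudaFreeHost(h_recv);
    MPI_Finalize();
    return bad != 0;
}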

All the best
Ross



On 2/5/14, 8:46 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:

>Hi,
>
>On Wed, Feb 5, 2014 at 7:08 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> So it looks like the problem is NOT cuda5.5 related but rather a bug in
>> the parallel GPU code that may be showing up on 2 GPUs elsewhere or
>> differently with different MPIs.
>>
>> Dan, what are your specs for the problems you are seeing?
>>
>
>This is running on Blue Waters, 2 XK nodes (Tesla K20X cards). It could
>just be something weird with their installed 5.5 libraries (wouldn't be
>the first time I've had issues with their libs). I will try to test this
>on some of our local GPUs tomorrow; I would do it now, but my internet
>connection has been going in and out at my house tonight and it's tough
>to write scripts when the terminal keeps disconnecting...
>
>One question: are all the GPUs you are testing in the same box? If so,
>maybe it's something to do with actually having to go across a network
>device?
>
>I'll let you know what I find tomorrow. Take care,
>
>-Dan
>
>
>>
>> All the best
>> Ross
>>
>>
>> On 2/5/14, 2:33 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:
>>
>> >Hi All,
>> >
>> >Has anyone seen really egregious test failures using
>> >pmemd.cuda.MPI/cuda5.5
>> >compiled from the GIT tree (updated today)? I'm getting some insane
>> >differences and '***' in energy fields (see below for an example, full
>> >test
>> >diffs attached). I do not see this problem with pmemd.cuda/cuda5.5 or
>> >pmemd.cuda.MPI/cuda5.0 (those diffs are attached as well and seem OK).
>> >This
>> >was compiled using GNU 4.8.2 compilers.
>> >
>> >Not sure if this means anything, but most of the failures seem to be
>> >with PME; the only GB stuff that fails is AMD-related.
>> >
>> >Any ideas?
>> >
>> >-Dan
>> >
>> >---------------------------------------
>> >possible FAILURE: check mdout.tip4pew_box_npt.dif
>> >/mnt/b/projects/sciteam/jn6/GIT/amber-gnu/test/cuda/tip4pew
>> >96c96
>> >< NSTEP = 1 TIME(PS) = 0.002 TEMP(K) = 122.92 PRESS = 42.6
>> >> NSTEP = 1 TIME(PS) = 0.002 TEMP(K) = 128.19 PRESS = 43.5
>> ><snip>
>> >426c426
>> >< NSTEP = 40 TIME(PS) = 0.080 TEMP(K) = 38.69 PRESS = 659.4
>> >> NSTEP = 40 TIME(PS) = 0.080 TEMP(K) = NaN PRESS = NaN
>> >427c427
>> >< Etot = 18.6535 EKtot = 231.6979 EPtot = 240.1483
>> >> Etot = NaN EKtot = NaN EPtot = NaN
>> >428c428
>> >< BOND = 0.6316 ANGLE = 1.2182 DIHED = 0.3663
>> >> BOND = ************** ANGLE = 361.5186 DIHED = 5.4026
>> >429c429
>> >< 1-4 NB = 0.8032 1-4 EEL = 1.3688 VDWAALS = 100.3454
>> >> 1-4 NB = ************** 1-4 EEL = ************** VDWAALS = NaN
>> >430c430
>> >< EELEC = 222.4484 EHBOND = 0. RESTRAINT = 0.
>> >> EELEC = NaN EHBOND = 0. RESTRAINT = 0.
>> >431c431
>> >< EKCMT = 131.0089 VIRIAL = 699.4621 VOLUME = 192.3578
>> >> EKCMT = 1278.0524 VIRIAL = NaN VOLUME = NaN
>> >432c432
>> >< Density = 0.0030
>> >> Density = NaN
>> >### Maximum absolute error in matching lines = 2.38e+04 at line 385 field 3
>> >### Maximum relative error in matching lines = 1.55e+01 at line 257 field 3
>> >
>>
>>
>>
>>
>
>
>
>--
>-------------------------
>Daniel R. Roe, PhD
>Department of Medicinal Chemistry
>University of Utah
>30 South 2000 East, Room 201
>Salt Lake City, UT 84112-5820
>http://home.chpc.utah.edu/~cheatham/
>(801) 587-9652
>(801) 585-6208 (Fax)


