Re: [AMBER-Developers] Parallel Test failures with CUDA 5.5

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Thu, 6 Feb 2014 17:19:20 -0700

OK, so this turns out to be a case of me having too many GIT trees, some of
which were a few (crucial) commits behind. I think the missing
commit 0f967a5ad0326c7e61d4cabdc241c7662f64eee0 (2014-01-31 RCW: Fix Monte
Carlo Barostat...) was what caused the really bad failures. I was
suspicious of the inconsistent failures between BW and stampede, so I
compiled a fresh clone on stampede with CUDA 5.0, after which the
parallel test results look much better:

***CUDA 5.0 Parallel (stampede)***
57 file comparisons passed
28 file comparisons failed (minor diffs)
0 tests experienced errors

I did the same thing on BW with CUDA 5.5:

***CUDA 5.5 Parallel (BW)***
59 file comparisons passed
26 file comparisons failed (minor diffs)
0 tests experienced errors

So on both machines I had one directory I forgot to pull after Jan 31, and
those ended up being the *bad* trees. The ones I did remember to pull were
the *good* ones.

However, I do see some issues with 4 GPUs (on stampede; I haven't tested BW
yet). 'cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar' passes for me, but when
I run 'cd tip5p/ && ./Run.tip5p_box_nvt' I get:

UNKNOWN
cudaFree GpuBuffer::Deallocate failed unknown error

and everything hangs.

Also, it seems like it's been a while since the CUDA REMD tests have been
run (if ever). The GB 2-replica test sets too small a cutoff, and the .save
files for the different precision models appear to be missing. I'll work on
getting these tests up to speed. Any guidelines on what should be the 'gold
standard' for GPU test output?

-Dan

PS - We may need to re-think how the 'numprocs' program works. The parallel
executor at some sites (e.g. ibrun, aprun, etc.) prints extra stuff to
STDOUT, which breaks tests that use the numprocs output. For example, 'ibrun
-n 2 -o 0 ./numprocs' gives you:

TACC: Starting up job 2733053
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
2
TACC: Shutdown complete. Exiting.

We could either prepend a 'tag' like 'AMBERNPROCS' to the output and grep
for that (probably the easiest solution), or just have numprocs write to a
file...
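
Something like the following could work for the tag approach (just a sketch,
not the actual numprocs source; the AMBERNPROCS tag name is an arbitrary
placeholder):

    /* Hypothetical numprocs sketch: rank 0 prints the MPI size behind a
     * fixed tag so the test scripts can pick out the one line they need,
     * no matter what else the site launcher writes to STDOUT. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank = 0, size = 1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0)
            printf("AMBERNPROCS %d\n", size);
        MPI_Finalize();
        return 0;
    }

The test scripts would then grep for the AMBERNPROCS line and take its second
field, which ignores launcher chatter like the TACC lines above. Writing to a
file instead would avoid the grep, but then every test would have to clean
the file up afterwards.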



On Thu, Feb 6, 2014 at 10:42 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:

> Hi,
>
> Here are some runs I just did on stampede GPU nodes (Tesla K20m), intel
> 13.0.2, mvapich2/1.9a2.
>
> ***Cuda 5.0 - Serial***
> 98 file comparisons passed
> 18 file comparisons failed (minor diffs)
> 6 tests experienced errors (see below)
>
> The errors (really just 2 failures, each counted 3 times) are from the following:
>
> cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar SPFP
> /home1/00301/tg455746/GIT/amber/include/netcdf.mod
> ERROR: Calculation halted. Periodic box dimensions have changed too much
> from their initial values.
>
> This happens for 'cd tip4pew/ && ./Run.tip4pew_oct_npt_mcbar' as well.
>
> ***Cuda 5.0 - Parallel***
> 47 file comparisons passed
> 30 file comparisons failed
> 14 tests experienced errors
>
> Here the diffs are major again (some absolute errors on the order of
> 10^2-10^5!), similar to what I was seeing on BW with Cuda 5.5. I ran it
> twice just to be sure and got exactly the same diffs both times. The errors
> come from 3 tests:
> 'cd tip4pew/ && ./Run.tip4pew_box_npt' (| ERROR: max pairlist cutoff
> must be less than unit cell max sphere radius!)
>
> 'cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar' (unspecified launch failure,
> probably related to what happens in serial)
>
> 'cd tip4pew/ && ./Run.tip4pew_oct_npt_mcbar' (same as serial)
>
> ***Cuda 5.5 - Parallel***
> 55 file comparisons passed
> 30 file comparisons failed
> 0 tests experienced errors
>
> All diffs appear minor. So now I have the case where the Cuda 5.5-compiled
> parallel code is behaving well, which is a bit maddening, but it supports
> your idea that the bug is not Cuda version-specific. Note too that the
> tip4pew-related errors are gone. I ran these tests twice as well and the
> diffs didn't quite match; I think it may be innocuous, though:
>
> < >   Etot = 0.1629   EKtot = 54.6589   EPtot = 54.6898
> < ### Maximum absolute error in matching lines = 1.00e-04 at line 206 field 3
> < ### Maximum relative error in matching lines = 6.14e-04 at line 206 field 3
> ---
> > >   Etot = 0.1628   EKtot = 54.6589   EPtot = 54.6898
> > ### Maximum absolute error in matching lines = 2.00e-04 at line 206 field 3
> > ### Maximum relative error in matching lines = 1.23e-03 at line 206 field 3
>
> ***Cuda 5.5 Serial***
> 95 file comparisons passed
> 28 file comparisons failed
> 0 tests experienced errors
>
> Again, the serial code performs well with only minor diffs, and the tip4pew
> errors are not present.
>
> So it seems like whatever issue causes the tip4pew tests to fail in serial
> may be related to the problems in parallel; i.e. failure of those tests in
> serial is a predictor of more massive failures in parallel. I will continue
> testing; in particular I'm going to try to reproduce my 'good' tests on BW
> with cuda 5.0 after a re-compile. If you can think of any other tests I
> should run, let me know.
>
> Take care,
>
> -Dan
>
>
> On Thu, Feb 6, 2014 at 9:38 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Dan,
>>
>> Note there is almost certainly an underlying overflow of an array
>> somewhere in parallel. I would not be surprised if it is hidden when the
>> 2 GPUs are in the same box, and only shows up on just 2 GPUs when using
>> pinned memory and multiple nodes. So it's probably not a BlueWaters-specific
>> thing (although I would not be surprised if it was).
>>
>> I'll add this to the bug report and we'll figure out what is going wrong.
>> Note we pretty much don't test on any systems where GPUs are in different
>> nodes (except for REMD) these days, since peer-to-peer is so good and the
>> interconnects suck in comparison. So BlueWaters is probably the only place
>> where such runs will get tested. So please try every combination you can
>> in the run-up to release.
>>
>> All the best
>> Ross
>>
>>
>>
>> On 2/5/14, 8:46 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:
>>
>> >Hi,
>> >
>> >On Wed, Feb 5, 2014 at 7:08 PM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>> >
>> >> So it looks like the problem is NOT cuda5.5 related but rather a bug in
>> >> the parallel GPU code that may be showing up on 2 GPUs elsewhere or
>> >> differently with different MPIs.
>> >>
>> >> Dan, what are your specs for the problems you are seeing?
>> >>
>> >
>> >This is running on BlueWaters, 2 XK nodes (Tesla K20X cards). It could
>> >just be something weird with their installed 5.5 libraries (wouldn't be
>> >the first time I've had issues with their libs). I will try to test this
>> >on some of our local GPUs tomorrow; I would do it now, but my internet
>> >connection has been going in and out at my house tonight and it's tough
>> >to write scripts when the terminal keeps disconnecting...
>> >
>> >One question: are all the GPUs you are testing in the same box? If so,
>> >maybe it's something to do with actually having to go across a network
>> >device?
>> >
>> >I'll let you know what I find tomorrow. Take care,
>> >
>> >-Dan
>> >
>> >
>> >>
>> >> All the best
>> >> Ross
>> >>
>> >>
>> >> On 2/5/14, 2:33 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:
>> >>
>> >> >Hi All,
>> >> >
>> >> >Has anyone seen really egregious test failures using
>> >> >pmemd.cuda.MPI/cuda5.5 compiled from the GIT tree (updated today)? I'm
>> >> >getting some insane differences and '***' in energy fields (see below
>> >> >for an example, full test diffs attached). I do not see this problem
>> >> >with pmemd.cuda/cuda5.5 or pmemd.cuda.MPI/cuda5.0 (those diffs are
>> >> >attached as well and seem OK). This was compiled using GNU 4.8.2
>> >> >compilers.
>> >> >
>> >> >Not sure if this means anything, but most of the failures seem to be
>> >> >with PME; the only GB stuff that fails is AMD-related.
>> >> >
>> >> >Any ideas?
>> >> >
>> >> >-Dan
>> >> >
>> >> >---------------------------------------
>> >> >possible FAILURE: check mdout.tip4pew_box_npt.dif
>> >> >/mnt/b/projects/sciteam/jn6/GIT/amber-gnu/test/cuda/tip4pew
>> >> >96c96
>> >> >< NSTEP = 1 TIME(PS) = 0.002 TEMP(K) = 122.92 PRESS = 42.6
>> >> >> NSTEP = 1 TIME(PS) = 0.002 TEMP(K) = 128.19 PRESS = 43.5
>> >> ><snip>
>> >> >426c426
>> >> >< NSTEP = 40 TIME(PS) = 0.080 TEMP(K) = 38.69 PRESS = 659.4
>> >> >> NSTEP = 40 TIME(PS) = 0.080 TEMP(K) = NaN PRESS = NaN
>> >> >427c427
>> >> >< Etot = 18.6535 EKtot = 231.6979 EPtot = 240.1483
>> >> >> Etot = NaN EKtot = NaN EPtot = NaN
>> >> >428c428
>> >> >< BOND = 0.6316 ANGLE = 1.2182 DIHED = 0.3663
>> >> >> BOND = ************** ANGLE = 361.5186 DIHED = 5.4026
>> >> >429c429
>> >> >< 1-4 NB = 0.8032 1-4 EEL = 1.3688 VDWAALS = 100.3454
>> >> >> 1-4 NB = ************** 1-4 EEL = ************** VDWAALS = NaN
>> >> >430c430
>> >> >< EELEC = 222.4484 EHBOND = 0. RESTRAINT = 0.
>> >> >> EELEC = NaN EHBOND = 0. RESTRAINT = 0.
>> >> >431c431
>> >> >< EKCMT = 131.0089 VIRIAL = 699.4621 VOLUME = 192.3578
>> >> >> EKCMT = 1278.0524 VIRIAL = NaN VOLUME = NaN
>> >> >432c432
>> >> >< Density = 0.0030
>> >> >> Density = NaN
>> >> >### Maximum absolute error in matching lines = 2.38e+04 at line 385 field 3
>> >> >### Maximum relative error in matching lines = 1.55e+01 at line 257 field 3
>> >> >
>> >> >--
>> >> >-------------------------
>> >> >Daniel R. Roe, PhD
>> >> >Department of Medicinal Chemistry
>> >> >University of Utah
>> >> >30 South 2000 East, Room 201
>> >> >Salt Lake City, UT 84112-5820
>> >> >http://home.chpc.utah.edu/~cheatham/
>> >> >(801) 587-9652
>> >> >(801) 585-6208 (Fax)
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >--
>> >-------------------------
>> >Daniel R. Roe, PhD
>> >Department of Medicinal Chemistry
>> >University of Utah
>> >30 South 2000 East, Room 201
>> >Salt Lake City, UT 84112-5820
>> >http://home.chpc.utah.edu/~cheatham/
>> >(801) 587-9652
>> >(801) 585-6208 (Fax)
>>
>>
>>
>>
>
>
>
> --
> -------------------------
> Daniel R. Roe, PhD
> Department of Medicinal Chemistry
> University of Utah
> 30 South 2000 East, Room 201
> Salt Lake City, UT 84112-5820
> http://home.chpc.utah.edu/~cheatham/
> (801) 587-9652
> (801) 585-6208 (Fax)
>



-- 
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 201
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers