Re: [AMBER-Developers] Parallel Test failures with CUDA 5.5

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 05 Feb 2014 18:08:20 -0800

2014/02/05 16:10 PST
git clone gitosis.git.ambermd.org:amber.git
cd amber
export AMBERHOME=`pwd`
./configure -cuda gnu
make -j8 install
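
(The parallel tests below exercise pmemd.cuda.MPI, which the serial -cuda build above does not produce; a minimal sketch of the extra MPI build pass that is assumed here, using the same gnu/mpich2 toolchain:)

# Assumed additional step: build the MPI CUDA binaries
./configure -cuda -mpi gnu
make -j8 install    # should yield $AMBERHOME/bin/pmemd.cuda.MPI (SPFP)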

nvcc 5.0 v0.2.1221
NVIDIA Driver 325.15
gcc 4.6.1
mpich2_eth/1.5
GTX-Titan GPUs
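
(For reference, a sketch of the commands used to collect the toolchain details above; exact output formats vary by release:)

nvcc --version                      # CUDA toolkit / nvcc build
gcc --version                       # host compiler
mpif90 -v                           # MPI compiler wrapper (mpich2_eth/1.5)
cat /proc/driver/nvidia/version     # NVIDIA kernel driver (325.15)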

cd test

CUDA 5.0

2 GPU - PEER TO PEER (Devices 0,1)
-----
export CUDA_VISIBLE_DEVICES=0,1
export DO_PARALLEL='mpirun -np 2'
export TESTsander=$AMBERHOME/bin/pmemd.cuda.MPI
./test_amber_cuda_parallel.sh SPFP

66 file comparisons passed
19 file comparisons failed

Diffs are all minor
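
(Optional sanity check, assuming the CUDA samples are installed under $CUDA_HOME/samples: confirm that the selected device pair really supports peer-to-peer before labelling the run:)

cd $CUDA_HOME/samples/0_Simple/simpleP2P
make
./simpleP2P    # prints whether P2P access is supported between each GPU pair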


2 GPU - NO PEER TO PEER (Devices 0,2)
-----
export CUDA_VISIBLE_DEVICES=0,2
export DO_PARALLEL='mpirun -np 2'
export TESTsander=$AMBERHOME/bin/pmemd.cuda.MPI
./test_amber_cuda_parallel.sh SPFP

66 file comparisons passed
19 file comparisons failed
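
(To confirm the failed comparisons are only numerical noise, the leftover *.dif files can be scanned for their reported maximum errors; a sketch, assuming the harness leaves them under $AMBERHOME/test/cuda:)

find $AMBERHOME/test/cuda -name "*.dif" -exec grep -H "Maximum relative error" {} \;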


4 GPU - NO PEER TO PEER (Devices 0,1,2,3)
-----
export CUDA_VISIBLE_DEVICES=0,1,2,3
export DO_PARALLEL='mpirun -np 4'
export TESTsander=$AMBERHOME/bin/pmemd.cuda.MPI
./test_amber_cuda_parallel.sh SPFP

All extra point tests in NPT blow up.
cd tip4pew/ && ./Run.tip4pew_box_npt_mcbar SPFP
/cbio/jclab/home/rcw/amber-dev/amber/include/netcdf.mod
*** glibc detected *** /cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI: free(): invalid next size (fast): 0x0000000004d728b0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3e45c76126]
/lib64/libc.so.6[0x3e45c78c53]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x54bd82]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x52ff97]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x4c08dc]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x4c103d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3e45c1ecdd]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x40701d]
======= Memory map: ========
00400000-00cbd000 r-xp 00000000 00:19 67475747 /cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda_SPFP.MPI
00ebd000-00f10000 rw-p 008bd000 00:19 67475747 /cbio/jclab/home/rcw/amber-dev/a


Cellulose runs also crash
cd cellulose/ && ./Run.cellulose_nvt_256_128_128 SPFP
/cbio/jclab/home/rcw/amber-dev/amber/include/netcdf.mod
*** glibc detected *** /cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI: free(): invalid next size (normal): 0x0000000004b79c10 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3e45c76126]
/lib64/libc.so.6[0x3e45c78c53]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x548e7e]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x548f40]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x54bd5e]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x52ff97]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x4c08dc]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x4c103d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3e45c1ecdd]
/cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda.MPI[0x40701d]
======= Memory map: ========
00400000-00cbd000 r-xp 00000000 00:19 67475747 /cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda_SPFP.MPI
00ebd000-00f10000 rw-p 008bd000 00:19 67475747 /cbio/jclab/home/rcw/amber-dev/amber/bin/pmemd.cuda_SPFP.MPI
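
(The backtraces above give only raw addresses inside pmemd.cuda_SPFP.MPI. A sketch for symbolising them with addr2line, using the addresses from the tip4pew crash; this resolves to file:line only if the binary retains debug info, so a rebuild with -g may be needed:)

addr2line -f -e $AMBERHOME/bin/pmemd.cuda_SPFP.MPI 0x54bd82 0x52ff97 0x4c08dc 0x4c103d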

The remaining tests appear to work with only small differences. A Bugzilla bug
has been filed for the 4-GPU extra-point NPT crash.

64 file comparisons passed
15 file comparisons failed
12 tests experienced errors
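
(A sketch for listing exactly which tests crash versus merely differ, by capturing the suite output directly rather than relying on any particular log location:)

./test_amber_cuda_parallel.sh SPFP 2>&1 | tee cuda50_4gpu.log
grep -E "possible FAILURE|glibc detected" cuda50_4gpu.log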


CUDA 5.5


Single GPU
106 file comparisons passed
17 file comparisons failed
All minor diffs.



2 GPU - PEER TO PEER (Devices 0,1)
-----
59 file comparisons passed
26 file comparisons failed
All minor diffs.


2 GPU - NO PEER TO PEER (Devices 0,2)
59 file comparisons passed
26 file comparisons failed
All minor diffs.


4 GPUs
Same problem as CUDA 5.0



So it looks like the problem is NOT CUDA 5.5 related, but rather a bug in
the parallel GPU code that may also show up on 2 GPUs elsewhere, or manifest
differently, with different MPI implementations.
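
(One way to chase the rank-count dependence: re-run a single failing case directly at 2 and 4 ranks with the same binary; paths and test name taken from the 4-GPU crash above:)

export TESTsander=$AMBERHOME/bin/pmemd.cuda.MPI
export CUDA_VISIBLE_DEVICES=0,1,2,3
for np in 2 4; do
    export DO_PARALLEL="mpirun -np $np"
    ( cd tip4pew && ./Run.tip4pew_box_npt_mcbar SPFP )
done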

Dan, what are your specs for the problems you are seeing?

All the best
Ross


On 2/5/14, 2:33 PM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:

>Hi All,
>
>Has anyone seen really egregious test failures using
>pmemd.cuda.MPI/cuda5.5
>compiled from the GIT tree (updated today)? I'm getting some insane
>differences and '***' in energy fields (see below for an example, full
>test
>diffs attached). I do not see this problem with pmemd.cuda/cuda5.5 or
>pmemd.cuda.MPI/cuda5.0 (those diffs are attached as well and seem OK).
>This
>was compiled using GNU 4.8.2 compilers.
>
>Not sure if this means anything, but most of the failures seem to be with
>PME; the only GB stuff that fails is AMD-related.
>
>Any ideas?
>
>-Dan
>
>---------------------------------------
>possible FAILURE: check mdout.tip4pew_box_npt.dif
>/mnt/b/projects/sciteam/jn6/GIT/amber-gnu/test/cuda/tip4pew
>96c96
>< NSTEP = 1 TIME(PS) = 0.002 TEMP(K) = 122.92 PRESS = 42.6
>> NSTEP = 1 TIME(PS) = 0.002 TEMP(K) = 128.19 PRESS = 43.5
><snip>
>426c426
>< NSTEP = 40 TIME(PS) = 0.080 TEMP(K) = 38.69 PRESS = 659.4
>> NSTEP = 40 TIME(PS) = 0.080 TEMP(K) = NaN PRESS = NaN
>427c427
>< Etot = 18.6535 EKtot = 231.6979 EPtot = 240.1483
>> Etot = NaN EKtot = NaN EPtot = NaN
>428c428
>< BOND = 0.6316 ANGLE = 1.2182 DIHED = 0.3663
>> BOND = ************** ANGLE = 361.5186 DIHED = 5.4026
>429c429
>< 1-4 NB = 0.8032 1-4 EEL = 1.3688 VDWAALS = 100.3454
>> 1-4 NB = ************** 1-4 EEL = ************** VDWAALS = NaN
>430c430
>< EELEC = 222.4484 EHBOND = 0. RESTRAINT = 0.
>> EELEC = NaN EHBOND = 0. RESTRAINT = 0.
>431c431
>< EKCMT = 131.0089 VIRIAL = 699.4621 VOLUME = 192.3578
>> EKCMT = 1278.0524 VIRIAL = NaN VOLUME = NaN
>432c432
>< Density = 0.0030
>> Density = NaN
>### Maximum absolute error in matching lines = 2.38e+04 at line 385 field 3
>### Maximum relative error in matching lines = 1.55e+01 at line 257 field 3
>
>--
>-------------------------
>Daniel R. Roe, PhD
>Department of Medicinal Chemistry
>University of Utah
>30 South 2000 East, Room 201
>Salt Lake City, UT 84112-5820
>http://home.chpc.utah.edu/~cheatham/
>(801) 587-9652
>(801) 585-6208 (Fax)
>_______________________________________________
>AMBER-Developers mailing list
>AMBER-Developers.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber-developers



_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Feb 05 2014 - 18:30:03 PST