Re: [AMBER-Developers] [AMBER] CUDA test failed from Ross Walker on 2014-10-16 (Amber Developers Archive Oct 2014)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 16 Oct 2014 12:24:57 -0700

Hi Dan,

I've been meaning to overhaul the GPU test suite for a while now but never
get a block of undisturbed time long enough to do it. We have the option
to set ig < -1 now which will use the CPU random number generator - that
will address the tests that depend on the random stream. We also need to
tweak dacdif a bit - one thing that is needed is a way to strip out the
average and RMS values from the end of the runs and not compare them - a
lot of the larger differences are in these parts of the mdout rather than
individual lines. I also need to repeat a check of DPFP against the CPU
code - I do this every now and then for sanity but haven't had a chance to
do it in a while.
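
For what it's worth, here is a rough sketch in Python of the stripping step
described above. It assumes the end-of-run summaries begin at the
letter-spaced "A V E R A G E S   O V E R" and "R M S  F L U C T U A T I O N S"
header lines in the mdout, and it simply drops everything from the first such
header onward (which also drops whatever follows the summaries, such as
timings). The function name is hypothetical, and this is not something dacdif
does today.

```python
import re

# Assumed section headers. The mdout prints them letter-spaced, so all
# whitespace is removed from each line before matching.
SUMMARY_MARKERS = ("AVERAGESOVER", "RMSFLUCTUATIONS")


def strip_summary_sections(mdout_text):
    """Return mdout text truncated at the first averages/RMS header line."""
    kept = []
    for line in mdout_text.splitlines():
        squeezed = re.sub(r"\s+", "", line)
        if any(marker in squeezed for marker in SUMMARY_MARKERS):
            break  # everything from here on is end-of-run summary
        kept.append(line)
    return "\n".join(kept)
```

The stripped reference and test outputs could then be handed to the existing
comparison, so that only the per-step lines are compared.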

So, at some point I'll fix it - but note this will be a large patch if we
plan to release it as an update.

All the best
Ross

On 10/16/14, 11:02 AM, "Daniel Roe" <daniel.r.roe.gmail.com> wrote:

>On Thu, Oct 16, 2014 at 11:54 AM, Scott Le Grand <varelse2005.gmail.com>
>wrote:
>> That's easily the result of running on a different GPU architecture
>> (Fermi versus Kepler, SM 3.0 versus SM 3.5) or using a different toolkit
>> than the system used to generate those reference files.
>
>I figured it would be something like that. Maybe we should print some
>sort of message in the CUDA SPFP test output to this effect? And also
>maybe tell them to run the tests with DPFP if they are really
>paranoid?
>
>-Dan
>
>>
>> And there's zero zip nada null I can do to address that. You guys would
>> have to build specific reference files for each class of GPU (GK104,
>> GK110, GM204, GM107, GF1xx) x Toolkits (6.5, 6.0, 5.5, 5.0)
>>
>> Scott
>>
>> On Thu, Oct 16, 2014 at 9:38 AM, Daniel Roe <daniel.r.roe.gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I know that there was some discussion of this right after the Amber 14
>>> release, but I wasn't sure where we ended up on this.
>>>
>>> On Fri, Oct 10, 2014 at 10:25 AM, Jason Swails <jason.swails.gmail.com>
>>> wrote:
>>> > Unfortunately, the pmemd.cuda tests have this problem sometimes. The
>>> > larger errors should occur for the stochastic tests (ntt=2 and ntt=3 --
>>> > the name often tells you if one of those thermostats is involved).
>>> >
>>> > The problem is that the random number generators are different on each
>>> > GPU model. As a result, it was impossible with the Amber 14 code to
>>> > design a test that would give identical results on all GPUs, even if you
>>> > specified the initial seed. As a result, the only cards that all tests
>>> > pass for are the cards that Ross used to create the test files in the
>>> > first place.
>>>
>>> Are we sure this is the correct explanation? I ask because I see some
>>> diffs that don't appear to be related to thermostat. For example, in
>>> the serial CUDA minimization test for chamber/dhfr_pbc at the final
>>> step the dihedral energy absolute diff is 2.33E-01:
>>>
>>> cuda/chamber/dhfr_pbc/mdout.dhfr_charmm_pbc_noshake_min.dif
>>> .< DIHED = 739.3609
>>> .> DIHED = 739.3595
>>>
>>> This is larger than what is usually considered "acceptable" for CPU -
>>> is it OK for GPU? There are many test diffs (at least using the
>>> compile from the GIT tree as of Oct. 8):
>>>
>>> Serial:
>>> 89 file comparisons passed
>>> 36 file comparisons failed
>>> 0 tests experienced errors
>>>
>>> MPI:
>>> 54 file comparisons passed
>>> 33 file comparisons failed
>>> 0 tests experienced errors
>>>
>>> I've attached a plot of the maximum absolute error grabbed from the
>>> related test diff files. From my cursory inspection of the diffs
>>> themselves most of this stuff does appear innocuous. In some cases the
>>> diffs are in the 'RMS fluctuations' section, or it's only a single step
>>> where e.g. the PRESS variable is off by 0.1, etc. However, I can see
>>> how these results would be very alarming for an everyday user.
>>>
>>> If I switch to the DPFP model all serial tests pass, and all parallel
>>> tests pass except for 'cnstph/explicit' (many small differences) and
>>> 'lipid_npt_tests/mdout_nvt_lipid14' (1 very small diff in EPtot), so
>>> this does appear to be a precision thing. I'm just wondering if there
>>> isn't some way we can improve the SPFP tests so they "work". I'm
>>> worried that if we get too used to seeing all of these diffs in the
>>> test output it will be harder to spot an actual problem if/when it
>>> arises.
>>>
>>> Thoughts?
>>>
>>> -Dan
>>>
>>> >
>>> > So as long as the diffs appear in some kind of "ntt2" or "ntt3" test
>>> > (Andersen or Langevin thermostats), and the remaining diffs are small,
>>> > you should be fine. FWIW, 37 sounds about the right number to me.
>>> >
>>> > HTH,
>>> > Jason
>>> >
>>> > --
>>> > Jason M. Swails
>>> > BioMaPS,
>>> > Rutgers University
>>> > Postdoctoral Researcher
>>> > _______________________________________________
>>> > AMBER mailing list
>>> > AMBER.ambermd.org
>>> > http://lists.ambermd.org/mailman/listinfo/amber
>>>
>>>
>>>
>>> --
>>> -------------------------
>>> Daniel R. Roe, PhD
>>> Department of Medicinal Chemistry
>>> University of Utah
>>> 30 South 2000 East, Room 307
>>> Salt Lake City, UT 84112-5820
>>> http://home.chpc.utah.edu/~cheatham/
>>> (801) 587-9652
>>> (801) 585-6208 (Fax)
>>>
>>> _______________________________________________
>>> AMBER-Developers mailing list
>>> AMBER-Developers.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>>
>>>

Received on Thu Oct 16 2014 - 12:30:03 PDT