On Thu, Oct 16, 2014 at 11:54 AM, Scott Le Grand <varelse2005.gmail.com> wrote:
> That's easily the result of running on a different GPU architecture (Fermi
> versus Kepler, SM 3.0 versus SM 3.5) or with a different CUDA toolkit than
> the one used to generate those reference files.
I figured it would be something like that. Maybe we should print some
sort of message to this effect in the CUDA SPFP test output? And maybe
also tell users to run the tests with DPFP if they are really paranoid?
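
Something along these lines is what I'm picturing -- purely a sketch (the
precision_model variable is made up, and the real test driver is a shell
script rather than Python):

    # Hypothetical check at the top of the GPU test driver: warn once before
    # the SPFP tests so small reference diffs don't alarm people.
    precision_model = "SPFP"  # placeholder; the driver would know this already
    if precision_model == "SPFP":
        print("NOTE: SPFP results depend on the GPU architecture and CUDA")
        print("      toolkit used, so small differences from the reference")
        print("      files are expected. For bit-level comparisons, rebuild")
        print("      and rerun the tests with the DPFP precision model.")
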
-Dan
>
> And there's zero zip nada null I can do to address that. You guys would
> have to build specific reference files for each class of GPU (GK104, GK110,
> GM204, GM107, GF1xx) x toolkits (6.5, 6.0, 5.5, 5.0).
>
> Scott
>
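(Just to put a number on that: the matrix Scott describes is already 20 sets
of reference files. The GPU classes and toolkit versions below are the ones
he lists; this is only meant to illustrate the maintenance burden.)

    # Count the reference-file sets needed for per-GPU, per-toolkit baselines.
    from itertools import product

    gpu_classes = ["GK104", "GK110", "GM204", "GM107", "GF1xx"]
    toolkits = ["6.5", "6.0", "5.5", "5.0"]

    print(len(list(product(gpu_classes, toolkits))))
    # -> 20 reference sets, and that's before counting SPFP vs. DPFP
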
> On Thu, Oct 16, 2014 at 9:38 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:
>
>> Hi,
>>
>> I know that there was some discussion of this right after the Amber 14
>> release, but I wasn't sure where we ended up on this.
>>
>> On Fri, Oct 10, 2014 at 10:25 AM, Jason Swails <jason.swails.gmail.com>
>> wrote:
>> > Unfortunately, the pmemd.cuda tests have this problem sometimes. The
>> > larger errors should occur for the stochastic tests (ntt=2 and ntt=3
>> > -- the name often tells you if one of those thermostats is involved).
>> >
>> > The problem is that the random number generators are different on
>> > each GPU model. As a result, it was impossible with the Amber 14 code
>> > to design a test that would give identical results on all GPUs, even
>> > if you specified the initial seed. So the only cards on which all
>> > tests pass are the cards that Ross used to create the test files in
>> > the first place.
>>
>> Are we sure this is the correct explanation? I ask because I see some
>> diffs that don't appear to be related to a thermostat. For example, in
>> the serial CUDA minimization test for chamber/dhfr_pbc, the dihedral
>> energy absolute diff at the final step is 2.33E-01:
>>
>> cuda/chamber/dhfr_pbc/mdout.dhfr_charmm_pbc_noshake_min.dif
>> .< DIHED = 739.3609
>> .> DIHED = 739.3595
>>
>> This is larger than what is usually considered "acceptable" for CPU -
>> is it OK for GPU? There are many test diffs (at least using the
>> compile from the GIT tree as of Oct. 8):
>>
>> Serial:
>> 89 file comparisons passed
>> 36 file comparisons failed
>> 0 tests experienced errors
>>
>> MPI:
>> 54 file comparisons passed
>> 33 file comparisons failed
>> 0 tests experienced errors
>>
>> I've attached a plot of the maximum absolute error grabbed from the
>> related test diff files. From my cursory inspection of the diffs
>> themselves, most of this stuff does appear innocuous. In some cases the
>> diffs are in the 'RMS fluctuations section', or it's only a single step
>> where e.g. the PRESS variable is off by 0.1, etc. However, I can see
>> how these results would be very alarming for an everyday user.
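
(In case anyone wants to reproduce the plot, a quick-and-dirty sketch along
these lines will pull the largest absolute difference out of each .dif file.
It assumes the "<"/">" lines pair up in order, as in the excerpt above, so
treat the numbers as approximate; it is not the exact script I used.)

    # Rough sketch: report the maximum absolute numeric difference found in
    # each .dif file under the current test directory.
    import os
    import re

    number = re.compile(r"-?\d+\.\d+(?:[eE][-+]?\d+)?")

    for dirpath, dirnames, filenames in os.walk("."):
        for name in filenames:
            if not name.endswith(".dif"):
                continue
            path = os.path.join(dirpath, name)
            old_vals, new_vals = [], []
            with open(path) as fh:
                for line in fh:
                    stripped = line.lstrip(". ")
                    if stripped.startswith("<"):
                        old_vals += [float(x) for x in number.findall(stripped)]
                    elif stripped.startswith(">"):
                        new_vals += [float(x) for x in number.findall(stripped)]
            diffs = [abs(a - b) for a, b in zip(old_vals, new_vals)]
            if diffs:
                print("%s: max abs diff %.3e" % (path, max(diffs)))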
>>
>> If I switch to the DPFP model all serial tests pass, and all parallel
>> tests pass except for 'cnstph/explicit' (many small differences) and
>> 'lipid_npt_tests/mdout_nvt_lipid14' (1 very small diff in EPtot), so
>> this does appear to be a precision thing. I'm just wondering if there
>> isn't some way we can improve the SPFP tests so they "work". I'm
>> worried that if we get too used to seeing all of these diffs in the
>> test output it will be harder to spot an actual problem if/when it
>> arises.
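
One possibility -- strictly a sketch, and the tolerances below are guesses
rather than validated SPFP error bounds -- would be to compare SPFP output
against the references with a relative/absolute tolerance instead of
requiring near-exact agreement:

    # Sketch of a tolerance-based check for SPFP energies: two values "match"
    # if they agree to within a relative or absolute cutoff.
    def values_match(ref, test, rel_tol=5.0e-4, abs_tol=5.0e-2):
        # rel_tol/abs_tol are placeholders, not measured SPFP error bounds.
        return abs(ref - test) <= max(abs_tol, rel_tol * abs(ref))

    print(values_match(739.3609, 739.3595))  # True: the DIHED pair above
    print(values_match(739.3609, 741.0000))  # False: worth a closer look
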
>>
>> Thoughts?
>>
>> -Dan
>>
>> >
>> > So as long as the diffs appear in some kind of "ntt2" or "ntt3" test
>> > (Andersen or Langevin thermostats), and the remaining diffs are small,
>> > you should be fine. FWIW, 37 sounds about the right number to me.
>> >
>> > HTH,
>> > Jason
>> >
>> > --
>> > Jason M. Swails
>> > BioMaPS,
>> > Rutgers University
>> > Postdoctoral Researcher
--
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 307
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers