Re: [AMBER-Developers] weird behavior for pmemd.cuda on Volta cards

From: David Cerutti <dscerutti.gmail.com>
Date: Fri, 8 Dec 2017 12:42:32 -0500

That is weird, but it kind of confirms what I was thinking. Your card,
Adrian, seems not to have this problem. I noticed that your original
benchmarks were pretty consistent regardless of system size (the number
of steps in each test is tuned to make them all run in a reasonable
amount of time, though not in exactly the same amount of wall clock
time). If you'd had this behavior, it would have been apparent in your
earlier published results.

I've asked Ke Li at NVIDIA if he has any thoughts. Personally, my
hypothesis is that the cards have an on-board temperature sensor, and if
they get too hot they throttle back the clock frequency to manage power
output. In our passively cooled system, we may just not have the right
fan configuration to keep the card running at full speed for very long.
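
If anyone wants to check this hypothesis on their card, something along
these lines should show what the clocks are doing during a run (standard
nvidia-smi options; the sampling interval and log file name are just my
choices):

# Sample temperature, SM clock, and power draw every 5 s during the run
nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,power.draw \
    --format=csv -l 5 > gpu_log.csv &
# While the run is going, ask the driver whether (and why) it is throttling
nvidia-smi -q -d PERFORMANCE | grep -A 10 "Clocks Throttle Reasons"
kill %1    # stop the background logger when done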

In other news, I've got the JAC benchmark running at 770+ ns/day on GP100,
if anyone wants to download and test the new master branch. I don't know
whether this will carry over to Volta: the improvements that got me the
extra mileage on GP100 and GTX-1080Ti seem to be neutral at best on Volta,
with its dramatically higher SM count. That may itself be a by-product of
this longer-run slowdown issue, and even if it is not, I see other things
I can do that will fill up Volta's 80 SMs. We will run JAC at over 1000
ns/day on that card.

Dave


On Fri, Dec 8, 2017 at 12:33 PM, Adrian Roitberg <roitberg.ufl.edu> wrote:

> Hi
>
> We have not been able to reproduce this.
>
> Delaram in my group just finished a test on our system.
>
> I attach the output from the following script:
>
> nvidia-smi
>
> run jac regular
>
> nvidia-smi
>
> run jac long
>
> nvidia-smi
>
> run jac regular
>
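> Concretely, "run jac" there stands for something like the following
> (a sketch; the pmemd.cuda flags are standard, but the input and topology
> file names are placeholders for the actual benchmark files):
>
> nvidia-smi
> pmemd.cuda -O -i mdin.jac -p prmtop -c inpcrd -o jac_regular_1.out
> nvidia-smi
> pmemd.cuda -O -i mdin.jac.long -p prmtop -c inpcrd -o jac_long.out
> nvidia-smi
> pmemd.cuda -O -i mdin.jac -p prmtop -c inpcrd -o jac_regular_2.out
> grep "ns/day" jac_regular_1.out jac_long.out jac_regular_2.out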
>
> As you can see, the timings were actually a little bit better for the
> long run.
>
> [0] 1 x GPU: Note: The following floating-point exceptions are
> signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> | ns/day = 927.94 seconds/ns = 93.11
> [0] 1 x GPU: Note: The following floating-point exceptions are
> signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> | ns/day = 934.01 seconds/ns = 92.50
> [0] 1 x GPU: Note: The following floating-point exceptions are
> signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> | ns/day = 929.69 seconds/ns = 92.93
>
> Dave, my guess is that maybe the GPU temperature is running high?
>
> Adrian
>
>
>
> On 12/8/17 12:17 PM, David Cerutti wrote:
>
>> OK, that confirms some odd things I had been seeing. With systems larger
>> than JAC, and probably longer overall run times, I was also seeing
>> dramatic performance decreases, to the point where our Volta was giving
>> the performance of a GP100. It's good to know, then, that the Volta in
>> the Case lab is not unique (or uniquely broken).
>> Dave
>>
>>
>> On Fri, Dec 8, 2017 at 9:11 AM, David A Case <david.case.rutgers.edu>
>> wrote:
>>
>> Hi folks:
>>>
>>> The few developers that have Volta cards have reported markedly different
>>> speedups vs. Pascal for different benchmarks.
>>>
>>> I think these may be related to the following observation: jobs seem to
>>> slow
>>> down the longer they run. You can check this on the
>>> JAC_production_NVE_4fs
>>> benchmark: make nstlim ten times larger and re-run (you can increase
>>> ntwx and ntwr if you like; it doesn't seem to make much difference).
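>>>
>>> In shell terms, something like this (the nstlim pattern and the file
>>> names here depend on how the benchmark inputs are actually written):
>>>
>>> sed -e 's/nstlim *= *250000/nstlim=2500000/' mdin > mdin.10x
>>> pmemd.cuda -O -i mdin.10x -p prmtop -c inpcrd -o mdout.10x
>>> grep "ns/day" mdout.10x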
>>>
>>> For me, the default run (250000 steps) clocks at 923 ns/day (total time
>>> is 95 sec). This is in line with what others are getting.
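>>>
>>> (Sanity check on the arithmetic: ns/day = nstlim x dt[fs] x 1e-6 x
>>> 86400 / wall seconds. Here 250000 steps at 4 fs is 1 ns of dynamics,
>>> and 1 ns in 95 s of wall clock works out to about 909 ns/day; mdout's
>>> figure is a bit higher, presumably because it excludes setup time.
>>> In general:
>>>
>>> awk -v steps=250000 -v dt=4 -v wall=95 \
>>>     'BEGIN { printf "%.0f ns/day\n", steps*dt*1e-6*86400/wall }'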
>>>
>>> The 10x longer run returns 824 ns/day; if I also increase ntwx by a
>>> factor of 10, I get up to 847 ns/day (total time of 0.28 hours).
>>>
>>> A 100x run kind of plateaus at 830 ns/day (total time of 2.9 hours).
>>>
>>> For larger systems, the difference between the "short run" timings (which
>>> I suspect are typical of the official benchmarks) and "real" production
>>> runs can be larger. For a 391,000-atom system, 10000 steps (82 sec)
>>> runs at 51 ns/day, whereas 50000 steps (450 sec) runs at 40 ns/day,
>>> and 100000 steps (900 sec) is at 39 ns/day. These are jobs with
>>> ntwx=ntwr=100000, so there is no dumping of coordinates to disk, etc.
>>>
>>> So:
>>>
>>> 1. Be careful with benchmarks: the official JAC benchmark, at 250000
>>> steps, is not long enough for this platform (!?). The same is probably
>>> true for other benchmarks.
>>>
>>> 2. If we can figure out what is causing the slowdown, we might see a way
>>> to
>>> get performance improvements in legacy mode.
>>>
>>> ...dac
>>>
>>>
>
> --
> Dr. Adrian E. Roitberg
> University of Florida Research Foundation Professor
> Department of Chemistry
> University of Florida
> roitberg.ufl.edu
> 352-392-6972
>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Dec 08 2017 - 10:00:04 PST