Re: [AMBER-Developers] weird behavior for pmemd.cuda on Volta cards

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sat, 9 Dec 2017 12:33:49 -0500

Hi Dave

The published benchmarks for GTX cards were run at base clocks. I suspect Adrian's GTX-1080Tis are overclocked models, which would explain the difference. I always publish benchmarks at base clocks since I think that gives a fairer representation of the performance our end users should expect, i.e. the benchmark numbers should represent a floor on performance. Our users are thus happy when they see better performance than what is published. :-)
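
To put rough numbers on this, here are the JAC 4fs NVE figures Dave quotes further down (ns/day on a GTX-1080Ti) expressed as percentages. This is only arithmetic on the quoted numbers. It assumes, as suspected here but not verified, that the gap between the published 624 and Adrian's 698 (both Amber16) comes mainly from card clocks, and that the gap between 698 and 765 comes from Dave's code changes. The names below are illustrative.

```python
# JAC 4fs NVE figures quoted in this thread, GTX-1080Ti, in ns/day.
PUBLISHED_BASE_CLOCK = 624.0  # published website number (base clocks)
ADRIAN_AMBER16 = 698.0        # Adrian's cards, stock Amber16
ADRIAN_PATCHED = 765.0        # Adrian's cards, with Dave's changes (Aug 2017)


def percent_above(measured: float, reference: float) -> float:
    """Return how far `measured` lies above `reference`, in percent."""
    return 100.0 * (measured / reference - 1.0)


if __name__ == "__main__":
    # Same code, different cards: the part attributed to clocks.
    print(f"Amber16 vs published: {percent_above(ADRIAN_AMBER16, PUBLISHED_BASE_CLOCK):+.1f}%")
    # Same cards, different code: the part attributed to the code changes.
    print(f"Patched vs Amber16:   {percent_above(ADRIAN_PATCHED, ADRIAN_AMBER16):+.1f}%")
```

This prints +11.9% and +9.6%, so on these figures the clock difference alone would account for a gain comparable to the code changes, which is why base-clock numbers are the safer floor to publish.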

BTW, Dave, here is something you will likely rediscover in your work that Scott and I originally found around 2013: the temperature and power caps (they are independent) are a real pain in the butt for development. Numerous times we put in optimizations that boosted throughput, only to find when running extended benchmarks that the code was actually 'too efficient' for the GPU: it made the card either draw more power or run hotter, which ultimately caused it to clock down, and in many cases it ended up running slower than the 'less efficient' code. You pretty much need to keep a box running flat out at production temperature while doing development work, and then swap your benchmarks of the optimized code in and out of it, always running for at least 5 minutes to get a reliable performance number. Trust me, you will pull out lots of hair chasing the damn boost clock; it is super annoying.
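
The "at least 5 minutes, at production temperature" rule can be turned into a small check on logged throughput. A minimal sketch, assuming you have already collected (elapsed seconds, ns/day) samples from a long run by whatever means. The function names, the 60 s warm-up discard and the sample format are illustrative assumptions, not part of pmemd.cuda.

```python
from statistics import mean
from typing import List, Sequence, Tuple

# One reporting interval: (elapsed time in seconds, throughput in ns/day).
# Samples are assumed to be in time order.
Sample = Tuple[float, float]


def steady_state_throughput(samples: Sequence[Sample],
                            warmup_s: float = 60.0,
                            min_window_s: float = 300.0) -> float:
    """Mean throughput after discarding an initial warm-up period.

    Raises ValueError if what remains spans less than `min_window_s`
    seconds (5 minutes by default, per the rule of thumb above).
    """
    kept: List[Sample] = [(t, v) for t, v in samples if t >= warmup_s]
    if not kept or kept[-1][0] - kept[0][0] < min_window_s:
        raise ValueError("run too short for a reliable steady-state number")
    return mean(v for _, v in kept)


def clock_down_drift(samples: Sequence[Sample], window_s: float = 60.0) -> float:
    """Percent change in throughput from the first to the last `window_s` seconds.

    A clearly negative value suggests the card slowed down during the run,
    e.g. because it heated up or hit its power cap.
    """
    start, end = samples[0][0], samples[-1][0]
    first = [v for t, v in samples if t < start + window_s]
    last = [v for t, v in samples if t > end - window_s]
    return 100.0 * (mean(last) / mean(first) - 1.0)
```

Reporting the drift alongside the steady-state mean makes it obvious when a short run on a cool card overstated what the optimized code will sustain.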

And count yourself lucky that you are only running on one GPU. Try optimizing code that runs across multiple GPUs on multiple nodes, all in different boost-clock states, and you will discover what insanity feels like.
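
Why mixed boost clocks hurt multi-GPU runs can be seen with a toy model of a synchronized MD step. This is a simplified sketch, not how pmemd.cuda actually distributes work: it assumes each GPU's speed scales linearly with its current clock and that all GPUs wait for each other once per step, so the slowest card sets the pace. Function names and clock values are hypothetical.

```python
from typing import Sequence


def step_time(work_per_gpu: Sequence[float], clock_mhz: Sequence[float]) -> float:
    """Wall time of one synchronized step, in arbitrary units.

    Toy assumptions: each GPU's speed is proportional to its current clock,
    and every GPU waits at a barrier until the slowest one has finished.
    """
    if len(work_per_gpu) != len(clock_mhz):
        raise ValueError("need one clock per GPU")
    return max(w / c for w, c in zip(work_per_gpu, clock_mhz))


def parallel_efficiency(work_per_gpu: Sequence[float], clock_mhz: Sequence[float]) -> float:
    """Fraction of aggregate GPU time spent computing rather than waiting."""
    t = step_time(work_per_gpu, clock_mhz)
    busy = sum(w / c for w, c in zip(work_per_gpu, clock_mhz))
    return busy / (len(clock_mhz) * t)
```

With four equally loaded cards, one at 1200 MHz while the others hold 1500 MHz (made-up clocks) gives an efficiency of 0.85: the three faster cards sit idle for 20% of every step. Since each card's clock drifts independently, which card is the bottleneck can change from run to run.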

All the best
Ross

> On Dec 8, 2017, at 15:23, David Cerutti <dscerutti.gmail.com> wrote:
>
> One thing that's struck me about Adrian's benchmarks is that they
> consistently seem to beat the published values. Back in August, Adrian's
> crew was getting 698 ns/day on JAC 4fs NVE with Amber16 / GTX-1080Ti, where
> the website published 624. He got 765 with my improvements as of that date, so
> I think now he should be pushing 800 on that card. We'll see. Once we
> figure out what's going on with our Volta I think our next steps on the
> code will become clear.
>
> On Fri, Dec 8, 2017 at 3:00 PM, David A Case <david.case.rutgers.edu> wrote:
>
>> On Fri, Dec 08, 2017, Dan Roe wrote:
>>>
>>> One simple way to test this: when you notice the slowdown, run
>>> 'nvidia-smi' and note the performance state. If it's anything but P0
>>> you're being throttled. I believe this can also happen due to too much
>>> power consumption.
>>
>> The Perf value stays at P0, but I notice that the power usage declines
>> from 139W to 103W, then levels out at about 120W. Cap is at 250W.
>> Temperature never changes from 82C.
>>
>> It's still not clear to me that we actually have an identical software
>> environment to what Adrian has. But I admit it looks more like
>> hardware....
>>
>> ....dac
>>
>>
>> _______________________________________________
>> AMBER-Developers mailing list
>> AMBER-Developers.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber-developers
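
Dan's nvidia-smi check can be scripted so the performance state is recorded together with SM clock, power and temperature. A minimal sketch, assuming a reasonably recent nvidia-smi on the PATH; the query field names are those listed by `nvidia-smi --help-query-gpu`, while the helper names here are hypothetical. As Dave's readings show, a card can stay in P0 while its power draw moves around, and the boost clock varies within P0, so `pstate` alone may not reveal a slowdown; polling `clocks.sm` repeatedly over a long run (e.g. calling `query_gpus()` every few seconds) is more telling.

```python
import subprocess
from dataclasses import dataclass
from typing import List

# Field names as listed by `nvidia-smi --help-query-gpu`.
FIELDS = "index,pstate,clocks.sm,power.draw,power.limit,temperature.gpu"


@dataclass
class GpuSample:
    index: int
    pstate: str
    sm_clock_mhz: float
    power_w: float
    power_limit_w: float
    temp_c: float


def parse_query(csv_text: str) -> List[GpuSample]:
    """Parse `--format=csv,noheader,nounits` output (one line per GPU).

    Assumes every field is reported; boards that print "[N/A]" for a
    field (e.g. power.draw) will raise ValueError here.
    """
    samples = []
    for line in csv_text.strip().splitlines():
        idx, pstate, clk, pwr, lim, temp = (f.strip() for f in line.split(","))
        samples.append(GpuSample(int(idx), pstate, float(clk), float(pwr),
                                 float(lim), float(temp)))
    return samples


def query_gpus() -> List[GpuSample]:
    """Run nvidia-smi once and return one sample per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True)
    return parse_query(result.stdout)


def is_throttled(sample: GpuSample) -> bool:
    """Dan's rule of thumb: anything other than P0 means the card is throttled."""
    return sample.pstate != "P0"
```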

Received on Sat Dec 09 2017 - 10:00:01 PST