Re: [AMBER-Developers] Nvidia DGX A100 from David Cerutti on 2020-07-22 (Amber Developers Archive Jul 2020)

From: David Cerutti <dscerutti.gmail.com>
Date: Wed, 22 Jul 2020 16:18:55 -0400

So I've looked over Adrian's CudaConfig.cmake and the difference between it
and the master branch CudaConfig.cmake I have already tested is that
$SM70FLAGS gets reset if the CUDA compiler is version 10.0, 10.1, or 10.2.
In master, we have:

set(SM70FLAGS -gencode arch=compute_60,code=sm_70)
(... begin an if / elseif / elseif case switch ...)
elseif(${CUDA_VERSION} VERSION_EQUAL 11.0)
set(SM70FLAGS -gencode arch=compute_70,code=sm_70)

The code then appends this and other SM##FLAGS variables to
CUDA_NVCC_FLAGS. In Adrian's CudaConfig.cmake, the

set(SM70FLAGS -gencode arch=compute_70,code=sm_70)

is happening if $CUDA_VERSION is 10.0, 10.1, or 10.2, as well as if
$CUDA_VERSION is 11.0. It seems to me that the only thing we'd learn by
retesting the master branch with this CudaConfig.cmake is that the way we
used to be doing $SM70FLAGS was safe for Volta and will be safe again.
Curious why Adrian's file doesn't also do this for CUDA 9.0 / 9.1.
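
To make the comparison concrete, here is a minimal CMake sketch of the two
behaviors described above. It is not the actual contents of either
CudaConfig.cmake: only the two set(SM70FLAGS ...) lines come from the files,
while the USE_ADRIAN_VARIANT switch, the layout of the conditions, and the
list(APPEND ...) call are assumptions made for illustration.

```cmake
# Illustrative sketch only. USE_ADRIAN_VARIANT is a hypothetical switch,
# not a variable in Amber's build system.

# Default from master: compute_60 virtual arch, sm_70 binary.
set(SM70FLAGS -gencode arch=compute_60,code=sm_70)

if(CUDA_VERSION VERSION_EQUAL 11.0)
  # Both files switch to compute_70 for CUDA 11.0.
  set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
elseif(USE_ADRIAN_VARIANT AND
       (CUDA_VERSION VERSION_EQUAL 10.0 OR
        CUDA_VERSION VERSION_EQUAL 10.1 OR
        CUDA_VERSION VERSION_EQUAL 10.2))
  # Adrian's file also does so for CUDA 10.0, 10.1, and 10.2.
  set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
endif()

# Assumed form of the append step described above.
list(APPEND CUDA_NVCC_FLAGS ${SM70FLAGS})
message(STATUS "SM70FLAGS = ${SM70FLAGS}")
```

Saved as a standalone script (say sketch.cmake, a hypothetical name), this
can be run with
cmake -DCUDA_VERSION=10.1 -DUSE_ADRIAN_VARIANT=ON -P sketch.cmake
to see which flags each variant would select for a given CUDA version.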

Dave

On Wed, Jul 22, 2020 at 3:11 PM Scott Le Grand <varelse2005.gmail.com>
wrote:

> Okay, I am neck-deep in fixing basic failures that a debug build would have
> exposed. It came down to three lines of code overriding the behavior of
> nvcc illegally and incorrectly. Don't ever do it again. Seriously, do not
> override compiler macros with your idea of how things should be, no matter
> how hard you have convinced yourself it's a good idea, because it's not a
> good idea.
>
> But if you hand me a specific test, not a log of a bunch of tests, I can
> hyper-focus on that specific test with a debug build and figure it out. I
> was hoping that maybe, just maybe, some of that work could be delegated to
> someone else in the Amber community, and then I could provide an assist.
>
> We should all be building a debug build of this application occasionally
> because it will reveal all sorts of stuff that fails invisibly in the
> release builds.
>
> I currently do not have access to Ampere hardware. I am working on that,
> and there are people trying to help me change that situation, but the
> University of Florida has infinitely more Ampere hardware than I do at the
> moment. Come on guys, we need to own our $h!+ here.
>
> On Wed, Jul 22, 2020, 12:07 David A Case <david.case.rutgers.edu> wrote:
>
>> On Wed, Jul 22, 2020, Jason Swails wrote:
>> >
>> >The docker image used for Amber GPU builds is defined here:
>>
>> This is a long email thread, and I understand people saying things like:
>> "we don't have enough information to say why this test or that test is
>> failing on the Jenkins/CI server."
>>
>> BUT: I haven't seen anyone say something like "I took the released
>> version, added Adrian's new CudaConfig.cmake file (attached here), ran
>> 'make test.cuda.serial' on my system, and it works." Pick whatever GPU you
>> have handy.
>>
>> This doesn't require any access to gitlab. And, if a failure occurs, at
>> least that person has a machine on which debugging might be done. If
>> no failures occur, *then* we can start to point the finger at the CI
>> configuration, or maybe something specific to 780Ti cards.
>>
>> My frustration is this: we shouldn't be relying on Jason/Jenkins/CI to
>> be testing things related to GPU problems. There are dozens of Amber
>> developers who could try this out, and report what they find. (I know I
>> am one of them, but I'm already spending hours every day on Amber-related
>> business.)
>>
>>
>> Various comments from the email thread that might be helpful:
>>
>> >>
>> >> It looks like the only change in CudaConfig.cmake is really switching
>> >> from using “compute_70,sm_70” to “compute_60,sm_70”. That was a topic we
>> >> discussed at length. But it’s unclear whether that’s the reason for this
>> >> test failure.
>> >>
>> >> Can we have a version that keeps EVERYTHING the same as before from
>> >> Amber20, but just adds support for Ampere/cuda11 cleanly?
>> >>
>> >> On 7/14/20 6:37 PM, Peng Wang wrote:
>> >>
>> >> In any case, it should work if she just replaces the attached file with
>> >> the master branch version.
>>
>> What I want to do is replace "should work" with "works for me on this
>> GPU".
>>
>> ....dac
>>
>> _______________________________________________
>> AMBER-Developers mailing list
>> AMBER-Developers.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber-developers
>>
>
Received on Wed Jul 22 2020 - 13:30:04 PDT