Re: [AMBER-Developers] Nvidia DGX A100

From: Scott Le Grand <varelse2005.gmail.com>
Date: Wed, 22 Jul 2020 12:11:29 -0700

Okay, I am neck-deep in fixing basic failures that a debug build would
have exposed. It came down to three lines of code overriding the
behavior of nvcc, illegally and incorrectly. Don't ever do that again.
Seriously: do not override compiler macros with your idea of how things
should be, no matter how hard you have convinced yourself it's a good
idea, because it's not a good idea.
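
For anyone who hasn't hit this failure mode before, the general shape
of the anti-pattern looks something like the sketch below (illustrative
only -- these are not the actual offending lines). nvcc defines
__CUDA_ARCH__ itself, once per device compilation pass, so forcing it
to one value sends every pass down a single code path and silently
miscompiles the rest:

    // anti_pattern.cu -- illustrative only, NOT the actual Amber lines
    #undef  __CUDA_ARCH__
    #define __CUDA_ARCH__ 700   // never do this

    __global__ void kernel(float *x) {
    #if __CUDA_ARCH__ >= 700    // now always true, even on an sm_60 pass
        x[0] = 1.0f;            // compiled into every architecture
    #endif
    }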

But if you hand me a specific test, not a log of a bunch of tests, I
can hyper-focus on that specific test with a debug build and figure it
out. I was hoping that maybe, just maybe, some of that work could be
delegated to someone else in the Amber community, and then I could
provide an assist.

We should all be building a debug build of this application occasionally
because it will reveal all sorts of stuff that fails invisibly in the
release builds.
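
Concretely, here is the classic invisible failure -- a hypothetical
sketch, not Amber code. With a CMake build, a debug build usually means
configuring with -DCMAKE_BUILD_TYPE=Debug, which leaves NDEBUG
undefined so checks like this stay alive:

    // launch_check.cu -- hypothetical example, not Amber code
    #include <cstdio>

    __global__ void scale(float *x, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main(void) {
        int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        // 2048 threads per block exceeds the 1024-thread hardware
        // limit, so this kernel never actually executes.
        scale<<<n / 2048, 2048>>>(d_x, n, 2.0f);
    #ifndef NDEBUG
        // A debug build that checks every launch catches it here; a
        // release build (NDEBUG defined) sails on as if it had run.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "launch failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
    #endif
        cudaFree(d_x);
        return 0;
    }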

I currently do not have access to Ampere hardware. I am working on
that, and there are people trying to help me change the situation, but
the University of Florida has infinitely more Ampere hardware than I do
at the moment. Come on, guys, we need to own our $h!+ here.

On Wed, Jul 22, 2020, 12:07 David A Case <david.case.rutgers.edu> wrote:

> On Wed, Jul 22, 2020, Jason Swails wrote:
> >
> >The docker image used for Amber GPU builds is defined here:
>
> This is a long email thread, and I understand people saying things like:
> "we don't have enough information to say why this test or that test is
> failing on the Jenkins/CI server."
>
> BUT: I haven't seen anyone say something like "I took the released
> version, added Adrian's new CudaConfig.cmake file (attached here), ran
> 'make test.cuda.serial' on my system, and it works." Pick whatever GPU
> you have handy.
>
> This doesn't require any access to GitLab. And, if a failure occurs, at
> least that person has a machine on which debugging might be done. If
> no failures occur, *then* we can start to point the finger at the CI
> configuration, or maybe something specific to 780Ti cards.
>
> My frustration is this: we shouldn't be relying on Jason/Jenkins/CI to
> be testing things related to GPU problems. There are dozens of Amber
> developers who could try this out, and report what they find. (I know I
> am one of them, but I'm already spending hours every day on Amber-related
> business.)
>
>
> Various comments from the email thread that might be helpful:
>
> >>
> >> It looks like the only change in CudaConfig.cmake is really switching
> >> from using “compute_70,sm_70” to “compute_60,sm_70”. That was a topic
> >> we discussed at length. But it’s unclear whether that’s the reason for
> >> this test failure.
> >>
> >> Can we have a version that keeps EVERYTHING the same as before from
> >> Amber20, but just adds support for Ampere/cuda11 cleanly?
> >>
> >> On 7/14/20 6:37 PM, Peng Wang wrote:
> >>
> >> In any case, it should work if she just replaces the attached file
> >> with the master branch version.
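>
> My reading of that one-line difference, for whoever picks this up
> (sketched with a throwaway kernel, not Amber code): the virtual
> architecture is what sets __CUDA_ARCH__ during device compilation, so
> "compute_60,sm_70" compiles everything with __CUDA_ARCH__ == 600 and
> drops any Volta-only code path, even though the final SASS still
> targets sm_70 hardware.
>
>     // which_arch.cu -- illustrative only
>     #include <cstdio>
>
>     __global__ void which_arch(void) {
>     #if __CUDA_ARCH__ >= 700
>         printf("built against the compute_70 virtual arch\n");
>     #else
>         printf("built against compute_60 (Pascal feature set)\n");
>     #endif
>     }
>
>     int main(void) {
>         which_arch<<<1, 1>>>();
>         cudaDeviceSynchronize();
>         return 0;
>     }
>
> Built with -gencode arch=compute_60,code=sm_70 this prints the second
> message on a V100; with arch=compute_70,code=sm_70 it prints the
> first. Whether that explains the test failure is exactly what a run
> on real hardware would tell us.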
>
> What I want to do is replace "should work" with "works for me on this GPU".
>
> ....dac
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Jul 22 2020 - 13:00:05 PDT