Re: [AMBER-Developers] Nvidia DGX A100

From: Adrian Roitberg <roitberg.ufl.edu>
Date: Wed, 22 Jul 2020 16:21:43 -0400

Hi


For now, sorry for the noise. Peng, Jon and I are discussing this over a
different channel and will try some things and get back to the group.

Thanks!

adrian


On 7/22/20 4:18 PM, David Cerutti wrote:
>
> So I've looked over Adrian's CudaConfig.cmake, and the difference
> between it and the master-branch CudaConfig.cmake I have already
> tested is that $SM70FLAGS gets reset when the CUDA compiler is
> version 10.0, 10.1, or 10.2.  In master, we have:
>
> set(SM70FLAGS -gencode arch=compute_60,code=sm_70)
> (... begin an if / elseif / elseif case switch ...)
> elseif(${CUDA_VERSION} VERSION_EQUAL 11.0)
>   set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
> The code then appends this and other SM##FLAGS variables to
> CUDA_NVCC_FLAGS.  In Adrian's CudaConfig.cmake, the
>
> set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
>
> is happening when $CUDA_VERSION is 10.0, 10.1, or 10.2, as well as
> when $CUDA_VERSION is 11.0.  It seems to me that the only thing we'd
> learn by retesting the master branch with this CudaConfig.cmake is
> that the way we used to set $SM70FLAGS was safe for Volta and will
> be safe again.  I'm curious why Adrian's file doesn't also do this
> for CUDA 9.0 / 9.1.
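>
> To make that concrete, here is a minimal sketch of the case switch
> being described (my reconstruction from the description above, not a
> verbatim copy of either file):
>
>     set(SM70FLAGS -gencode arch=compute_60,code=sm_70)
>     if(${CUDA_VERSION} VERSION_EQUAL 10.0 OR
>        ${CUDA_VERSION} VERSION_EQUAL 10.1 OR
>        ${CUDA_VERSION} VERSION_EQUAL 10.2)
>       # Adrian's file resets SM70FLAGS here; master keeps the compute_60 default
>       set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
>     elseif(${CUDA_VERSION} VERSION_EQUAL 11.0)
>       set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
>     endif()
>     # both variants then append the flags for nvcc
>     list(APPEND CUDA_NVCC_FLAGS ${SM70FLAGS})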
>
> Dave
>
>
> On Wed, Jul 22, 2020 at 3:11 PM Scott Le Grand
> <varelse2005.gmail.com> wrote:
>
> Okay, I am neck-deep in fixing basic failures that a debug build
> would have exposed. It came down to three lines of code that
> overrode the behavior of nvcc illegally and incorrectly. Don't ever
> do that again: seriously, do not override compiler macros with your
> idea of how things should be, no matter how hard you have convinced
> yourself it's a good idea, because it's not a good idea.
>
> If you hand me a specific test, not a log of a bunch of tests, I
> can hyper-focus on that test with a debug build and figure it out.
> But I was hoping that maybe, just maybe, some of that work could be
> delegated to someone else in the Amber community, and then I could
> provide an assist.
>
> We should all be building a debug build of this application
> occasionally because it will reveal all sorts of stuff that fails
> invisibly in the release builds.
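>
> For anyone who wants to try this, a debug build is just the usual
> cmake configuration with the build type switched; roughly (the
> -DCUDA flag is the standard one, everything else will vary with
> your setup):
>
>     # sketch: your normal configure line, plus the Debug build type
>     cmake $AMBER_SRC -DCUDA=TRUE -DCMAKE_BUILD_TYPE=Debug <your usual flags>
>     make -j install && make test.cuda.serial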
>
> I currently do not have access to Ampere hardware. I am working on
> that, and there are people trying to help me change the situation,
> but the University of Florida has infinitely more Ampere hardware
> than I do at the moment. Come on, guys, we need to own our $h!+ here.
>
> On Wed, Jul 22, 2020, 12:07 David A Case
> <david.case.rutgers.edu> wrote:
>
> On Wed, Jul 22, 2020, Jason Swails wrote:
> >
> >The docker image used for Amber GPU builds is defined here:
>
> This is a long email thread, and I understand people saying things
> like "we don't have enough information to say why this test or
> that test is failing on the Jenkins/CI server."
>
> BUT: I haven't seen anyone say something like "I took the released
> version, added Adrian's new CudaConfig.cmake file (attached here),
> ran 'make test.cuda.serial' on my system, and it works." Pick
> whatever GPU you have handy.
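>
> In other words, something along these lines (the paths are
> placeholders for wherever your source tree and Adrian's attachment
> happen to live):
>
>     cp ~/Downloads/CudaConfig.cmake amber20_src/cmake/
>     cd amber20_src/build && ./run_cmake    # your usual GPU configuration
>     make install && make test.cuda.serial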
>
> This doesn't require any access to gitlab.  And, if a failure
> occurs, at least that person has a machine on which debugging
> might be done.  If no failures occur, *then* we can start to point
> the finger at the CI configuration, or maybe something specific to
> 780Ti cards.
>
> My frustration is this: we shouldn't be relying on
> Jason/Jenkins/CI to be testing things related to GPU problems.
> There are dozens of Amber developers who could try this out and
> report what they find.  (I know I am one of them, but I'm already
> spending hours every day on Amber-related business.)
>
>
> Various comments from the email thread that might be helpful:
>
> >>
> >> It looks like the only change in CudaConfig.cmake is really
> >> switching from using "compute_70,sm_70" to "compute_60,sm_70".
> >> That was a topic we discussed at length. But it's unclear
> >> whether that's the reason for this test failure.
> >>
> >> Can we have a version that keeps EVERYTHING the same as before
> >> for Amber20, but just adds support for Ampere/cuda11 cleanly?
> >>
> >> On 7/14/20 6:37 PM, Peng Wang wrote:
> >>
> >> In any case, it should work if she just replaces the attached
> >> file with the master-branch version.
>
> What I want to do is replace "should work" with "works for me
> on this GPU".
>
> ....dac
>
-- 
Dr. Adrian E. Roitberg
V.T. and Louise Jackson Professor in Chemistry
Department of Chemistry
University of Florida
roitberg.ufl.edu
352-392-6972
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Wed Jul 22 2020 - 13:30:03 PDT