Re: [AMBER-Developers] Nvidia DGX A100 from Scott Brozell on 2020-07-24 (Amber Developers Archive Jul 2020)

From: Scott Brozell <sbrozell.iqb.rutgers.edu>
Date: Fri, 24 Jul 2020 17:40:51 -0400

Hi,

In the face of silence, and since I had the jobs queued anyway, FWIW:
I built in the master branch via the legacy mechanism.
My conclusion is that the cmake build on Volta V100 actually 'worked',
since the legacy-built pmemd.cuda executables produced the same diffs.

However, on the Pascal P100 the disaster was confirmed:
cmake build; ran amber tests, i.e., cd test; make test.cuda.serial
0 file comparisons passed
0 file comparisons failed
202 tests experienced errors

legacy build; ran amber tests, i.e., cd test; make test.cuda.serial
242 file comparisons passed
7 file comparisons failed (2 of which can be ignored)
0 tests experienced errors
Same failures as listed below for Volta, with the addition of
possible FAILURE: (ignored) check mdout.ramd.dif
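For anyone who wants to compare these summaries mechanically rather than by eye, here is a minimal Python sketch. It assumes the three summary lines have exactly the wording shown above and handles one summary block at a time. The names parse_summary, SUMMARY_RE and KEYS are made up for illustration and are not part of the Amber test harness.

```python
import re

# Matches lines such as "242 file comparisons passed" or
# "7 file comparisons failed (2 of which can be ignored)".
SUMMARY_RE = re.compile(
    r"(\d+) (file comparisons passed|file comparisons failed|tests experienced errors)"
)

# Short keys for the three kinds of summary line (hypothetical naming).
KEYS = {
    "file comparisons passed": "passed",
    "file comparisons failed": "failed",
    "tests experienced errors": "errors",
}


def parse_summary(text):
    """Return counts from one summary block; leading "> " quote markers are stripped."""
    counts = {"passed": 0, "failed": 0, "errors": 0}
    for line in text.splitlines():
        match = SUMMARY_RE.match(line.lstrip("> "))
        if match:
            counts[KEYS[match.group(2)]] = int(match.group(1))
    return counts


# The two P100 summaries reported in this message.
cmake_p100 = parse_summary(
    "0 file comparisons passed\n"
    "0 file comparisons failed\n"
    "202 tests experienced errors\n"
)
legacy_p100 = parse_summary(
    "242 file comparisons passed\n"
    "7 file comparisons failed (2 of which can be ignored)\n"
    "0 tests experienced errors\n"
)

for name, counts in (("cmake", cmake_p100), ("legacy", legacy_p100)):
    print(name, counts)
```

The count of ignorable failures is deliberately left out of this sketch; only the three headline numbers are compared.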

Happy weekend,
scott

On Thu, Jul 23, 2020 at 03:03:07AM -0400, Scott Brozell wrote:
> I copied Adrian's new CudaConfig.cmake file into
> amber20-with-patches/cmake/CudaConfig.cmake,
> built on three platforms, and ran "make test.cuda.serial".
>
> It worked on Tesla K40 gnu/7.3.0 cuda/9.2.88 rhels 6.10
> 804 file comparisons passed
> 3 file comparisons failed (3 of which can be ignored)
> ...
> 249 file comparisons passed.
>
> It did not work on Volta V100 gnu/8.1.0 cuda/10.2.89 rhels 7.7
> 804 file comparisons passed
> 3 file comparisons failed (3 of which can be ignored)
> ...
> 243 file comparisons passed
> 6 file comparisons failed (1 of which can be ignored)
> 0 tests experienced errors
> possible FAILURE: check irest1_ntt0_igb7_ntc2.out.dif
> possible FAILURE: check irest1_ntt0_igb8_ntc2.out.dif
> possible FAILURE: check myoglobin_md_igb7.out.dif
> possible FAILURE: check myoglobin_md_igb8.out.dif
> possible FAILURE: check myoglobin_md_igb8_gbsa.out.dif
> possible FAILURE: (ignored) check myoglobin_md_igb8_gbsa3.out.dif
> All 6 failures look significant.
>
> It did not work on Pascal P100 gnu/8.4.0 cuda/10.2.89 rhels 7.7
> where disaster ensued early on in the AT tests:
> Error: No CUDA-capable devices present.
> 0 file comparisons passed
> 3 file comparisons failed (3 of which can be ignored)
> 1 tests experienced errors
> This was reproduced several times on different nodes, where the GPUs
> showed no obvious problems in my non-thorough check, and this
> cluster is heavily used and monitored.
>
> scott
>
>
> On Wed, Jul 22, 2020 at 03:03:41PM -0400, David A Case wrote:
> > On Wed, Jul 22, 2020, Jason Swails wrote:
> > >The docker image used for Amber GPU builds is defined here:
> >
> > BUT: I haven't seen anyone say something like "I took the released
> > version, added Adrian's new CudaConfig.cmake file (attached here), ran
> > "make test.cuda.serial" on my system, and it works." Pick whatever GPU you
> > have handy.
>
> On Wed, Jul 22, 2020 at 04:21:43PM -0400, Adrian Roitberg wrote:
> >
> > For now, sorry for the noise. Peng, Jon and I are discussing this over a
> > different channel and will try some things and get back to the group.
> >
> > On 7/22/20 4:18 PM, David Cerutti wrote:
> > > > So I've looked over Adrian's CudaConfig.cmake and the difference
> > > > between it and the master branch CudaConfig.cmake I have already
> > > > tested is that $SM70FLAGS gets reset if the CUDA compiler is version
> > > > 10.0, 10.1, or 10.2. In master, we have:
> > > >
> > > > set(SM70FLAGS -gencode arch=compute_60,code=sm_70)
> > > > (... begin an if / elseif / elseif case switch ...)
> > > > elseif(${CUDA_VERSION} VERSION_EQUAL 11.0)
> > > >     set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
> > > >
> > > > The code then appends this and other SM##FLAGS variables to
> > > > CUDA_NVCC_FLAGS. In Adrian's CudaConfig.cmake, the
> > >
> > > set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
> > >
> > > > is happening if $CUDA_VERSION is 10.0, 10.1, or 10.2, as well as if
> > > > $CUDA_VERSION is 11.0. It seems to me that the only thing we'd learn
> > > > by retesting the master branch with this CudaConfig.cmake is that the
> > > > way we used to be doing $SM70FLAGS was safe for Volta and will be safe
> > > > again. Curious why Adrian's file doesn't also do this for CUDA 9.0 / 9.1.
> > >
> > > Dave
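To make the difference described above concrete, here is a small Python model of the two SM70FLAGS selections. It is only a sketch of what Dave describes, not the contents of either CudaConfig.cmake: the other branches of the if/elseif chain are not modeled, and the assumption that any version not named falls back to the compute_60 default comes from his description rather than from the files themselves. The names sm70_flags, MASTER_NATIVE and ADRIAN_NATIVE are made up for illustration.

```python
# The two flag strings quoted in the message above.
DEFAULT_SM70 = "-gencode arch=compute_60,code=sm_70"
NATIVE_SM70 = "-gencode arch=compute_70,code=sm_70"

# CUDA versions for which each variant resets SM70FLAGS to compute_70,
# as described in the thread (any other branches are not modeled here).
MASTER_NATIVE = {"11.0"}
ADRIAN_NATIVE = {"10.0", "10.1", "10.2", "11.0"}


def sm70_flags(cuda_version, native_versions):
    """Return the modeled SM70FLAGS value for a CUDA version string.

    Assumption: only major.minor matters, so "10.2.89" is treated as "10.2".
    """
    major_minor = ".".join(cuda_version.split(".")[:2])
    if major_minor in native_versions:
        return NATIVE_SM70
    return DEFAULT_SM70


for version in ("9.2.88", "10.2.89", "11.0"):
    master = sm70_flags(version, MASTER_NATIVE)
    adrian = sm70_flags(version, ADRIAN_NATIVE)
    verdict = "same" if master == adrian else "differ"
    print(f"CUDA {version}: {verdict}")
    print(f"  master: {master}")
    print(f"  Adrian: {adrian}")
```

In this model the two files disagree only for CUDA 10.0 through 10.2, which covers the V100 and P100 runs reported with cuda/10.2.89.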
> > >
> > >
> > > > On Wed, Jul 22, 2020 at 3:11 PM Scott Le Grand <varelse2005.gmail.com> wrote:
> > >
> > > > Okay, I am neck-deep in fixing basic failures that a debug build
> > > > would have exposed. It came down to three lines of code overriding
> > > > the behavior of nvcc illegally and incorrectly. Don't ever do
> > > > it again. Seriously, do not override compiler macros with your idea
> > > > of how things should be, no matter how hard you have convinced
> > > > yourself it's a good idea, because it's not a good idea.
> > >
> > > > But if you hand me a specific test, not a log of a bunch of tests,
> > > > I can hyper-focus on that specific test with a debug build and
> > > > figure it out. I was hoping that maybe, just maybe, some of that work
> > > > could be delegated to someone else in the Amber community, and then
> > > > I could provide an assist.
> > >
> > > We should all be building a debug build of this application
> > > occasionally because it will reveal all sorts of stuff that fails
> > > invisibly in the release builds.
> > >
> > > > I currently do not have access to Ampere hardware. I am working on
> > > > that, and there are people trying to help me change that situation,
> > > > but the University of Florida has infinitely more Ampere hardware
> > > > than I do at the moment. Come on guys, we need to own our $h!+ here.
> > >
> > > > On Wed, Jul 22, 2020, 12:07 David A Case <david.case.rutgers.edu> wrote:
> > >
> > > On Wed, Jul 22, 2020, Jason Swails wrote:
> > > >
> > > >The docker image used for Amber GPU builds is defined here:
> > >
> > > > This is a long email thread, and I understand people saying
> > > > things like: "we don't have enough information to say why this
> > > > test or that test is failing on the Jenkins/CI server."
> > > >
> > > > BUT: I haven't seen anyone say something like "I took the released
> > > > version, added Adrian's new CudaConfig.cmake file (attached here),
> > > > ran "make test.cuda.serial" on my system, and it works." Pick
> > > > whatever GPU you have handy.
> > > >
> > > > This doesn't require any access to gitlab. And, if a failure
> > > > occurs, at least that person has a machine on which debugging
> > > > might be done. If no failures occur, *then* we can start to point
> > > > the finger at the CI configuration, or maybe something specific to
> > > > 780Ti cards.
> > > >
> > > > My frustration is this: we shouldn't be relying on Jason/Jenkins/CI
> > > > to be testing things related to GPU problems. There are dozens of
> > > > Amber developers who could try this out, and report what they find.
> > > > (I know I am one of them, but I'm already spending hours every day
> > > > on Amber-related business.)
> > >
> > >
> > > Various comments from the email thread that might be helpful:
> > >
> > > >>
> > > > >> It looks like the only change in CudaConfig.cmake is really
> > > > >> switching from using "compute70,sm70" to "compute_60,sm70".
> > > > >> That was a topic we discussed at length. But it's unclear
> > > > >> whether that's the reason for this test failure.
> > > > >>
> > > > >> Can we have a version that keeps EVERYTHING the same as before
> > > > >> from Amber20, but just adds support for Ampere/cuda11 cleanly?
> > > > >>
> > > > >> On 7/14/20 6:37 PM, Peng Wang wrote:
> > > > >>
> > > > >> In any case, it should work if she just replaces the attached
> > > > >> file with the master branch version.
> > > >
> > > > What I want to do is replace "should work" with "works for me on
> > > > this GPU".
>

_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Jul 24 2020 - 15:00:04 PDT