Re: [AMBER-Developers] Nvidia DGX A100

From: Scott Brozell <sbrozell.iqb.rutgers.edu>
Date: Thu, 23 Jul 2020 03:03:07 -0400

Hi,

I copied Adrian's new CudaConfig.cmake file into
amber20-with-patches/cmake/CudaConfig.cmake
built on three platforms and ran "make test.cuda.serial".

It worked on Tesla K40 gnu/7.3.0 cuda/9.2.88 rhels 6.10
804 file comparisons passed
3 file comparisons failed (3 of which can be ignored)
...
249 file comparisons passed.

It did not work on Volta V100 gnu/8.1.0 cuda/10.2.89 rhels 7.7
804 file comparisons passed
3 file comparisons failed (3 of which can be ignored)
...
243 file comparisons passed
6 file comparisons failed (1 of which can be ignored)
0 tests experienced errors
possible FAILURE: check irest1_ntt0_igb7_ntc2.out.dif
possible FAILURE: check irest1_ntt0_igb8_ntc2.out.dif
possible FAILURE: check myoglobin_md_igb7.out.dif
possible FAILURE: check myoglobin_md_igb8.out.dif
possible FAILURE: check myoglobin_md_igb8_gbsa.out.dif
possible FAILURE: (ignored) check myoglobin_md_igb8_gbsa3.out.dif
All 6 failures look significant.

It did not work on Pascal P100 gnu/8.4.0 cuda/10.2.89 rhels 7.7
where disaster ensued early on in the AT tests:
Error: No CUDA-capable devices present.
0 file comparisons passed
3 file comparisons failed (3 of which can be ignored)
1 tests experienced errors
This was reproduced several times on different nodes where the GPUs
showed no obvious problems in a cursory check by me, and this
cluster is heavily used and monitored.
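For anyone following along, the SM70FLAGS logic at issue in the replies below can be sketched as the following CMake fragment (a paraphrase for illustration, not the actual cmake/CudaConfig.cmake, which has more branches):

```cmake
# Sketch only: paraphrases the SM70FLAGS selection discussed in this
# thread; the real cmake/CudaConfig.cmake handles more CUDA versions.

# master default: compile Volta (sm_70) code from the compute_60
# virtual architecture.
set(SM70FLAGS -gencode arch=compute_60,code=sm_70)

if(${CUDA_VERSION} VERSION_EQUAL 11.0)
  # Adrian's file also takes this branch for CUDA 10.0, 10.1, and 10.2,
  # pairing sm_70 with its native compute_70 virtual architecture.
  set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
endif()

# The per-architecture flags are later appended to the nvcc flags.
list(APPEND CUDA_NVCC_FLAGS ${SM70FLAGS})
```

Whether the sm_70 binary is generated from compute_60 or compute_70 PTX is exactly the difference being tested in the results above.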

scott


On Wed, Jul 22, 2020 at 03:03:41PM -0400, David A Case wrote:
> On Wed, Jul 22, 2020, Jason Swails wrote:
> >The docker image used for Amber GPU builds is defined here:
>
> BUT: I haven't seen anyone say something like "I took the released
> version, added Adrian's new CudaConfig.cmake file (attached here), ran
> "make test.cuda.serial" on my system, and it works." Pick whatever GPU
> you have handy.

On Wed, Jul 22, 2020 at 04:21:43PM -0400, Adrian Roitberg wrote:
>
> For now, sorry for the noise. Peng, Jon and I are discussing this over a
> different channel and will try some things and get back to the group.
>
> On 7/22/20 4:18 PM, David Cerutti wrote:
> >
> > So I've looked over Adrian's CudaConfig.cmake, and the difference
> > between it and the master branch CudaConfig.cmake I have already
> > tested is that $SM70FLAGS gets reset if the CUDA compiler is version
> > 10.0, 10.1, or 10.2. In master, we have:
> >
> > set(SM70FLAGS -gencode arch=compute_60,code=sm_70)
> > (... begin an if / elseif / elseif case switch ...)
> > elseif(${CUDA_VERSION} VERSION_EQUAL 11.0)
> >     set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
> > The code then appends this and other SM##FLAGS variables to
> > CUDA_NVCC_FLAGS. In Adrian's CudaConfig.cmake, the
> >
> > set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
> >
> > is happening if $CUDA_VERSION is 10.0, 10.1, or 10.2, as well as if
> > $CUDA_VERSION is 11.0. It seems to me that the only thing we'd learn
> > by retesting the master branch with this CudaConfig.cmake is that the
> > way we used to be doing $SM70FLAGS was safe for Volta and will be safe
> > again. Curious why Adrian's file doesn't also do this for CUDA 9.0 / 9.1.
> >
> > Dave
> >
> >
> > On Wed, Jul 22, 2020 at 3:11 PM Scott Le Grand <varelse2005.gmail.com> wrote:
> >
> > Okay, I am neck-deep in fixing basic failures that a debug build
> > would have exposed. It came down to three lines of code overriding
> > the behavior of nvcc illegally and incorrectly; don't ever do it
> > again. Seriously, do not override compiler macros with your idea
> > of how things should be, no matter how hard you have convinced
> > yourself it's a good idea, because it's not a good idea.
> >
> > But if you hand me a specific test, not a log of a bunch of tests,
> > I can hyper-focus on that specific test with a debug build and
> > figure it out. But I was hoping that maybe, just maybe, some of that
> > work could be delegated to someone else in the Amber community, and
> > then I could provide an assist.
> >
> > We should all be building a debug build of this application
> > occasionally because it will reveal all sorts of stuff that fails
> > invisibly in the release builds.
> >
> > I currently do not have access to Ampere hardware. I am working on
> > that, and there are people trying to help me change that situation,
> > but the University of Florida has infinitely more Ampere hardware
> > than I do at the moment. Come on, guys, we need to own our $h!+ here.
> >
> > On Wed, Jul 22, 2020, 12:07 David A Case <david.case.rutgers.edu> wrote:
> >
> > On Wed, Jul 22, 2020, Jason Swails wrote:
> > >
> > >The docker image used for Amber GPU builds is defined here:
> >
> > This is a long email thread, and I understand people saying
> > things like: "we don't have enough information to say why this
> > test or that test is failing on the Jenkins/CI server."
> >
> > BUT: I haven't seen anyone say something like "I took the
> > released version, added Adrian's new CudaConfig.cmake file
> > (attached here), ran "make test.cuda.serial" on my system, and
> > it works." Pick whatever GPU you have handy.
> >
> > This doesn't require any access to gitlab. And, if a failure
> > occurs, at least that person has a machine on which debugging
> > might be done. If no failures occur, *then* we can start to
> > point the finger at the CI configuration, or maybe something
> > specific to 780Ti cards.
> >
> > My frustration is this: we shouldn't be relying on
> > Jason/Jenkins/CI to be testing things related to GPU problems.
> > There are dozens of Amber developers who could try this out,
> > and report what they find. (I know I am one of them, but I'm
> > already spending hours every day on Amber-related business.)
> >
> >
> > Various comments from the email thread that might be helpful:
> >
> > >>
> > >> It looks like the only change in CudaConfig.cmake is really
> > >> switching from using "compute_70,sm_70" to "compute_60,sm_70".
> > >> That was a topic we discussed at length. But it's unclear
> > >> whether that's the reason for this test failure.
> > >>
> > >> Can we have a version that keeps EVERYTHING the same as
> > >> before from Amber20, but just adds support for Ampere/cuda11
> > >> cleanly?
> > >>
> > >> On 7/14/20 6:37 PM, Peng Wang wrote:
> > >>
> > >> In any case, it should work if she just replaces the
> > >> attached file with the master branch version.
> >
> > What I want to do is replace "should work" with "works for me
> > on this GPU".

_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Thu Jul 23 2020 - 00:30:03 PDT