Re: [AMBER-Developers] Nvidia DGX A100

From: Scott Brozell <sbrozell.iqb.rutgers.edu>
Date: Fri, 24 Jul 2020 18:00:07 -0400

Hi,

Here's a snippet from the head of
logs/test_amber_cuda/2020-07-23_21-36-37.log
Running CUDA Implicit solvent tests.
  Precision Model = DPFP
------------------------------------
cd gb_ala3/ && ./Run.igb1_ntc1_min DPFP yes
cudaMemcpyToSymbol: SetSim copy to cSim failed invalid device symbol
  ./Run.igb1_ntc1_min: Program error

Unfortunately, this is the only clue left, since no .dif files are
produced when a test terminates with an error. In addition, this was run
in batch, where my script didn't save anything but logs/.
I'll rectify that and resubmit.
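
For what it's worth, that cudaMemcpyToSymbol failure looks like what you
get when the pmemd.cuda binary contains no device code matching the GPU's
compute capability, which would fit the missing SM_60 code-generation flag
suggested below. As a rough sketch only (the SM60FLAGS name and the append
call are guesses modeled on the SM70FLAGS lines quoted further down, not
the actual CudaConfig.cmake contents), an explicit Pascal entry would look
something like:

    set(SM60FLAGS -gencode arch=compute_60,code=sm_60)
    # ... later, appended alongside the other SM##FLAGS variables ...
    list(APPEND CUDA_NVCC_FLAGS ${SM60FLAGS})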

scott

On Fri, Jul 24, 2020 at 09:49:57PM +0000, Scott Le Grand wrote:
> The good news is that something must be absolutely, fundamentally broken
> on P100. Why not look at one of the mdouts and see where it's failing?
> I'm guessing it fails pretty early... Maybe the cmake doesn't have a true
> SM_60 code-generation flag set?
>
> -----Original Message-----
> From: Scott Brozell <sbrozell.iqb.rutgers.edu>
> Sent: Friday, July 24, 2020 2:41 PM
> To: AMBER Developers Mailing List <amber-developers.ambermd.org>; David Cerutti <dscerutti.gmail.com>; Scott Le Grand <varelse2005.gmail.com>; Jason Swails <jason.swails.gmail.com>; Jonathan Lefman <jlefman.nvidia.com>; David A Case <david.case.rutgers.edu>; Scott Le Grand <slegrand.nvidia.com>; Peng Wang <penwang.nvidia.com>
> Subject: Re: [AMBER-Developers] Nvidia DGX A100
>
>
>
> Hi,
>
> In the face of silence, but since I had the jobs queued, FWIW:
> I built the master branch via the legacy mechanism.
> My conclusion is that the cmake build on Volta V100 actually 'worked',
> since the same diffs were obtained with the legacy-built pmemd.cuda executables.
>
> However, on the Pascal P100 the disaster was confirmed:
> cmake build; ran the Amber tests, i.e., cd test; make test.cuda.serial
> 0 file comparisons passed
> 0 file comparisons failed
> 202 tests experienced errors
>
> legacy build; ran the Amber tests, i.e., cd test; make test.cuda.serial
> 242 file comparisons passed
> 7 file comparisons failed (2 of which can be ignored)
> 0 tests experienced errors
> Same failures as listed below for Volta, with the addition of:
> possible FAILURE: (ignored) check mdout.ramd.dif
>
> Happy weekend,
> scott
>
>
> On Thu, Jul 23, 2020 at 03:03:07AM -0400, Scott Brozell wrote:
> > I copied Adrian's new CudaConfig.cmake file into
> > amber20-with-patches/cmake/CudaConfig.cmake,
> > built on three platforms, and ran "make test.cuda.serial".
> >
> > It worked on Tesla K40 gnu/7.3.0 cuda/9.2.88 rhels 6.10
> > 804 file comparisons passed
> > 3 file comparisons failed (3 of which can be ignored) ...
> > 249 file comparisons passed.
> >
> > It did not work on Volta V100 gnu/8.1.0 cuda/10.2.89 rhels 7.7
> > 804 file comparisons passed
> > 3 file comparisons failed (3 of which can be ignored) ...
> > 243 file comparisons passed
> > 6 file comparisons failed (1 of which can be ignored)
> > 0 tests experienced errors
> > possible FAILURE: check irest1_ntt0_igb7_ntc2.out.dif
> > possible FAILURE: check irest1_ntt0_igb8_ntc2.out.dif
> > possible FAILURE: check myoglobin_md_igb7.out.dif
> > possible FAILURE: check myoglobin_md_igb8.out.dif
> > possible FAILURE: check myoglobin_md_igb8_gbsa.out.dif
> > possible FAILURE: (ignored) check myoglobin_md_igb8_gbsa3.out.dif
> > All 6 failures look significant.
> >
> > It did not work on Pascal P100 gnu/8.4.0 cuda/10.2.89 rhels 7.7 where
> > disaster ensued early on in the AT tests:
> > Error: No CUDA-capable devices present.
> > 0 file comparisons passed
> > 3 file comparisons failed (3 of which can be ignored)
> > 1 tests experienced errors
> > This was reproduced several times on different nodes, where the GPUs
> > showed no obvious problems in a non-thorough check by me, and this
> > cluster is heavily used and monitored.
> >
> > scott
> >
> >
> > On Wed, Jul 22, 2020 at 03:03:41PM -0400, David A Case wrote:
> > > On Wed, Jul 22, 2020, Jason Swails wrote:
> > > >The docker image used for Amber GPU builds is defined here:
> > >
> > > BUT: I haven't seen anyone say something like "I took the released
> > > version, added Adrian's new CudaConfig.cmake file (attached here),
> > > ran "make test.cuda.serial" on my system, and it works." Pick
> > > whatever GPU you have handy.
> >
> > On Wed, Jul 22, 2020 at 04:21:43PM -0400, Adrian Roitberg wrote:
> > >
> > > For now, sorry for the noise. Peng, Jon and I are discussing this
> > > over a different channel and will try some things and get back to the group.
> > >
> > > On 7/22/20 4:18 PM, David Cerutti wrote:
> > > >
> > > > So I've looked over Adrian's CudaConfig.cmake, and the difference
> > > > between it and the master branch CudaConfig.cmake I have already
> > > > tested is that $SM70FLAGS gets reset if the CUDA compiler is
> > > > version 10.0, 10.1, or 10.2. In master, we have:
> > > >
> > > > set(SM70FLAGS -gencode arch=compute_60,code=sm_70)
> > > > (... begin an if / elseif / elseif case switch ...)
> > > > elseif(${CUDA_VERSION} VERSION_EQUAL 11.0)
> > > >   set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
> > > >
> > > > The code then appends this and the other SM##FLAGS variables to
> > > > CUDA_NVCC_FLAGS. In Adrian's CudaConfig.cmake, the
> > > >
> > > > set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
> > > >
> > > > happens if $CUDA_VERSION is 10.0, 10.1, or 10.2, as well as if
> > > > $CUDA_VERSION is 11.0. It seems to me that the only thing we'd learn
> > > > by retesting the master branch with this CudaConfig.cmake is that
> > > > the way we used to be doing $SM70FLAGS was safe for Volta and will
> > > > be safe again. Curious why Adrian's file doesn't also do this for
> > > > CUDA 9.0 / 9.1.
> > > >
> > > > Dave
> > > >
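
Spelled out as a self-contained fragment, Adrian's variant amounts to
roughly the following. This is pieced together from Dave's description
above, not copied from either file; the real CudaConfig.cmake uses an
if/elseif chain and may differ in detail:

    set(SM70FLAGS -gencode arch=compute_60,code=sm_70)
    if(${CUDA_VERSION} VERSION_EQUAL 10.0 OR
       ${CUDA_VERSION} VERSION_EQUAL 10.1 OR
       ${CUDA_VERSION} VERSION_EQUAL 10.2 OR
       ${CUDA_VERSION} VERSION_EQUAL 11.0)
      # the change: use the native Volta virtual architecture here as well
      set(SM70FLAGS -gencode arch=compute_70,code=sm_70)
    endif()
    # the SM##FLAGS variables are later appended to the nvcc flags
    list(APPEND CUDA_NVCC_FLAGS ${SM70FLAGS})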
> > > >
> > > > On Wed, Jul 22, 2020 at 3:11 PM Scott Le Grand
> > > > <varelse2005.gmail.com> wrote:
> > > >
> > > > Okay, I am neck-deep in fixing basic failures that a debug build
> > > > would have exposed. It came down to three lines of code overriding
> > > > the behavior of nvcc illegally and incorrectly. Don't ever do that
> > > > again. Seriously, do not override compiler macros with your idea of
> > > > how things should be, no matter how hard you have convinced yourself
> > > > it's a good idea, because it's not a good idea.
> > > >
> > > > But if you hand me a specific test, not a log of a bunch of tests,
> > > > I can hyper-focus on that specific test with a debug build and
> > > > figure it out. I was hoping, though, that maybe, just maybe, some of
> > > > that work could be delegated to someone else in the Amber community,
> > > > and then I could provide an assist.
> > > >
> > > > We should all be building a debug build of this application
> > > > occasionally because it will reveal all sorts of stuff that fails
> > > > invisibly in the release builds.
> > > >
> > > > I currently do not have access to Ampere hardware. I am working on
> > > > that, and there are people trying to help me change that situation,
> > > > but the University of Florida has infinitely more Ampere hardware
> > > > than I do at the moment. Come on, guys, we need to own our $h!+ here.
> > > >
> > > > On Wed, Jul 22, 2020, 12:07 David A Case
> > > > <david.case.rutgers.edu> wrote:
> > > >
> > > > On Wed, Jul 22, 2020, Jason Swails wrote:
> > > > >
> > > > >The docker image used for Amber GPU builds is defined here:
> > > >
> > > > This is a long email thread, and I understand people saying things
> > > > like: "we don't have enough information to say why this test or that
> > > > test is failing on the Jenkins/CI server."
> > > >
> > > > BUT: I haven't seen anyone say something like "I took the released
> > > > version, added Adrian's new CudaConfig.cmake file (attached here),
> > > > ran "make test.cuda.serial" on my system, and it works." Pick
> > > > whatever GPU you have handy.
> > > >
> > > > This doesn't require any access to gitlab. And, if a failure occurs,
> > > > at least that person has a machine on which debugging might be done.
> > > > If no failures occur, *then* we can start to point the finger at the
> > > > CI configuration, or maybe something specific to 780Ti cards.
> > > >
> > > > My frustration is this: we shouldn't be relying on Jason/Jenkins/CI
> > > > to be testing things related to GPU problems. There are dozens of
> > > > Amber developers who could try this out and report what they find.
> > > > (I know I am one of them, but I'm already spending hours every day
> > > > on Amber-related business.)
> > > >
> > > >
> > > > Various comments from the email thread that might be helpful:
> > > >
> > > > >>
> > > > >> It looks like the only change in CudaConfig.cmake is really
> > > > >> switching from using 'compute_70,sm_70' to 'compute_60,sm_70'.
> > > > >> That was a topic we discussed at length. But it's unclear whether
> > > > >> that's the reason for this test failure.
> > > > >>
> > > > >> Can we have a version that keeps EVERYTHING the same as before
> > > > >> from Amber20, but just adds support for Ampere/cuda11 cleanly?
> > > > >>
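For reference, the kind of clean addition being asked for there would
presumably be a new Ampere entry guarded by the CUDA version, with the
existing SM##FLAGS left untouched. A sketch only; the SM80FLAGS name and
the exact guard are guesses, not the actual file:

    if(${CUDA_VERSION} VERSION_GREATER_EQUAL 11.0)
      set(SM80FLAGS -gencode arch=compute_80,code=sm_80)
      list(APPEND CUDA_NVCC_FLAGS ${SM80FLAGS})
    endif()
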
> > > > >> On 7/14/20 6:37 PM, Peng Wang wrote:
> > > > >>
> > > > >> In any case, it should work if she just replaces the attached
> > > > >> file with the master branch version.
> > > >
> > > > What I want to do is replace "should work" with "works for me
> > > > on this GPU".

_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Jul 24 2020 - 15:30:03 PDT