[AMBER-Developers] Random crashes AMBER 18 on GPUs

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 14 Jun 2018 16:13:29 -0400

Hi All,

I keep seeing failures with AMBER 18 when running GPU validation tests. Occasionally the large tests, with or without P2P, crash. I initially thought this was an issue with the GPUs themselves, but I have now seen it often enough that I think there is a problem in the AMBER 18 GPU code itself. Here's a GPU test suite to try:

https://www.dropbox.com/s/nbxzbr8a9at9uk8/GPU_Validation_Test.tar.gz

Running this on 4 x 1080TIs, I see the following:

GPU_0.log through GPU_4.log all give the same answers over repeated runs.

GPU.large_?.log gives:

0.0: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.1: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.2: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.3: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.4: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.5: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.6: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.7: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.8: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.9: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.0: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.1: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.2: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.3: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.4: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.5: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.6: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.7: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.8: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
1.9: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.0: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.1: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.2: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.3: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.4: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.5: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.6: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.7: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.8: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.9: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.0: 3.1: 3.2: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.3: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.4: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.5: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.6: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.7: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.8: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
3.9: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621


Checking the output from tests 3.0 and 3.1 shows they crash immediately after printing the results header in the output file. STDERR contains:

cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
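For what it's worth, the place these errors show up is consistent with CUDA's asynchronous error reporting: the faulting kernel launch itself returns success, and the illegal access is only reported at the next synchronizing call, which here is the device-to-host copy in GpuBuffer::Download. A toy sketch (my own example, not AMBER code) of that behaviour:

// Toy example: an out-of-bounds write inside a kernel is not reported by the
// launch itself; it surfaces at the next synchronizing call, e.g. the
// device-to-host copy -- the same place GpuBuffer::Download fails above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void oob_write(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] = 1.0f;   // missing "if (i < n)" bounds check: extra threads write past the allocation
}

int main()
{
    const int n = 1024;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));

    // Launch far more threads than elements so many of them write out of bounds.
    oob_write<<<4096, 256>>>(d_buf, n);
    printf("launch:   %s\n", cudaGetErrorString(cudaGetLastError()));  // usually "no error"

    // The fault is reported here, at the first synchronizing call after the kernel.
    float h_buf[n];
    cudaError_t err = cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("download: %s\n", cudaGetErrorString(err));  // typically "an illegal memory access was encountered"

    cudaFree(d_buf);
    return 0;
}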

A similar issue occurs in GPU.large_P2P_1.log:

[root.c103623 GPU_Validation_Test]# cat GPU.large_P2P_1.log
1.0: 1.1: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.2: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.3: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.4: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.5: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.6: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.7: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.8: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.9: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845

This time stderr contains:

gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
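The cudaDeviceSynchronize inside gpu_allreduce is most likely just where the sticky error gets noticed, not where the bad access actually happens. For bisecting this, a debug build could synchronize and check immediately after each suspect kernel launch so the fault is attributed to the right kernel; a rough sketch of the pattern (the helper name is mine, not an existing AMBER routine):

// Toy example: check each suspect launch right away instead of waiting for the
// sticky error to appear later at an unrelated synchronization point.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical debug helper -- not part of pmemd.cuda.
static void check_launch(const char *tag)
{
    cudaError_t launch_err = cudaGetLastError();       // errors from the launch itself
    cudaError_t sync_err   = cudaDeviceSynchronize();  // errors from the kernel's execution
    if (launch_err != cudaSuccess || sync_err != cudaSuccess) {
        fprintf(stderr, "%s: launch=%s sync=%s\n", tag,
                cudaGetErrorString(launch_err), cudaGetErrorString(sync_err));
    }
}

__global__ void some_kernel(float *buf) { buf[threadIdx.x] = 0.0f; }

int main()
{
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(float));

    some_kernel<<<1, 256>>>(d_buf);
    check_launch("some_kernel");   // any fault is reported against this kernel, not a later allreduce

    cudaFree(d_buf);
    return 0;
}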

I've seen this happen on multiple machines now with different drivers, although all of them used CUDA 9.1 and 1080TIs. I have not had a chance to vary the GPU model yet.

It is not always the same GPU, and not always the same test number, that fails. For example, on a repeat run:

2.0: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.1: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.2: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.3: 2.4: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.5: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.6: 2.7: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.8: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
2.9: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621

It still fails with the same errors, though:

cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered

I have tried multiple driver branches and they all show similar issues. AMBER 16 running on the same machines does not show this problem.

Repeating on an additional machine with 2 x 1080TIs gives:

0.2: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.3: 0.4: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621
0.5: Etot = -2709883.4871 EKtot = 662542.8750 EPtot = -3372426.3621

1.3: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.4: 1.5: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845
1.6: Etot = -2707218.6220 EKtot = 663143.0625 EPtot = -3370361.6845

I suspect an array out-of-bounds access was introduced in AMBER 18, but it only crashes when the memory layout happens to put the stray access outside the allocated range. Can some of you please try the validation suite on your machines with AMBER 18 and see if you see similar issues?

https://www.dropbox.com/s/nbxzbr8a9at9uk8/GPU_Validation_Test.tar.gz
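Running one of the failing cases under cuda-memcheck should flag the out-of-bounds access deterministically, even on runs that happen not to crash, because cudaMalloc allocations are usually padded and a small overflow normally stays inside mapped memory. Here is a toy illustration (not AMBER code) of why such a bug only crashes for some memory layouts:

// Toy example: a small write past the end of a cudaMalloc'ed buffer often goes
// unnoticed because the allocation is padded/rounded up; it only turns into an
// "illegal memory access" when the layout puts the overflow in unmapped memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void write_past_end(float *buf, int n, int overshoot)
{
    // Deliberately write a few elements beyond the requested size.
    if (threadIdx.x < overshoot) buf[n + threadIdx.x] = 42.0f;
}

int main()
{
    const int n = 1000;                        // request 1000 floats...
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));

    write_past_end<<<1, 32>>>(d_buf, n, 32);   // ...then write 32 floats past the end
    cudaError_t err = cudaDeviceSynchronize();

    // On most setups this prints "no error": the overflow stayed inside the
    // padded allocation, just like the runs above that happen to "pass".
    printf("result: %s\n", cudaGetErrorString(err));

    cudaFree(d_buf);
    return 0;
}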

Thanks,

All the best
Ross

_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Thu Jun 14 2018 - 13:30:04 PDT