Yes, this is a LONG-standing bug in the NVIDIA drivers - I think there is
an actual NVIDIA bug filed for it, but I can't recall the ID right now.
Essentially it is a flaw in the way they implement unified memory -
although it's really six of one and half a dozen of the other, since it is
a workaround for something that I think is problematic in the Linux kernel
itself. Ultimately they have to allocate huge amounts of virtual memory -
however, it is instantly paged out (never actually resident), so it is not
an issue in practice.
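
If anyone wants to see the effect for themselves, here is a minimal,
untested sketch (my own illustration, nothing Amber-specific) that just
creates a CUDA context and then prints VmSize versus VmRSS from
/proc/self/status - the virtual size jumps by tens of GB while the
resident set barely moves:

// vsz_vs_rss.cu - build with "nvcc vsz_vs_rss.cu -o vsz_vs_rss"
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Print VmSize (virtual) and VmRSS (resident) for this process.
static void print_mem(const char *label)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL) return;
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "VmSize:", 7) == 0 ||
            strncmp(line, "VmRSS:", 6) == 0)
            printf("%-18s %s", label, line);
    }
    fclose(f);
}

int main(void)
{
    print_mem("before CUDA init:");
    cudaFree(0);   // cheap way to force creation of the CUDA context
    print_mem("after CUDA init:");
    return 0;
}

As I understand it, VmSize is the number a queuing system sees when it
enforces a virtual-memory limit, while VmRSS is what actually competes
for physical RAM.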
So while this is an issue, it is also a non-issue from almost all
perspectives EXCEPT queuing systems that don't correctly track the memory
usage of an application. It will happen with all CUDA codes. The solution
is:
1) Don't enforce memory limits - probably the easiest solution.
2) Figure out a way for the queuing system to look at resident memory
rather than virtual memory - I'm not sure if Slurm can do this, but if it
can't, someone should probably file a bug report with Slurm and find a way
to cross-link it with the NVIDIA driver people. (There is a rough config
sketch just after this list.)
I'd go for 1 since it is simple.
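
If someone does want to try option 2, here is a rough sketch of the knobs
I believe are relevant - I haven't tested this, so check the slurm.conf
and cgroup.conf man pages for your Slurm version:

# slurm.conf: don't derive a virtual-memory limit from the job's
# real-memory request (VSizeFactor=0 disables the virtual-memory limit),
# and hand memory enforcement to the cgroup task plugin
VSizeFactor=0
TaskPlugin=task/cgroup

# cgroup.conf: constrain the job by resident memory (RSS) via the cgroup
# memory controller rather than by virtual size
ConstrainRAMSpace=yes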
All the best
Ross
On 11/7/14, 11:28 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
>On Fri, 2014-11-07 at 11:37 -0700, Thomas Cheatham wrote:
>> Anybody have some ideas about this? Basically "cgroups" are a way to
>> create a virtual container in which you can restrict memory for a
>> sub-process, etc. (for example, to partition a node into two
>> independent halves). Thanks! --tom
>
>Just to add a little to Scott's comment: this seems to be an issue with
>the CUDA runtime in general. I ran a quick test on my machine where I
>started up CUDA-enabled VMD with a single PDB file, and another small
>simulation with OpenMM's CUDA platform.
>
>Both programs consumed around 37 GB of virtual memory (not real memory,
>though) on my desktop (which has 16 GB of RAM total). When pmemd.cuda
>runs on my machine, it consumes the same amount of virtual memory. I
>would try some of the CUDA SDK codes to confirm the issue in those, too,
>but they don't run long enough to actually monitor the memory usage.
>
>So it's definitely not just Amber -- it's every other CUDA-enabled
>program I tried running on my machine too. I know this has been
>discussed in a few threads in the past, but I couldn't seem to find them
>in the archives easily. (Not sure if it was amber-dev or amber, to be
>honest).
>
>This was all done with the NVIDIA driver 340.32 and CUDA 5.5 on my
>machine (although I've observed it with every other driver version I've
>had, too, which is quite a few).
>
>All the best,
>Jason
>
>--
>Jason M. Swails
>BioMaPS,
>Rutgers University
>Postdoctoral Researcher
>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers