Re: [AMBER-Developers] [campuschampions] Amber/cgroup/GPU (fwd)

From: Scott Le Grand <varelse2005.gmail.com>
Date: Fri, 7 Nov 2014 10:45:05 -0800

That's a driver issue. I have no idea what the root cause is, but it's not
AMBER...


On Fri, Nov 7, 2014 at 10:37 AM, Thomas Cheatham <tec3.utah.edu> wrote:

>
> Anybody have some ideas about this? Basically, "cgroups" are a way to
> create a virtual container in which you can restrict the memory available
> to a sub-process, etc. (for example, to partition a node into two
> independent halves). Thanks! --tom
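>
> P.S. In case it helps to see what the confinement looks like from inside a
> job, here is a minimal sketch (cgroup v1 with the usual /sys/fs/cgroup
> mount point -- the paths are assumptions, your site may differ):
>
>   /* compare the node's physical memory with the limit of the memory
>      cgroup this process was actually placed in (cgroup v1 layout) */
>   #include <stdio.h>
>   #include <string.h>
>   #include <unistd.h>
>
>   int main(void)
>   {
>       long long phys = (long long)sysconf(_SC_PHYS_PAGES)
>                        * sysconf(_SC_PAGE_SIZE);
>       printf("physical memory: %lld bytes\n", phys);
>
>       /* /proc/self/cgroup has lines like "9:memory:/slurm/uid_.../job_..." */
>       char line[4096], cgpath[4096] = "/";
>       FILE *f = fopen("/proc/self/cgroup", "r");
>       if (f) {
>           while (fgets(line, sizeof line, f)) {
>               char *p = strstr(line, ":memory:");
>               if (p) {
>                   snprintf(cgpath, sizeof cgpath, "%s", p + 8);
>                   cgpath[strcspn(cgpath, "\n")] = '\0';
>                   break;
>               }
>           }
>           fclose(f);
>       }
>
>       char limitfile[8192];
>       snprintf(limitfile, sizeof limitfile,
>                "/sys/fs/cgroup/memory%s/memory.limit_in_bytes", cgpath);
>       f = fopen(limitfile, "r");
>       if (f && fgets(line, sizeof line, f)) {
>           printf("cgroup limit   : %s", line);  /* can be far smaller */
>           fclose(f);
>       }
>       return 0;
>   }
>
> A code that sizes its buffers from the first number rather than the second
> will overshoot inside such a container.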
>
>
> ---------- Forwarded message ----------
> Date: Fri, 7 Nov 2014 11:15:33 -0700
> From: Kilian Cavalotti <kilian.stanford.edu>
> To: Thomas Cheatham <tec3.utah.edu>
> Subject: Re: [campuschampions] Amber/cgroup/GPU
>
> Hi Thomas,
>
> > Yes. I am an AMBER developer and also a domain champion for molecular
> > dynamics, whatever that means...
>
> > Contact me offlist with issues and we will try to help out. --tom
>
> Awesome! Thanks a lot for getting back to me.
>
> So here's the deal: we don't exactly know what the problem is right now,
> but we just noticed that running AMBER simulations (even just the ones
> from the benchmark suite) inside Slurm leads to the job being killed for
> exceeding its memory limit.
>
> Our Slurm config uses cgroups for limits enforcement, including memory
> (see http://slurm.schedmd.com/cgroups.html and
> http://slurm.schedmd.com/cgroup.conf.html for details).
>
> We tried to run the PME/JAC_production_NVE_4fs benchmark on 2 CPUs and 2
> GPUs, and it fails even with a memory limit set to 256GB (the amount of
> physical memory on the node). When run outside of Slurm, it works fine.
>
> When run on only 1 CPU, it works fine too, whatever the number of GPUs
> is (but I don't know if multiple GPUs are actually used in that case).
>
> From my limited understanding so far, it looks like AMBER tries to
> "probe" the amount of available memory when it starts. The problem is
> that the cgroup memory subsystem does not limit those allocations, and
> AMBER is allowed to allocate all of the node's physical memory. But when
> it tries to actually use it, the resident memory amount exceeds the
> limit at some point and it gets killed.
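>
> A minimal sketch of that behaviour (just an illustration of malloc-vs-touch
> under default Linux overcommit on a 64-bit node, not AMBER's actual
> allocation code):
>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <string.h>
>
>   int main(void)
>   {
>       size_t total = 64UL << 30;   /* ask for 64 GiB of address space  */
>       size_t chunk = 1UL << 30;    /* then touch it 1 GiB at a time    */
>       char *buf = malloc(total);
>
>       if (buf == NULL) {           /* rarely fails: only address space */
>           perror("malloc");        /* is reserved at this point        */
>           return 1;
>       }
>       printf("malloc of 64 GiB succeeded; virtual size is already huge\n");
>
>       for (size_t off = 0; off < total; off += chunk) {
>           memset(buf + off, 1, chunk);  /* resident usage grows here,  */
>                                         /* and this is what the memory */
>                                         /* cgroup actually constrains  */
>           printf("touched %zu GiB\n", (off + chunk) >> 30);
>       }
>       free(buf);
>       return 0;
>   }
>
> Inside a memory cgroup the malloc goes through, and it is only once the
> memsets push resident usage past the limit that the kernel (or slurmstepd)
> steps in -- which matches what we see.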
>
> This leads to messages like:
>
> slurmstepd: Job 1007091 exceeded virtual memory limit (297308404 >
> 262144000), being killed
> slurmstepd: Exceeded job memory limit
>
> And yes, 262144000 is in KB, so it's about 256GB!
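>
> (Doing the conversion, taking those figures as KiB:
>     262144000 / 1024^2  ~=  250 GiB   (the limit)
>     297308404 / 1024^2  ~= 283.5 GiB  (what the job had reached)
> so the reported virtual size was already above the node's physical memory,
> which fits the allocate-first, touch-later picture above.)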
>
> So I was wondering if you could shed any light on this problem, or if
> you have experience running AMBER in a Slurm environment where memory
> limits are enforced with cgroups. And I forgot to mention, that's AMBER
> 14 with the latest patches as of last week.
>
> Thanks!
> --
> Kilian
>
>
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Nov 07 2014 - 11:00:03 PST