Re: [AMBER-Developers] [campuschampions] Amber/cgroup/GPU (fwd)

From: Thomas Cheatham <tec3.utah.edu>
Date: Fri, 7 Nov 2014 11:37:12 -0700 (MST)

Anybody have some ideas about this? Basically, "cgroups" are a way to
create a virtual container in which you can restrict the memory
available to a sub-process, etc. (for example, to partition a node into
two independent halves). Thanks! --tom


---------- Forwarded message ----------
Date: Fri, 7 Nov 2014 11:15:33 -0700
From: Kilian Cavalotti <kilian.stanford.edu>
To: Thomas Cheatham <tec3.utah.edu>
Subject: Re: [campuschampions] Amber/cgroup/GPU

Hi Thomas,

> Yes. I am an AMBER developer and also a domain champion for molecular
> dynamics, whatever that means...

> Contact me offlist with issues and we will try to help out. --tom

Awesome! Thanks a lot for getting back to me.

So here's the deal: we don't exactly know what the problem is right now,
but we just noticed that running AMBER simulations (even just the ones
from the benchmark suite) inside Slurm leads to the job being killed for
exceeding its memory limit.

Our Slurm config uses cgroups for limits enforcement, including memory
(see http://slurm.schedmd.com/cgroups.html and
http://slurm.schedmd.com/cgroup.conf.html for details).
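
For reference, a quick way to see the limit a job actually ends up with
is to read the memory controller's limit file from inside the job. A
minimal C sketch (assuming cgroup v1 with the memory controller mounted
at /sys/fs/cgroup/memory, which is the usual setup Slurm relies on; the
/proc/self/cgroup parsing is deliberately simplified and this is not
AMBER code):

/* Sketch: print the memory limit of the cgroup this process runs in.
 * Assumes cgroup v1 with the memory controller mounted at
 * /sys/fs/cgroup/memory, and a /proc/self/cgroup line of the simple
 * form "N:memory:/path" (combined controller lists like
 * "cpuacct,memory" are not handled). */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    char cgpath[4096] = "";   /* cgroup path of this process, e.g. /slurm/uid_.../job_... */

    FILE *fp = fopen("/proc/self/cgroup", "r");
    if (fp) {
        while (fgets(line, sizeof(line), fp)) {
            char *p = strstr(line, ":memory:");
            if (p) {
                snprintf(cgpath, sizeof(cgpath), "%s", p + strlen(":memory:"));
                cgpath[strcspn(cgpath, "\n")] = '\0';
                break;
            }
        }
        fclose(fp);
    }

    char limitfile[8192];
    snprintf(limitfile, sizeof(limitfile),
             "/sys/fs/cgroup/memory%s/memory.limit_in_bytes", cgpath);

    unsigned long long limit;
    fp = fopen(limitfile, "r");
    if (fp && fscanf(fp, "%llu", &limit) == 1)
        printf("cgroup memory limit: %llu bytes\n", limit);
    else
        printf("could not read %s\n", limitfile);
    if (fp)
        fclose(fp);
    return 0;
}

Run inside a Slurm job step this prints the job's cgroup limit rather
than the node's total physical memory, which is the number an
application would need to respect.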

We tried to run the PME/JAC_production_NVE_4fs benchmark on 2 CPUs and 2
GPUs, and it fails even with a memory limit set to 256GB (the amount of
physical memory on the node). When run outside of Slurm, it works fine.

When run on only 1 CPU, it works fine too, regardless of the number of
GPUs (though I don't know whether multiple GPUs are actually used in
that case).

From my limited understanding so far, it looks like AMBER tries to
"probe" the amount of available memory when it starts. The problem is
that the cgroup memory subsystem does not limit those virtual
allocations, so AMBER is allowed to allocate all of the node's physical
memory. But when it actually starts using that memory, the resident set
exceeds the limit at some point and the job gets killed.
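
That theory is easy to illustrate outside of AMBER (the following is
just a standalone sketch, not AMBER code): with Linux overcommit, a
huge malloc() can succeed even inside a memory-limited cgroup, because
the memory controller only charges pages once they are actually
touched. Run under a tight job memory limit, something like this
allocates fine and then gets killed while touching the pages:

/* Sketch, not AMBER code: inside a cgroup with (say) a 1 GB memory limit,
 * the 8 GiB virtual allocation below will typically succeed because of
 * Linux overcommit; it is the memset() faulting the pages in that pushes
 * resident memory past the limit and gets the process killed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t nbytes = 8ULL * 1024 * 1024 * 1024;   /* 8 GiB of virtual address space */

    char *buf = malloc(nbytes);   /* usually succeeds: nothing is resident yet */
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    printf("allocated %zu bytes of virtual memory\n", nbytes);

    memset(buf, 1, nbytes);       /* touching the pages is what counts against
                                     the cgroup limit -> the kill happens here */
    printf("touched every page without being killed\n");

    free(buf);
    return 0;
}

Outside of a constrained cgroup (or with the limit raised well above
8 GiB) the same program runs to completion, which matches what we see
with the benchmarks outside of Slurm.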

This leads to messages like:

slurmstepd: Job 1007091 exceeded virtual memory limit (297308404 >
262144000), being killed
slurmstepd: Exceeded job memory limit

And yes, 262144000 is in KB, so it's about 256GB!

So I was wondering if you could shed any light on this problem, or if
you have experience running AMBER in a Slurm environment where memory
limits are enforced with cgroups. And I forgot to mention, that's AMBER
14 with the latest patches as of last week.

Thanks!
-- 
Kilian
_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Nov 07 2014 - 11:00:02 PST