Re: [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

Aditya Kali <adityakali@xxxxxxxxxx> · Thu, 31 Jul 2014 12:48:43 -0700

On Thu, Jul 24, 2014 at 10:01 AM, Serge Hallyn <serge.hallyn@xxxxxxxxxx> wrote:
> Quoting Aditya Kali (adityakali@xxxxxxxxxx):
>> CLONE_NEWCGROUP will be used to create new cgroup namespace.
>>
>
> This is fine and I'm not looking to bikeshed, but am wondering - did
> you consider any other ways beside unshare (i.e. a new mount option
> to cgroupfs)?  If so, do you have a list of the downsides of those?
> (I mainly ask bc clone flags are still a scarce commodity)
>

I did consider couple of other ways:

(1) having a cgroup.ns_root (or something) cgroup file. If this value
is '1', it would mean that all processes it and its descendant cgroups
will have their cgroup paths in /proc/self/cgroup terminated at this
cgroup.
 For ex:
[A] --> [B] --> C
    | --> [D] --> E

[A], [B] and [D] has cgroup.ns_root = 1.
* all processes in cgroup C & E will see their cgroup path as /C and
/E respectively
* all processes in cgroup B & D will see their own cgroup path as /

In this model, its easy to know what to show if process is looking at
its own cgroup paths (/proc/self/cgroup). It gets tricky when you are
looking at other process's /proc/<pid>/cgroup. We may be able to come
up with some hacky way read correct value, but depending on the
cgroupfs mount, it may not make sense.
One other major drawback of this approach is that "every" process in
the cgroup will now get a restricted view. i.e., you cannot change
cgroups without affecting your view. And this is undesirable for
administrative processes.

(2) Another idea that I didn't pursue further (and is a bit hacky as
above) was having cgroup.ns_procs (like cgroup.procs, but all the pids
in cgroup.ns_procs will have their /proc/self/cgroup restricted).
Writing a pid to cgroup.ns_procs implies that you are writing it to
cgroup.procs too. But, not vise-versa. So, you could move yourself in
another cgroup by writing your pid in cgroup.procs, but not in
cgroup.ns_procs, thus preventing from getting "rooted". I This was to
solve administrative process issue in the above appraoch. But I think
this is very clunky too and I find semantics for this approach to be
non-intuitive. It almost looks like moving towards a separate "ns"
subsystem. But as we already know, its a path to failure.

I didn't think of using a mount option. I imagine the mount option
(something like -o root=/bathjobs/container_1) could be used to
restrict the visibility of cgroupfs inside the container's mount
namespace. i.e., the value you read from /proc/<pid>/cgroup now
depends on what mount namespace you are in. Its similar to cgroup
namespace, but just that the cgroupns_root is now stored in the
'struct mnt_namespace' instead of a separate 'struct
cgroup_namespace'. But, since mount namespace on creation inherits
mounts from its parent, the first cgroupfs mount in a mount namespace
is now treated specially. Also, its not possible to restrict cgroups
without mount namespace now. This is interesting and may not be too
bad. I am willing to give this a try. But I feel the cgroup namespace
approach fits well in-line with other namespaces where it does one
thing - virtualize the view of /proc/<pid>/cgroup file for processes
inside the namespace. The semantics are more intuitive as they are
similar to other namespaces.

Thanks,

>> Signed-off-by: Aditya Kali <adityakali@xxxxxxxxxx>
>
> Acked-by: Serge E. Hallyn <serge.hallyn@xxxxxxxxxx>
>
>> ---
>>  include/uapi/linux/sched.h | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
>> index 34f9d73..2f90d00 100644
>> --- a/include/uapi/linux/sched.h
>> +++ b/include/uapi/linux/sched.h
>> @@ -21,8 +21,7 @@
>>  #define CLONE_DETACHED               0x00400000      /* Unused, ignored */
>>  #define CLONE_UNTRACED               0x00800000      /* set if the tracing process can't force CLONE_PTRACE on this clone */
>>  #define CLONE_CHILD_SETTID   0x01000000      /* set the TID in the child */
>> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
>> -   and is now available for re-use. */
>> +#define CLONE_NEWCGROUP              0x02000000      /* New cgroup namespace */
>>  #define CLONE_NEWUTS         0x04000000      /* New utsname group? */
>>  #define CLONE_NEWIPC         0x08000000      /* New ipcs */
>>  #define CLONE_NEWUSER                0x10000000      /* New user namespace */
>> --
>> 2.0.0.526.g5318336
>>
>> _______________________________________________
>> Containers mailing list
>> Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
>> https://lists.linuxfoundation.org/mailman/listinfo/containers

-- 
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html