On 2024/8/13 0:57, Michal Koutný wrote:
Hello.
On Mon, Aug 12, 2024 at 07:37:46AM GMT, Chen Ridong <chenridong@xxxxxxxxxx> wrote:
cd /sys/fs/cgroup
echo '+pids' > cgroup.subtree_control
mkdir dlgt_grp_ns
echo '+pids' > dlgt_grp_ns/cgroup.subtree_control
mkdir dlgt_grp_ns/dlgt_grp_ns1
echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs
echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max
unshare -Cm /bin/bash
echo max > /dlgt_grp_ns1/pids.max // Permission denied
echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now
You could also have increased the ancestral limit (if there was any)
echo max > dlgt_grp_ns/pids.max // similarly allowed
If you're a root (or otherwise have sufficient permissions) and you can
_see_ an ancestral cgroup, you can write to its attributes according to
permissions. Thus the delegation works via cgroup ns (in)visibility but
cgroup ns root is visible on both sides of the boundary hence the extra
check.
I can imagine that a container runtime process could enter the target
cgroup ns while keeping visibility to the original cgroup ns and do some
tuning on it before it drops any pointers to the original cgroup ns and
exec's delegatee's workload. (But it's only my imagination to illustrate
that this may be a breaking change.)
Yes, this is a change. However, the process within the container can
turn on subsystems , they can also turn off subsystems , which may not
only affect the host machine but may also affect other containers. I
believe this poses a danger.
OTOH, I can see why this would be consistent with the migration rules
that exist between sides of cgroup ns, so this could work if it was
hidden behind (another) mount option like 'nsdelegate2' :-p
Hello, Michal, I am sorry I can not get your point.
Did you mean this issue should be fixed with another mount option
'nsdelegate2'? Or is the current patch acceptable?
@@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
* cgroup.procs, cgroup.threads and cgroup.subtree_control.
*/
if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
- !(cft->flags & CFTYPE_NS_DELEGATABLE) &&
- ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
+ ctx->ns != &init_cgroup_ns &&
+ (!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) ||
+ (!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
+ ctx->ns->root_cset->dfl_cgrp == cgrp)))
return -EPERM;
Could you please also update the comment above, to describe the boundary
vs subtree delegation?
Thanks,
Michal
Sure.
Thanks
Ridong