Re: [PATCH 3/5] cgroup: Fix refcounting

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sat, 5 Jan 2019 06:37:03 +0000

On Thu, Jan 03, 2019 at 11:44:16PM +0000, David Howells wrote:
> Fix cgroup refcounting by the following means:
> 
>  (1) Don't use PERCPU_REF_INIT_DEAD and percpu_ref_reinit().  Using this
>      causes a problem should kernfs_get_tree() create a superblock and then
>      fail to create a root inode.  The superblock destructor will be invoked
>      before kernfs_get_tree() returns - and the refcount isn't reinit'd till
>      after that.
> 
>      To this end, cgroup_setup_root() no longer needs a ref_flags argument.
> 
>  (2) Provide a flag, CSS_CREATING, that is used to prevent concurrent access
>      to a cgroup root that is being set up and for which the superblock is
>      still being set up.  This appears to be necessary to hide the fact that
>      the cgroup is accessible before the superblock is fully initialised.
> 
>  (3) Set CSS_CREATING in cgroup1_get_tree() on a new cgroup_root.  This is
>      then cleared in cgroup_do_get_tree().
> 
>  (4) cgroup_get_tree() is made to call cgroup_get() on the root it sets for a
>      v2 cgroup.  Admittedly, this doesn't do anything because CSS_NO_REF is
>      set, but it future proofs it in case this is changed in future.
> 
>  (5) cgroup_fill_super() is created and passed to kernfs_get_tree() in the
>      kernfs_fs_context struct.  This takes an extra ref on the root for the
>      superblock in the event that a superblock is created.
> 
>      struct cgroup_fs_context::root then holds a single ref on the root right
>      through till it is freed.
> 
> Note that new_root is transferred into the cgroup_fs_context as
> is_new_root, though this is probably unnecessary as it's only used to clear
> CSS_CREATING - and no one else can gain access to the root until we've
> cleared the flag.

Umm...  It'll need reordering; could you do CSS_CREATING parts on top of
mainline, with the rest rediffed on top of that?  We'll need to backport
the fix, obviously...

Re (4) - that's over the top for backporting.  Sure, the analogue of the
problem does exist in mainline - 
        if (IS_ERR(dentry) || !new_sb)
                cgroup_put(&root->cgrp);
can't tell kernfs_mount_ns() failing very early from kernfs_fill_super()
failing and triggering ->kill_sb().  The latter case has a reference
dropped by cgroup_kill_sb(); in the former nothing of that sort
happens.  Said that, we have the following cases:
	1) early failure (anything up to sget_userns() in mainline).
ERR_PTR() is returned, reference not consumed.
	2) successful reuse of existing superblock.  Normal dentry
returned, new_sb set to false, reference not consumed.
	3) new superblock, with failing kernfs_fill_super().
ERR_PTR() returned, new_sb set to true.  Reference consumed.
	4) new superblock, successful setup.  Normal dentry returned,
new_sb set to true.  Reference consumed.

Cases (2) and (4) (!IS_ERR(dentry)) are fine - reference is consumed
iff new_sb is true.  Case (1) is also fine - reference is not consumed.
Case (3) is where we go wrong - the test comes up with "do cgroup_put()",
which we don't want.

Note that case (1) leaves new_sb uninitialized.  And *that* is the
root cause of this shit - if we initialized it (in cgroup_do_mount())
to false, the test would've been simply "if (!new_sb)".  In all
four cases.

That's precisely the same kind of bug as the one fixed in
7b745a4e4051 ("unfuck sysfs_mount()") last May...