On Thu, Jan 03, 2019 at 11:44:16PM +0000, David Howells wrote: > Fix cgroup refcounting by the following means: > > (1) Don't use PERCPU_REF_INIT_DEAD and percpu_ref_reinit(). Using this > causes a problem should kernfs_get_tree() create a superblock and then > fail to create a root inode. The superblock destructor will be invoked > before kernfs_get_tree() returns - and the refcount isn't reinit'd till > after that. > > To this end, cgroup_setup_root() no longer needs a ref_flags argument. > > (2) Provide a flag, CSS_CREATING, that is used to prevent concurrent access > to a cgroup root that is being set up and for which the superblock is > still being set up. This appears to be necessary to hide the fact that > the cgroup is accessible before the superblock is fully initialised. > > (3) Set CSS_CREATING in cgroup1_get_tree() on a new cgroup_root. This is > then cleared in cgroup_do_get_tree(). > > (4) cgroup_get_tree() is made to call cgroup_get() on the root it sets for a > v2 cgroup. Admittedly, this doesn't do anything because CSS_NO_REF is > set, but it future proofs it in case this is changed in future. > > (5) cgroup_fill_super() is created and passed to kernfs_get_tree() in the > kernfs_fs_context struct. This takes an extra ref on the root for the > superblock in the event that a superblock is created. > > struct cgroup_fs_context::root then holds a single ref on the root right > through till it is freed. > > Note that new_root is transferred into the cgroup_fs_context as > is_new_root, though this is probably unnecessary as it's only used to clear > CSS_CREATING - and no one else can gain access to the root until we've > cleared the flag. Umm... It'll need reordering; could you do CSS_CREATING parts on top of mainline, with the rest rediffed on top of that? We'll need to backport the fix, obviously... Re (4) - that's over the top for backporting. Sure, the analogue of the problem does exist in mainline - if (IS_ERR(dentry) || !new_sb) cgroup_put(&root->cgrp); can't tell kernfs_mount_ns() failing very early from kernfs_fill_super() failing and triggering ->kill_sb(). The latter case has a reference dropped by cgroup_kill_sb(); in the former nothing of that sort happens. Said that, we have the following cases: 1) early failure (anything up to sget_userns() in mainline). ERR_PTR() is returned, reference not consumed. 2) successful reuse of existing superblock. Normal dentry returned, new_sb set to false, reference not consumed. 3) new superblock, with failing kernfs_fill_super(). ERR_PTR() returned, new_sb set to true. Reference consumed. 4) new superblock, successful setup. Normal dentry returned, new_sb set to true. Reference consumed. Cases (2) and (4) (!IS_ERR(dentry)) are fine - reference is consumed iff new_sb is true. Case (1) is also fine - reference is not consumed. Case (3) is where we go wrong - the test comes up with "do cgroup_put()", which we don't want. Note that case (1) leaves new_sb uninitialized. And *that* is the root cause of this shit - if we initialized it (in cgroup_do_mount()) to false, the test would've been simply "if (!new_sb)". In all four cases. That's precisely the same kind of bug as the one fixed in 7b745a4e4051 ("unfuck sysfs_mount()") last May...