On Sat, Jan 05, 2019 at 06:37:03AM +0000, Al Viro wrote: > Cases (2) and (4) (!IS_ERR(dentry)) are fine - reference is consumed > iff new_sb is true. Case (1) is also fine - reference is not consumed. > Case (3) is where we go wrong - the test comes up with "do cgroup_put()", > which we don't want. > > Note that case (1) leaves new_sb uninitialized. And *that* is the > root cause of this shit - if we initialized it (in cgroup_do_mount()) > to false, the test would've been simply "if (!new_sb)". In all > four cases. > > That's precisely the same kind of bug as the one fixed in > 7b745a4e4051 ("unfuck sysfs_mount()") last May... Egads... That's not all there is in there; kernfs_node_dentry() failures are... not handled well. To start with, kernfs_node_dentry() leaks. Consider this: struct dentry *kernfs_node_dentry(struct kernfs_node *kn, struct super_block *sb) { ... dentry = dget(sb->s_root); /* Check if this is the root kernfs_node */ if (!kn->parent) return dentry; OK, it might (and normally will) return an extra reference to a dentry on the given superblock. However, knparent = find_next_ancestor(kn, NULL); if (WARN_ON(!knparent)) return ERR_PTR(-EINVAL); right after that will return ERR_PTR() *and* acquire an extra reference to the root dentry of given superblock. Then we have a loop: do { struct dentry *dtmp; struct kernfs_node *kntmp; if (kn == knparent) return dentry; kntmp = find_next_ancestor(kn, knparent); if (WARN_ON(!kntmp)) return ERR_PTR(-EINVAL); dtmp = lookup_one_len_unlocked(kntmp->name, dentry, strlen(kntmp->name)); Get an extra reference to child of dentry or get ERR_PTR() dput(dentry); ... and drop the parent in either case. if (IS_ERR(dtmp)) return dtmp; knparent = kntmp; dentry = dtmp; } while (true); So normally we acquire an return an extra reference to a dentry on our superblock. In case of error returned by lookup_one_len_unlocked() we pass that error along; no effect on refcounts. And in case of that NULL from find_next_ancestor() we return an error *and* leave behind an extra reference to some dentry on the superblock. With no way for caller to tell which dentry would that be. IOW, these two if (WARN_ON(!...)) should do dput(dentry) as well, but the trouble doesn't end there. These, after all, appear to be "never can happen" cases; failure of lookup_one_len_unlocked() is really possible (with -ENOMEM, if nothing else). And the caller (cgroup_do_mount()) is doing this: if (!IS_ERR(dentry) && ns != &init_cgroup_ns) { struct dentry *nsdentry; struct cgroup *cgrp; mutex_lock(&cgroup_mutex); spin_lock_irq(&css_set_lock); cgrp = cset_cgroup_from_root(ns->root_cset, root); spin_unlock_irq(&css_set_lock); mutex_unlock(&cgroup_mutex); nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb); dput(dentry); dentry = nsdentry; } Here dentry is equal to the root of our superblock. The reference to dentry is held, so's an active reference to superblock in question and so is ->s_umount of that superblock. In success case we end up acquiring an extra reference to dentry on the same superblock, dropping the reference we used to hold on the original and use that new dentry instead. In failure case, though, we end up with * no extra references to any dentries on that superblock * active reference held to the superblock itself * ->s_umount held on it ... and the error is returned to caller and propagated back to caller of ->mount(). Which has no way to tell which superblock have we leaked (locked, at that). IOW, deactivate_locked_super() was missing not just in David's rewrite - it's needed in the mainline as well...