Re: [PATCH 3/5] cgroup: Fix refcounting

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sat, 5 Jan 2019 16:05:24 +0000

On Sat, Jan 05, 2019 at 06:37:03AM +0000, Al Viro wrote:

> Cases (2) and (4) (!IS_ERR(dentry)) are fine - reference is consumed
> iff new_sb is true.  Case (1) is also fine - reference is not consumed.
> Case (3) is where we go wrong - the test comes up with "do cgroup_put()",
> which we don't want.
> 
> Note that case (1) leaves new_sb uninitialized.  And *that* is the
> root cause of this shit - if we initialized it (in cgroup_do_mount())
> to false, the test would've been simply "if (!new_sb)".  In all
> four cases.
> 
> That's precisely the same kind of bug as the one fixed in
> 7b745a4e4051 ("unfuck sysfs_mount()") last May...

Egads...  That's not all there is in there; kernfs_node_dentry() failures
are... not handled well.

To start with, kernfs_node_dentry() leaks.  Consider this:
struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
                                  struct super_block *sb)
{
...
        dentry = dget(sb->s_root);

        /* Check if this is the root kernfs_node */
        if (!kn->parent)
                return dentry;
OK, it might (and normally will) return an extra reference to a dentry on
the given superblock.  However,
        knparent = find_next_ancestor(kn, NULL);
        if (WARN_ON(!knparent))
                return ERR_PTR(-EINVAL);
right after that will return ERR_PTR() *and* acquire an extra reference to
the root dentry of given superblock.  Then we have a loop:
        do {
                struct dentry *dtmp;
                struct kernfs_node *kntmp;

                if (kn == knparent)
                        return dentry;
                kntmp = find_next_ancestor(kn, knparent);
                if (WARN_ON(!kntmp))
                        return ERR_PTR(-EINVAL);
                dtmp = lookup_one_len_unlocked(kntmp->name, dentry,
                                               strlen(kntmp->name));
Get an extra reference to child of dentry or get ERR_PTR()
                dput(dentry);
... and drop the parent in either case.
                if (IS_ERR(dtmp))
                        return dtmp;
                knparent = kntmp;
                dentry = dtmp;
        } while (true);

So normally we acquire an return an extra reference to a dentry on our
superblock.  In case of error returned by lookup_one_len_unlocked()
we pass that error along; no effect on refcounts.  And in case of
that NULL from find_next_ancestor() we return an error *and* leave
behind an extra reference to some dentry on the superblock.  With
no way for caller to tell which dentry would that be.

IOW, these two if (WARN_ON(!...)) should do dput(dentry) as well,
but the trouble doesn't end there.  These, after all, appear to
be "never can happen" cases; failure of lookup_one_len_unlocked()
is really possible (with -ENOMEM, if nothing else).  And the caller
(cgroup_do_mount()) is doing this:
        if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
                struct dentry *nsdentry;
                struct cgroup *cgrp;

                mutex_lock(&cgroup_mutex);
                spin_lock_irq(&css_set_lock);

                cgrp = cset_cgroup_from_root(ns->root_cset, root);

                spin_unlock_irq(&css_set_lock);
                mutex_unlock(&cgroup_mutex);

                nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
                dput(dentry);
                dentry = nsdentry;
        }

Here dentry is equal to the root of our superblock.  The reference to
dentry is held, so's an active reference to superblock in question and
so is ->s_umount of that superblock.

In success case we end up acquiring an extra reference to dentry on
the same superblock, dropping the reference we used to hold on the
original and use that new dentry instead.

In failure case, though, we end up with
	* no extra references to any dentries on that superblock
	* active reference held to the superblock itself
	* ->s_umount held on it
... and the error is returned to caller and propagated back to
caller of ->mount().  Which has no way to tell which superblock
have we leaked (locked, at that).  IOW, deactivate_locked_super()
was missing not just in David's rewrite - it's needed in the
mainline as well...