Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 9 Jun 2023 09:17:54 +1000

On Thu, Jun 08, 2023 at 12:36:22PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > > Hm, it makes the API more complex and easier to mess with. Like what will happen
> > > if the second part is never called? Or it's called without the first part being
> > > called first?
> > 
> > Bad things.
> > 
> > Also, it doesn't fix the three other unregister_shrinker() calls in
> > the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> > path.
> > 
> > Those are just some of the unregister_shrinker() calls that have
> > dynamic contexts that would also need this same fix; I haven't
> > audited the 3 dozen other unregister_shrinker() calls around the
> > kernel to determine if any of them need similar treatment, too.
> > 
> > IOWs, this patchset is purely a band-aid to fix the reported
> > regression, not an actual fix for the underlying problems caused by
> > moving the shrinker infrastructure to SRCU protection.  This is why
> > I really want the SRCU changeover reverted.
> 
> There's been so much traffic on linux-fsdevel so I missed this thread
> until Darrick pointed it out to me today.  (Thanks, Darrick!)
> 
> From his description, and my quick read-through of this thread....
> I'm worried.
> 
> Given that we're at -rc5 now, and the file system folks didn't get
> consulted until fairly late in the progress, and the fact that this
> may cause use-after-free problems that could lead to security issues,
> perhaps we shoould consider reverting the SRCU changeover now, and try
> again for the next merge window?

Yes, please, because I think we can fix this in a much better way
and make things a whole lot simpler at the same time.

The root cause of the SRCU usage is that mm/memcg developers still
haven't unified the non-memcg and the memcg based shrinker
implementations. shrink_slab_memcg() doesn't require SRCU
protection as it does not iterate the shrinker list at all; it
requires *shrinker instance* lifetime protection. The problem is
shrink_slab() doing the root memcg/global shrinker work - it
iterates the shrinker list directly, and this is the list walk that
SRCU is necessary for to "make shrinkers lockless"

Going back to shrink_slab_memcg(), it does a lookup of the shrinker
instance by idr_find(); idr_find() is a wrapper around
radix_tree_lookup(), which means we can use RCU protection and
reference counting to both validate the shrinker instance *and*
guarantee that it isn't free from under us as we execute the
shrinker.

This requires, as I mentioned elsewhere in this thread, that the
shrinker instance to be dynamically allocated, not embedded in other
structures. Dynamically allocated shrinker instances and reference
counting them means we can do this in shrink_slab_memcg() to ensure
shrinker instance validity and lifetime rules are followed:

	rcu_read_lock()
	shrinker = idr_find(&shrinker_idr, i);
	if (!shrinker ||
	    !refcount_inc_not_zero(&shrinker->refcount)) {
		/* shrinker is being torn down */
		clear_bit(i, info->map);
		rcu_read_unlock();
		continue;
	}
	rcu_read_unlock();

	/* do shrinker stuff */

	if (refcount_dec_and_test(&shrinker->refcount)) {
		/* shrinker is being torn down, waiting for us */
		wakeup(&shrinker->completion_wait);
	}
	/* unsafe to reference shrinker now */

And we enable the shrinker to run simply by:

shrinker_register()
{
	.....
	/* allow the shrinker to run now */
	refcount_set(shrinker->refcount, 1);
	return 0;
}

We then shut down the shrinker so we can tear it down like so:

shrinker_unregister()
{
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue_exclusive(shrinker->completion_wait, &wait);
	if (!refcount_dec_and_test(&shrinker->refcount))) {
		/* Wait for running instances to exit */
		__set_current_state(TASK_UNINTERRUPTIBLE);
		schedule();
	}
	remove_wait_queue(wq, &wait);

	/* We own the entire shrinker instance now, start tearing it down */

	.....

	/* Free the shrinker itself after a RCU grace period expires */
	kfree_rcu(shrinker, shrinker->rcu_head);
}

So, essentially we don't need SCRU at all to do lockless shrinker
lookup for the memcg shrinker side of the fence, nor do we need
synchronise_srcu() to wait for shrinker instances to finish running
before we can tear stuff down. There is no global state in this at
all; everything is per-shrinker instance.

But SRCU is needed to protect the global shrinker list walk because
it hasn't been converted to always use the root memcg and be
iterated as if it is just another memcg.  IOWs, using SRCU to
protect the global shrinker list walk is effectively slapping a
bandaid over a more fundamental problem that we've known about for
a long time.

So the first thing that has to be done here is unify the shrinker
infrastructure under the memcg implementation. The global shrinker
state should be tracked in the root memcg, just like any other memcg
shrinker is tracked. If memcg's are not enabled, then we should be
creating a dummy global memcg that a get_root_memcg() helper returns
when memcgs are disabled to tracks all the global shrinker state.
then shrink_slab just doesn't care about what memcg is passed to it,
it just does the same thing.

IOWs, shrink_slab gets entirely replaced by shrink_slab_memcg(), and
the need for SRCU goes away entirely because shrinker instance
lookups are all RCU+refcount protected.

Yes, this requires we change how shrinker instances are instantiated
by subsystems, but this is mostly simple transformation of existing
code. But it doesn't require changing shrinker implementations, it
doesn't require a new teardown API, and it doesn't change the
context under which shrinkers might run.

All the existing RCU protected stuff in the shrinker maps and memcgs
can remain that way.  We can also keep the shrinker rwsem/mutex for
all the little internal bits of setup/teardown/non-rcu coordination
that are needed; once the list iteration is lockless, there will be
almost no contention on that lock at all...

This isn't all that hard - it's just replicating a well known design
pattern (i.e. refcount+RCU) we use all over the kernel combined with
taking advantage of IDR being backed by RCU-safe infrastructure.

If I had been cc'd on the original SRCU patches, this is exactly
what I would have suggested as a better solution to the problem. We
end up with cleaner, more robust and better performing
infrastructure. This is far better than layering more complexity on
top of what is really a poor foundation....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx