On Thu, Jun 08, 2023 at 12:36:22PM -0400, Theodore Ts'o wrote: > On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote: > > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote: > > > Hm, it makes the API more complex and easier to mess with. Like what will happen > > > if the second part is never called? Or it's called without the first part being > > > called first? > > > > Bad things. > > > > Also, it doesn't fix the three other unregister_shrinker() calls in > > the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount > > path. > > > > Those are just some of the unregister_shrinker() calls that have > > dynamic contexts that would also need this same fix; I haven't > > audited the 3 dozen other unregister_shrinker() calls around the > > kernel to determine if any of them need similar treatment, too. > > > > IOWs, this patchset is purely a band-aid to fix the reported > > regression, not an actual fix for the underlying problems caused by > > moving the shrinker infrastructure to SRCU protection. This is why > > I really want the SRCU changeover reverted. > > There's been so much traffic on linux-fsdevel so I missed this thread > until Darrick pointed it out to me today. (Thanks, Darrick!) > > From his description, and my quick read-through of this thread.... > I'm worried. > > Given that we're at -rc5 now, and the file system folks didn't get > consulted until fairly late in the progress, and the fact that this > may cause use-after-free problems that could lead to security issues, > perhaps we shoould consider reverting the SRCU changeover now, and try > again for the next merge window? Yes, please, because I think we can fix this in a much better way and make things a whole lot simpler at the same time. The root cause of the SRCU usage is that mm/memcg developers still haven't unified the non-memcg and the memcg based shrinker implementations. shrink_slab_memcg() doesn't require SRCU protection as it does not iterate the shrinker list at all; it requires *shrinker instance* lifetime protection. The problem is shrink_slab() doing the root memcg/global shrinker work - it iterates the shrinker list directly, and this is the list walk that SRCU is necessary for to "make shrinkers lockless" Going back to shrink_slab_memcg(), it does a lookup of the shrinker instance by idr_find(); idr_find() is a wrapper around radix_tree_lookup(), which means we can use RCU protection and reference counting to both validate the shrinker instance *and* guarantee that it isn't free from under us as we execute the shrinker. This requires, as I mentioned elsewhere in this thread, that the shrinker instance to be dynamically allocated, not embedded in other structures. Dynamically allocated shrinker instances and reference counting them means we can do this in shrink_slab_memcg() to ensure shrinker instance validity and lifetime rules are followed: rcu_read_lock() shrinker = idr_find(&shrinker_idr, i); if (!shrinker || !refcount_inc_not_zero(&shrinker->refcount)) { /* shrinker is being torn down */ clear_bit(i, info->map); rcu_read_unlock(); continue; } rcu_read_unlock(); /* do shrinker stuff */ if (refcount_dec_and_test(&shrinker->refcount)) { /* shrinker is being torn down, waiting for us */ wakeup(&shrinker->completion_wait); } /* unsafe to reference shrinker now */ And we enable the shrinker to run simply by: shrinker_register() { ..... /* allow the shrinker to run now */ refcount_set(shrinker->refcount, 1); return 0; } We then shut down the shrinker so we can tear it down like so: shrinker_unregister() { DECLARE_WAITQUEUE(wait, current); add_wait_queue_exclusive(shrinker->completion_wait, &wait); if (!refcount_dec_and_test(&shrinker->refcount))) { /* Wait for running instances to exit */ __set_current_state(TASK_UNINTERRUPTIBLE); schedule(); } remove_wait_queue(wq, &wait); /* We own the entire shrinker instance now, start tearing it down */ ..... /* Free the shrinker itself after a RCU grace period expires */ kfree_rcu(shrinker, shrinker->rcu_head); } So, essentially we don't need SCRU at all to do lockless shrinker lookup for the memcg shrinker side of the fence, nor do we need synchronise_srcu() to wait for shrinker instances to finish running before we can tear stuff down. There is no global state in this at all; everything is per-shrinker instance. But SRCU is needed to protect the global shrinker list walk because it hasn't been converted to always use the root memcg and be iterated as if it is just another memcg. IOWs, using SRCU to protect the global shrinker list walk is effectively slapping a bandaid over a more fundamental problem that we've known about for a long time. So the first thing that has to be done here is unify the shrinker infrastructure under the memcg implementation. The global shrinker state should be tracked in the root memcg, just like any other memcg shrinker is tracked. If memcg's are not enabled, then we should be creating a dummy global memcg that a get_root_memcg() helper returns when memcgs are disabled to tracks all the global shrinker state. then shrink_slab just doesn't care about what memcg is passed to it, it just does the same thing. IOWs, shrink_slab gets entirely replaced by shrink_slab_memcg(), and the need for SRCU goes away entirely because shrinker instance lookups are all RCU+refcount protected. Yes, this requires we change how shrinker instances are instantiated by subsystems, but this is mostly simple transformation of existing code. But it doesn't require changing shrinker implementations, it doesn't require a new teardown API, and it doesn't change the context under which shrinkers might run. All the existing RCU protected stuff in the shrinker maps and memcgs can remain that way. We can also keep the shrinker rwsem/mutex for all the little internal bits of setup/teardown/non-rcu coordination that are needed; once the list iteration is lockless, there will be almost no contention on that lock at all... This isn't all that hard - it's just replicating a well known design pattern (i.e. refcount+RCU) we use all over the kernel combined with taking advantage of IDR being backed by RCU-safe infrastructure. If I had been cc'd on the original SRCU patches, this is exactly what I would have suggested as a better solution to the problem. We end up with cleaner, more robust and better performing infrastructure. This is far better than layering more complexity on top of what is really a poor foundation.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx