First RFC, feel free to ignore for now if too busy with the merge window.

Also in git: https://git.kernel.org/vbabka/l/slab-kfree_rcu-destroy-v1r0
Based on slab/for-next: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next

Since SLOB was removed, we have allowed kfree_rcu() for objects allocated
from any kmem_cache, in addition to kmalloc(). Recently we have attempted
to replace existing call_rcu() usage with kfree_rcu() where the callback is
a plain kmem_cache_free(), in a series by Julia Lawall [1].

Jakub Kicinski pointed out [2] that this was tried already in batman-adv
but had to be reverted because kmem_cache_destroy() failed due to objects
remaining in the cache, despite rcu_barrier() being used. Jason Donenfeld
found the culprit [3]: commit a35d16905efc ("rcu: Add basic support for
kfree_rcu() batching") made rcu_barrier() insufficient. This was never a
problem for kfree_rcu() usage on kmalloc() objects, as the kmalloc caches
are never destroyed, but arbitrary caches can be, e.g. due to module
unload.

Of the possible solutions collected by Paul McKenney [4], the most
appealing to me is "kmem_cache_destroy() lingers for kfree_rcu()", as it
adds no additional concerns for kfree_rcu() users. We already have a
precedent in some parts of the kmem_cache cleanup being done
asynchronously for SLAB_TYPESAFE_BY_RCU caches.

This series implements the necessary changes to the slab allocator,
mainly: if there are objects remaining in the cache when
kmem_cache_destroy() is called, it is assumed that this is due to pending
kfree_rcu() callbacks, and an asynchronous work item is scheduled that
performs the necessary barrier and then tries again. If objects remain
after the barrier, the usual warning is reported and the cache remains
undestroyed.

Notably, the proper barrier doesn't exist yet, so hopefully Paul or the
RCU team can help here :) It should be fine to make it pessimistic, as it
won't hold up anything but the work item.
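To make the intended flow concrete, here is a toy user-space model of the
deferred destroy described above. This is only a sketch, not code from the
series: all names (toy_cache_destroy() etc.) are invented, the work item
runs synchronously instead of via schedule_work(), and the barrier is
simulated because the real kfree_rcu() barrier does not exist yet.

```c
#include <stdbool.h>
#include <stdio.h>

struct toy_cache {
	const char *name;
	int inuse;		/* objects currently allocated */
	int pending_rcu;	/* of those, queued via kfree_rcu() */
};

/* stand-in for the missing barrier that flushes pending kfree_rcu() batches */
static void toy_kfree_rcu_barrier(struct toy_cache *c)
{
	c->inuse -= c->pending_rcu;
	c->pending_rcu = 0;
}

/* body of the asynchronous work item: run the barrier, then retry destroy */
static bool toy_destroy_work(struct toy_cache *c)
{
	toy_kfree_rcu_barrier(c);
	if (c->inuse) {
		/* objects survived the barrier: a real leak, warn and keep the cache */
		printf("cache %s: %d objects leaked, cache not destroyed\n",
		       c->name, c->inuse);
		return false;
	}
	printf("cache %s destroyed\n", c->name);
	return true;
}

static bool toy_cache_destroy(struct toy_cache *c)
{
	/*
	 * Remaining objects are assumed to be pending kfree_rcu(); the
	 * kernel version would schedule_work() here rather than call the
	 * work body synchronously as this model does.
	 */
	return toy_destroy_work(c);
}
```

A cache whose remaining objects are all pending kfree_rcu() gets destroyed
once the barrier has run; anything still left after the barrier is
reported as a leak and the cache stays around, same as today.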
Some downsides of this approach for debugging exist, but they should be
small enough:

- We can no longer report the stack trace leading to a premature
  kmem_cache_destroy(), but arguably that's not as interesting as the
  allocation traces and other details about the remaining objects, which
  are still reported, just a bit later.

- Objects that are freed after kmem_cache_destroy() but before the work
  item proceeds with the destroy after the barrier are technically bugs,
  but we won't be able to catch them unless we add checks to the freeing
  hotpaths. That's not worth it, IMHO.

There is also a bunch of preliminary steps. The potentially visible one is
that sysfs and debugfs directories are now removed immediately during
kmem_cache_destroy() - previously this would be delayed for
SLAB_TYPESAFE_BY_RCU caches, or the directories would be left around
forever if remaining objects were detected. The extra delay from the
asynchronous destroy due to kfree_rcu() could mean that a module
unload/load cycle creates a new instance of the cache, which would then
fail to create these directories - a concern raised by Paul. The immediate
removal is the simplest solution (compared to e.g. renaming the
directories) and should not make debugging harder - while it won't be
possible to check debugfs for allocation traces of leaked objects, they
are listed with more detail in dmesg anyway.

[1] https://lore.kernel.org/all/20240609082726.32742-1-Julia.Lawall@xxxxxxxx/
[2] https://lore.kernel.org/all/20240612143305.451abf58@xxxxxxxxxx/
[3] https://lore.kernel.org/all/Zmo9-YGraiCj5-MI@xxxxxxxxx/
[4] https://docs.google.com/document/d/1v0rcZLvvjVGejT3523W0rDy_sLFu2LWc_NR3fQItZaA/edit

To: Paul E. McKenney <paulmck@xxxxxxxxxx>
To: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
To: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
To: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
Cc: Zqiang <qiang.zhang1211@xxxxxxxxx>
Cc: Julia Lawall <Julia.Lawall@xxxxxxxx>
Cc: Jakub Kicinski <kuba@xxxxxxxxxx>
Cc: Jason A. Donenfeld <Jason@xxxxxxxxx>
Cc: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
To: Christoph Lameter <cl@xxxxxxxxx>
To: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
Cc: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
Cc: linux-mm@xxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: rcu@xxxxxxxxxxxxxxx
Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
---
Vlastimil Babka (6):
      mm, slab: make caches with refcount of 0 unmergeable
      mm, slab: always maintain per-node slab and object count
      mm, slab: unlink sysfs and debugfs immediately
      mm, slab: simplify kmem_cache_release()
      mm, slab: asynchronously destroy caches with outstanding objects
      kunit, slub: add test_kfree_rcu()

 lib/slub_kunit.c |  22 ++++++++++
 mm/slab.h        |   4 +-
 mm/slab_common.c | 121 +++++++++++++++++++++++++++++++++++--------------------
 mm/slub.c        |  58 ++++++++++++--------------
 4 files changed, 129 insertions(+), 76 deletions(-)
---
base-commit: 436381eaf2a423e60fc8340399f7d2458091b383
change-id: 20240715-b4-slab-kfree_rcu-destroy-85dd2b2ded92

Best regards,
--
Vlastimil Babka <vbabka@xxxxxxx>