On Thu, Jun 13, 2024 at 02:31:53AM +0200, Jason A. Donenfeld wrote:
> On Thu, Jun 13, 2024 at 01:31:57AM +0200, Jason A. Donenfeld wrote:
> > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > > > > Since SLOB was removed, it is not necessary to use call_rcu
> > > > > when the callback only performs kmem_cache_free. Use
> > > > > kfree_rcu() directly.
> > > > >
> > > > > The changes were done using the following Coccinelle semantic patch.
> > > > > This semantic patch is designed to ignore cases where the callback
> > > > > function is used in another way.
> > > >
> > > > How does the discussion on:
> > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
> > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing@xxxxxxxxx/
> > > > reflect on this series? IIUC we should hold off..
> > >
> > > We do need to hold off for the ones in kernel modules (such as 07/14)
> > > where the kmem_cache is destroyed during module unload.
> > >
> > > OK, I might as well go through them...
> > >
> > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> > > Needs to wait, see wg_allowedips_slab_uninit().
> >
> > Right, this has exactly the same pattern as the batman-adv issue:
> >
> > void wg_allowedips_slab_uninit(void)
> > {
> >         rcu_barrier();
> >         kmem_cache_destroy(node_cache);
> > }
> >
> > I'll hold off on sending that up until this matter is resolved.
>
> BTW, I think this whole thing might be caused by:
>
>     a35d16905efc ("rcu: Add basic support for kfree_rcu() batching")
>
> The commit message there mentions:
>
>     There is an implication with rcu_barrier() with this patch. Since the
>     kfree_rcu() calls can be batched, and may not be handed yet to the RCU
>     machinery in fact, the monitor may not have even run yet to do the
>     queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
>     to wait for those kfree_rcu()s that are already made. So this means a
>     kfree_rcu() followed by an rcu_barrier() does not imply that memory will
>     be freed once rcu_barrier() returns.
>
> Before that, a kfree_rcu() used to just add a normal call_rcu() into the
> list, but with the function offset < 4096 as a special marker. So the
> kfree_rcu() calls would be treated alongside the other call_rcu() ones
> and thus affected by rcu_barrier(). Looks like that behavior is no more
> since this commit.

You might well be right, and thank you for digging into this!

> Rather than getting rid of the batching, which seems good for
> efficiency, I wonder if the right fix to this would be adding a
> `should_destroy` boolean to kmem_cache, which kmem_cache_destroy() sets
> to true. And then right after it checks `if (number_of_allocations == 0)
> actually_destroy()`, and likewise on each kmem_cache_free(), it could
> check `if (should_destroy && number_of_allocations == 0)
> actually_destroy()`. This way, the work is delayed until it's safe to do
> so. This might also mitigate other lurking bugs of bad code that calls
> kmem_cache_destroy() before kmem_cache_free().

Here are the current options being considered, including those that
are completely brain-dead:

o       Document current state.  (Must use call_rcu() if module
        destroys slab of RCU-protected objects.)

        Need to review Julia's and Uladzislau's series of patches
        that change call_rcu() of slab objects to kfree_rcu().

o       Make rcu_barrier() wait for kfree_rcu() objects.  (This is
        surprisingly complex and will wait unnecessarily in some cases.
        However, it does preserve current code.)

o       Make a kfree_rcu_barrier() that waits for kfree_rcu() objects.
        (This avoids the unnecessary waits, but adds complexity to
        kfree_rcu().  This is harder than it looks, but could be done,
        for example by maintaining pairs of per-CPU counters and handling
        them in an SRCU-like fashion.  Need some way of communicating the
        index, though.)

        (There might be use cases where both rcu_barrier() and
        kfree_rcu_barrier() would need to be invoked.)

        A simpler way to implement this is to scan all of the in-flight
        objects, and queue each (either separately or in bulk) using
        call_rcu().  This still has problems with kfree_rcu_mightsleep()
        under low-memory conditions, in which case there are a bunch
        of synchronize_rcu() instances waiting.  These instances could
        use SRCU-like per-CPU arrays of counters.  Or just protect the
        calls to synchronize_rcu() and the later frees with an SRCU
        reader, then have the other end call synchronize_srcu().

o       Make the current kmem_cache_destroy() asynchronously wait for
        all memory to be returned, then complete the destruction.
        (This gets rid of a valuable debugging technique because
        in normal use, it is a bug to attempt to destroy a kmem_cache
        that has objects still allocated.)

o       Make a kmem_cache_destroy_rcu() that asynchronously waits for
        all memory to be returned, then completes the destruction.
        (This raises the question of what to do if it takes a "long time"
        for the objects to be freed.)

o       Make a kmem_cache_free_barrier() that blocks until all
        objects in the specified kmem_cache have been freed.

o       Make a kmem_cache_destroy_wait() that waits for all memory to
        be returned, then does the destruction.  This is equivalent to:

                kmem_cache_free_barrier(&mycache);
                kmem_cache_destroy(&mycache);

Uladzislau has started discussions on the last few of these:
https://lore.kernel.org/all/ZmnL4jkhJLIW924W@pc636/

I have also added this information to a Google Document for
easier tracking:
https://docs.google.com/document/d/1v0rcZLvvjVGejT3523W0rDy_sLFu2LWc_NR3fQItZaA/edit?usp=sharing

Other thoughts?

                                                        Thanx, Paul
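
For concreteness, here is a rough userspace model of the deferred-destroy
idea quoted above (the `should_destroy` flag, which is also roughly what
the asynchronous kmem_cache_destroy() option in the list would have to do).
This is only a sketch under loud assumptions: struct toy_cache and every
toy_cache_*() function are hypothetical names invented for illustration,
none of this is kernel API, and it is single-threaded, so it ignores the
locking or atomics that a real slab implementation would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct kmem_cache; not kernel code. */
struct toy_cache {
	size_t obj_size;
	size_t nr_allocated;	/* objects handed out and not yet freed */
	bool should_destroy;	/* set once destruction has been requested */
	bool *destroyed;	/* optional flag so callers can observe teardown */
};

static struct toy_cache *toy_cache_create(size_t obj_size, bool *destroyed)
{
	struct toy_cache *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->obj_size = obj_size;
	c->destroyed = destroyed;
	if (destroyed)
		*destroyed = false;
	return c;
}

/* The real teardown; reached only once no objects remain allocated. */
static void toy_cache_actually_destroy(struct toy_cache *c)
{
	assert(c->nr_allocated == 0);
	if (c->destroyed)
		*c->destroyed = true;
	free(c);
}

static void *toy_cache_alloc(struct toy_cache *c)
{
	void *obj;

	if (c->should_destroy)	/* refuse new objects from a dying cache */
		return NULL;
	obj = malloc(c->obj_size);
	if (obj)
		c->nr_allocated++;
	return obj;
}

/* The last free after a destroy request completes the destruction. */
static void toy_cache_free(struct toy_cache *c, void *obj)
{
	assert(c->nr_allocated > 0);
	free(obj);
	c->nr_allocated--;
	if (c->should_destroy && c->nr_allocated == 0)
		toy_cache_actually_destroy(c);
}

/* Request destruction; tear down now only if nothing is outstanding. */
static void toy_cache_destroy(struct toy_cache *c)
{
	c->should_destroy = true;
	if (c->nr_allocated == 0)
		toy_cache_actually_destroy(c);
}
```

In this model, calling toy_cache_destroy() while two objects are still
outstanding returns immediately, and the cache goes away only when the
second toy_cache_free() runs.  It also shows the downside noted in the
list: a destroy with objects still allocated no longer trips any check,
so the leak-detection value of the current behavior is lost.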