Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback

Vlastimil Babka <vbabka@xxxxxxx> · Wed, 19 Jun 2024 11:28:13 +0200

On 6/18/24 7:53 PM, Paul E. McKenney wrote:
> On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote:
>> On 6/18/24 6:48 PM, Paul E. McKenney wrote:
>> > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote:
>> >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote:
>> >> > >> +
>> >> > >> +	s = container_of(work, struct kmem_cache, async_destroy_work);
>> >> > >> +
>> >> > >> +	// XXX use the real kmem_cache_free_barrier() or similar thing here
>> >> > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i
>> >> > > wanted to avoid initially.
>> >> > 
>> >> > I wanted to avoid new API or flags for kfree_rcu() users and this would
>> >> > be achieved. The barrier is used internally so I don't consider that an
>> >> > API to avoid. How difficult is the implementation is another question,
>> >> > depending on how the current batching works. Once (if) we have sheaves
>> >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might
>> >> > also look different and hopefully easier. So maybe it's not worth to
>> >> > invest too much into that barrier and just go for the potentially
>> >> > longer, but easier to implement?
>> >> > 
>> >> Right. I agree here. If the cache is not empty, OK, we just defer the
>> >> work, even we can use a big 21 seconds delay, after that we just "warn"
>> >> if it is still not empty and leave it as it is, i.e. emit a warning and
>> >> we are done.
>> >> 
>> >> Destroying the cache is not something that must happen right away. 
>> > 
>> > OK, I have to ask...
>> > 
>> > Suppose that the cache is created and destroyed by a module and
>> > init/cleanup time, respectively.  Suppose that this module is rmmod'ed
>> > then very quickly insmod'ed.
>> > 
>> > Do we need to fail the insmod if the kmem_cache has not yet been fully
>> > cleaned up?
>> 
>> We don't have any such link between kmem_cache and module to detect that, so
>> we would have to start tracking that. Probably not worth the trouble.
> 
> Fair enough!
> 
>> >  If not, do we have two versions of the same kmem_cache in
>> > /proc during the overlap time?
>> 
>> Hm could happen in /proc/slabinfo but without being harmful other than
>> perhaps confusing someone. We could filter out the caches being destroyed
>> trivially.
> 
> Or mark them in /proc/slabinfo?  Yet another column, yay!!!  Or script
> breakage from flagging the name somehow, for example, trailing "/"
> character.

Yeah I've been resisting such changes to the layout and this wouldn't be
worth it, apart from changing the name itself but not in a dangerous way
like with "/" :)

>> Sysfs and debugfs might be more problematic as I suppose directory names
>> would clash. I'll have to check... might be even happening now when we do
>> detect leaked objects and just leave the cache around... thanks for the
>> question.
> 
> "It is a service that I provide."  ;-)
> 
> But yes, we might be living with it already and there might already
> be ways people deal with it.

So it seems if the sysfs/debugfs directories already exist, they will
silently not be created. Wonder if we have such cases today already because
caches with same name exist. I think we do with the zsmalloc using 32 caches
with same name that we discussed elsewhere just recently.

Also indeed if the cache has leaked objects and won't be thus destroyed,
these directories indeed stay around, as well as the slabinfo entry, and can
prevent new ones from being created (slabinfo lines with same name are not
prevented).

But it wouldn't be great to introduce this possibility to happen for the
temporarily delayed removal due to kfree_rcu() and a module re-insert, since
that's a legitimate case and not buggy state due to leaks.

The debugfs directory we could remove immediately before handing over to the
scheduled workfn, but if it turns out there was a leak and the workfn leaves
the cache around, debugfs dir will be gone and we can't check the
alloc_traces/free_traces files there (but we have the per-object info
including the traces in the dmesg splat).

The sysfs directory is currently removed only with the whole cache being
destryed due to sysfs/kobject lifetime model. I'd love to untangle it for
other reasons too, but haven't investigated it yet. But again it might be
useful for sysfs dir to stay around for inspection, as for the debugfs.

We could rename the sysfs/debugfs directories before queuing the work? Add
some prefix like GOING_AWAY-$name. If leak is detected and cache stays
forever, another rename to LEAKED-$name. (and same for the slabinfo). But
multiple ones with same name might pile up, so try adding a counter then?
Probably messy to implement, but perhaps the most robust in the end? The
automatic counter could also solve the general case of people using same
name for multiple caches.

Other ideas?

Thanks,
Vlastimil

> 
> 							Thanx, Paul
> 
>> >> > > Since you do it asynchronous can we just repeat
>> >> > > and wait until it a cache is furry freed?
>> >> > 
>> >> > The problem is we want to detect the cases when it's not fully freed
>> >> > because there was an actual read. So at some point we'd need to stop the
>> >> > repeats because we know there can no longer be any kfree_rcu()'s in
>> >> > flight since the kmem_cache_destroy() was called.
>> >> > 
>> >> Agree. As noted above, we can go with 21 seconds(as an example) interval
>> >> and just perform destroy(without repeating).
>> >> 
>> >> --
>> >> Uladzislau Rezki
>>