On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> I could simply use the same general approach that I use within RCU
> itself, which currently has absolutely no idea how much memory (if any)
> that each callback will free.  Especially given that some callbacks
> free groups of memory blocks, while others free nothing.  ;-)
>
> Alternatively, we could gather statistics on the amount of memory freed
> by each callback and use that as an estimate.
>
> But we should instead step back and ask exactly what we are trying to
> accomplish here, which just might be what Dave Chinner was getting at.
>
> At a ridiculously high level, reclaim is looking for memory to free.
> Some read-only memory can often be dropped immediately on the grounds
> that its data can be read back in if needed.  Other memory can only be
> dropped after being written out, which involves a delay.  There are of
> course many other complications, but this will do for a start.
>
> So, where does RCU fit in?
>
> RCU fits in between the two.  With memory awaiting RCU, there is no need
> to write anything out, but there is a delay.  As such, memory waiting
> for an RCU grace period is similar to memory that is to be reclaimed
> after its I/O completes.
>
> One complication, and a complication that we are considering exploiting,
> is that, unlike reclaimable memory waiting for I/O, we could often
> (but not always) have some control over how quickly RCU's grace periods
> complete.  And we already do this programmatically by using the choice
> between synchronize_rcu() and synchronize_rcu_expedited().  The question
> is whether we should expedite normal RCU grace periods during reclaim,
> and if so, under what conditions.
>
> You identified one potential condition, namely the amount of memory
> waiting to be reclaimed.  One complication with this approach is that RCU
> has no idea how much memory each callback represents, and for call_rcu(),
> there is no way for it to find out.
> For kfree_rcu(), there are ways,
> but as you know, I am questioning whether those ways are reasonable from
> a performance perspective.  But even if they are, we would be accepting
> more error from the memory waiting via call_rcu() than we would be
> accepting if we just counted blocks instead of bytes for kfree_rcu().

You're _way_ overcomplicating this.

The relevant thing to consider is the relative cost of __ksize() and
kfree_rcu(). __ksize() is already pretty cheap, and with the SLAB
allocator gone and space available in struct slab we can get it down to
a single load.

> Let me reiterate that: The estimation error that you are objecting to
> for kfree_rcu() is completely and utterly unavoidable for call_rcu().

Hardly. Callsites that free memory themselves after an RCU grace period
can do the accounting manually, if they're hot enough to matter; most
aren't.

And with memory allocation profiling coming, which also tracks the
number of allocations, we'll also have an easy way to spot those.
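
To make the manual accounting concrete, here is a rough userspace sketch,
not kernel code. Everything in it is hypothetical and exists only for
illustration: call_rcu_mock() and run_grace_period_mock() stand in for
call_rcu() and grace-period completion, and rcu_pending_bytes is an
assumed counter that reclaim could consult. The callsite adds the object's
size before queueing the callback, and the callback subtracts it just
before freeing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Mock of struct rcu_head; hypothetical, userspace only. */
struct rcu_head_mock {
	struct rcu_head_mock *next;
	void (*func)(struct rcu_head_mock *head);
};

/* Hypothetical counter: bytes waiting to be freed after a grace period. */
static atomic_size_t rcu_pending_bytes;

/* Callbacks queued since the last mock grace period. */
static struct rcu_head_mock *pending_list;

/* Stand-in for call_rcu(): just queue the callback. */
static void call_rcu_mock(struct rcu_head_mock *head,
			  void (*func)(struct rcu_head_mock *head))
{
	head->func = func;
	head->next = pending_list;
	pending_list = head;
}

/* Stand-in for a grace period ending: invoke every queued callback. */
static void run_grace_period_mock(void)
{
	struct rcu_head_mock *head = pending_list;

	pending_list = NULL;
	while (head) {
		struct rcu_head_mock *next = head->next;

		head->func(head);
		head = next;
	}
}

/* Example object that owns a second allocation, so ksize-style
 * accounting of the object alone would undercount. */
struct foo {
	struct rcu_head_mock rcu;
	size_t buf_len;
	char *buf;
};

static size_t foo_bytes(const struct foo *f)
{
	return sizeof(*f) + f->buf_len;
}

static struct foo *foo_alloc(size_t buf_len)
{
	struct foo *f = malloc(sizeof(*f));

	assert(f);
	f->buf = malloc(buf_len);
	assert(f->buf || buf_len == 0);
	f->buf_len = buf_len;
	return f;
}

/* RCU callback: drop the accounting, then free both allocations. */
static void foo_free_cb(struct rcu_head_mock *head)
{
	struct foo *f = (struct foo *)((char *)head - offsetof(struct foo, rcu));

	atomic_fetch_sub(&rcu_pending_bytes, foo_bytes(f));
	free(f->buf);
	free(f);
}

/* Callsite: account for the memory, then defer the free. */
static void foo_free_deferred(struct foo *f)
{
	atomic_fetch_add(&rcu_pending_bytes, foo_bytes(f));
	call_rcu_mock(&f->rcu, foo_free_cb);
}
```

Only the callsite knows that a struct foo drags a buffer along with it,
which is why the accounting lives there and not in the RCU core.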