On Mon, Aug 19, 2024 at 03:58:26PM GMT, Paul E. McKenney wrote: > I still remain quite skeptical of this one, but it has improved since > June's version. > > Responses inline. > > Thanx, Paul > > In the meantime, thank you for avoiding reaching into RCU's and SRCU's > innards. This makes it reasonable to have you put this file into your > code, at least initially. That way, you get what you want *now* and us > RCU guys are not committing to maintain it before it is ready for us to > do so. The RCU interfaces do force a lot of function calls for things that should be inlined though, and the gratuitious interface fragmentation between RCU and SRCU is... annoying. > I still have strong misgivings about radix trees and cache locality. > Poor cache locality on the other side of the grace period can result > in OOMs during callback-flooding events, hence the move of kfree_rcu() > and friends to pages of pointers. And, as you note below, radix trees > don't merge nicely. Cache locality where? On the enqueue side, which is the fastpath, this uses a cursor - we're not walking the radix tree every time. On the processing side, where we're kfree_bulk()ing entire radix tree nodes, the radix tree is going to have better cache locality than a list of pages. > The advantage of synchronize_rcu() is that it can proceed without > memory allocation. If you block on allocation, even of a 16-byte > rcu_head structure, you can go into OOM-induced deadlock. > > It might well make sense to do an rcu_head allocation with GFP flags > that try reasonably hard, but still allow failure before falling all > the way back to synchronize_rcu(). (And for all I know, this might have > been tested and found wanting, but Uladzislau Rezki (CCed) would know.) > But a hard wait on that allocation is asking for trouble. That's reasonable - as long as we're trying the 16 byte allocation before doing a synchronize_rcu(). > There is work underway to make the current callback lists take advantage > of expedited grace periods, transparent to the caller. This allows > the shrinker (or whatever) to speed up everything by simply invoking > synchronize_rcu_expedited(). This speedup includes callback processing > because it avoids "bubbles" in the callback processing that can occur > when the next grace period has not yet completed, but all callbacks from > earlier grace periods have been invoked. Glad to here, that was first on my list when adding a shrinker to this code. > > - RCU_PENDING_CALL_RCU_FN > > > > Accelerated backend for call_rcu() - pending callbacks are tracked in > > a radix tree to eliminate linked list overhead. > > But to add radix-tree overhead. And to eliminate the ability to do O(1) > list merges. Is this really a win? As mentioned, there's a cursor so we're not adding radix-tree overhead to the fast path, and there's no need to merge lists for expired objects - that's all handled fine. But yes, using it for call_rcu() would need more performance justification. I haven't seen workloads where call_rcu() performance particularly matters, so it's not something I'm pushing for - I included that because it's minimal code and other people might know of workloads where we do want it. > > Ideally we would be getting a callback every time a grace period > > completes for which we have objects, but that would require multiple > > rcu_heads in flight, and since the number of gp sequence numbers with > > uncompleted callbacks is not bounded, we can't do that yet. > > Doesn't the call from __rcu_pending_enqueue() to process_finished_items() > serve this purpose? After all, the case that causes trouble is the one > where lots of things are being very frequently queued. No, because that's unpredictable, and we don't process pending items in enqueue() unless we know we can sleep (i.e., we don't do it if the caller is latency sensitive).