On Mon, Aug 19, 2024 at 07:59:34PM -0400, Kent Overstreet wrote:
> On Mon, Aug 19, 2024 at 03:58:26PM GMT, Paul E. McKenney wrote:
> > I still remain quite skeptical of this one, but it has improved since
> > June's version.
> > 
> > Responses inline.
> > 
> > 							Thanx, Paul
> > 
> > In the meantime, thank you for avoiding reaching into RCU's and SRCU's
> > innards.  This makes it reasonable to have you put this file into your
> > code, at least initially.  That way, you get what you want *now* and us
> > RCU guys are not committing to maintain it before it is ready for us to
> > do so.
> 
> The RCU interfaces do force a lot of function calls for things that
> should be inlined though, and the gratuitous interface fragmentation
> between RCU and SRCU is... annoying.

I have gotten requests for a variant of SRCU that dispenses with the
read-side memory barriers, which would allow you to dispense with RCU.
Maybe an srcu_read_lock_lite() and srcu_read_unlock_lite(), similar to
the _nmisafe() variants.

There is always a price, and in this case the price would be that
srcu_read_lock_lite() and srcu_read_unlock_lite() be restricted to code
regions where RCU is watching.  But from what I know, fs code must
already abide by that restriction.

Would that help?

> > I still have strong misgivings about radix trees and cache locality.
> > Poor cache locality on the other side of the grace period can result
> > in OOMs during callback-flooding events, hence the move of kfree_rcu()
> > and friends to pages of pointers.  And, as you note below, radix trees
> > don't merge nicely.
> 
> Cache locality where?
> 
> On the enqueue side, which is the fastpath, this uses a cursor - we're
> not walking the radix tree every time.
> 
> On the processing side, where we're kfree_bulk()ing entire radix tree
> nodes, the radix tree is going to have better cache locality than a list
> of pages.

Interesting.  What testing or analysis did you do in the course of
arriving at this conclusion?  In the meantime, I will assume that the
analysis involves your radix-tree nodes being more than one page in
size.

> > The advantage of synchronize_rcu() is that it can proceed without
> > memory allocation.  If you block on allocation, even of a 16-byte
> > rcu_head structure, you can go into OOM-induced deadlock.
> > 
> > It might well make sense to do an rcu_head allocation with GFP flags
> > that try reasonably hard, but still allow failure before falling all
> > the way back to synchronize_rcu().  (And for all I know, this might have
> > been tested and found wanting, but Uladzislau Rezki (CCed) would know.)
> > 
> > But a hard wait on that allocation is asking for trouble.
> 
> That's reasonable - as long as we're trying the 16-byte allocation
> before doing a synchronize_rcu().

It might or might not be reasonable, depending on what is going on in
the rest of the system.  The current kfree_rcu() code can run the
system out of full pages, but it won't impede other code allocating
smaller blocks of memory.  We could easily change it to allocate
individual rcu_head structures, but doing so would not necessarily be a
win in OOM conditions, again, depending on what else is going on.

We are very likely to add this, but also very likely to have a "chicken
switch" to allow the system administrator to control it.

But I freely concede that if your radix tree is using multi-page nodes,
then your code might well have greater need to allocate individual
rcu_head structures than does the current kfree_rcu() implementation.
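
For concreteness, the pattern I have in mind is roughly the sketch
below.  This is untested and purely illustrative (struct deferred_free
and deferred_free() are invented names, not existing kernel APIs), but
it shows the small allocation being allowed to fail, with
synchronize_rcu() as the last resort rather than the first:

#include <linux/rcupdate.h>
#include <linux/slab.h>

/*
 * Untested, illustrative sketch only; not the in-tree kfree_rcu()
 * implementation.  Structure and function names are invented.
 */
struct deferred_free {
	struct rcu_head rh;
	void *ptr;
};

static void deferred_free_cb(struct rcu_head *rhp)
{
	struct deferred_free *df = container_of(rhp, struct deferred_free, rh);

	kfree(df->ptr);
	kfree(df);
}

static void deferred_free(void *ptr)
{
	struct deferred_free *df;

	/*
	 * Try the small allocation, but never wait hard for it.  A caller
	 * that can sleep might instead use GFP_KERNEL | __GFP_NORETRY.
	 */
	df = kmalloc(sizeof(*df), GFP_NOWAIT | __GFP_NOWARN);
	if (df) {
		df->ptr = ptr;
		call_rcu(&df->rh, deferred_free_cb);
		return;
	}

	/* Last resort: block for a grace period, then free directly. */
	synchronize_rcu();
	kfree(ptr);
}

Exactly how hard that allocation should try is a separate policy
question, hence the chicken switch mentioned above.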
> > There is work underway to make the current callback lists take advantage
> > of expedited grace periods, transparent to the caller.  This allows
> > the shrinker (or whatever) to speed up everything by simply invoking
> > synchronize_rcu_expedited().  This speedup includes callback processing
> > because it avoids "bubbles" in the callback processing that can occur
> > when the next grace period has not yet completed, but all callbacks from
> > earlier grace periods have been invoked.
> 
> Glad to hear, that was first on my list when adding a shrinker to this
> code.

Glad you approve.  This has been on my list for quite some time, and we
now have things in place to make it easy.  Well, easier, anyway.

> > > - RCU_PENDING_CALL_RCU_FN
> > > 
> > >   Accelerated backend for call_rcu() - pending callbacks are tracked in
> > >   a radix tree to eliminate linked list overhead.
> > 
> > But to add radix-tree overhead.  And to eliminate the ability to do O(1)
> > list merges.  Is this really a win?
> 
> As mentioned, there's a cursor so we're not adding radix-tree overhead
> to the fast path, and there's no need to merge lists for expired objects
> - that's all handled fine.

I took a quick look at __rcu_pending_enqueue() and my eyes are still
bleeding.  Concatenating linked lists of pages is way simpler, way
faster, and way more robust.

> But yes, using it for call_rcu() would need more performance
> justification.  I haven't seen workloads where call_rcu() performance
> particularly matters, so it's not something I'm pushing for - I included
> that because it's minimal code and other people might know of workloads
> where we do want it.

Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback
functions usually access the memory block passed to them, which means
that they are incurring that per-element cache miss in any case.

> > > Ideally we would be getting a callback every time a grace period
> > > completes for which we have objects, but that would require multiple
> > > rcu_heads in flight, and since the number of gp sequence numbers with
> > > uncompleted callbacks is not bounded, we can't do that yet.
> > 
> > Doesn't the call from __rcu_pending_enqueue() to process_finished_items()
> > serve this purpose?  After all, the case that causes trouble is the one
> > where lots of things are being very frequently queued.
> 
> No, because that's unpredictable, and we don't process pending items in
> enqueue() unless we know we can sleep (i.e., we don't do it if the
> caller is latency sensitive).

It is possible to do this in an extremely lightweight fashion in the
common case.
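For example (an untested sketch, with invented structure and field
names rather than anything from your patch), the polled grace-period
API lets the enqueue path detect an already-elapsed grace period with a
single sequence-number comparison and punt the actual processing
elsewhere, so the caller never sleeps:

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>

/* Hypothetical batch of objects all waiting on the same grace period. */
struct pending_batch {
	unsigned long gp_cookie;	/* from start_poll_synchronize_rcu() */
	struct list_head objs;
	struct work_struct work;	/* processes the expired objects */
};

/* Cheap, non-sleeping check suitable for a latency-sensitive enqueue path. */
static void pending_batch_maybe_kick(struct pending_batch *b)
{
	if (!list_empty(&b->objs) && poll_state_synchronize_rcu(b->gp_cookie))
		queue_work(system_wq, &b->work);
}

Whether that pays for itself depends on how common the common case
really is, of course.

							Thanx, Paul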