Re: [PATCH 6/9] rcu: rcu_pending

On Mon, Aug 19, 2024 at 07:59:34PM -0400, Kent Overstreet wrote:
> On Mon, Aug 19, 2024 at 03:58:26PM GMT, Paul E. McKenney wrote:
> > I still remain quite skeptical of this one, but it has improved since
> > June's version.
> > 
> > Responses inline.
> > 
> > 							Thanx, Paul
> > 
> > In the meantime, thank you for avoiding reaching into RCU's and SRCU's
> > innards.  This makes it reasonable to have you put this file into your
> > code, at least initially.  That way, you get what you want *now* and us
> > RCU guys are not committing to maintain it before it is ready for us to
> > do so.
> 
> The RCU interfaces do force a lot of function calls for things that
> should be inlined though, and the gratuitous interface fragmentation
> between RCU and SRCU is... annoying.

I have gotten requests for a variant of SRCU that dispenses with the
read-side memory barriers, which would allow you to dispense with RCU.
Maybe an srcu_read_lock_lite() and srcu_read_unlock_lite(), similar to
the _nmisafe() variants.  There is always a price, and in this case the
price would be that srcu_read_lock_lite() and srcu_read_unlock_lite()
be restricted to code regions where RCU is watching.  But from what I
know, fs code must already abide by that restriction.

Would that help?
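
To make that concrete, a read-side critical section under such a
variant might look like the following sketch (srcu_read_lock_lite()
and srcu_read_unlock_lite() are the names proposed above, not an
existing API; my_srcu, gp_ptr, and do_something_with() are made up
for illustration):

	DEFINE_SRCU(my_srcu);			/* illustrative srcu_struct */
	struct foo __rcu *gp_ptr;		/* illustrative shared pointer */

	static void reader(void)
	{
		struct foo *p;
		int idx;

		/* Proposed _lite variant: drops the read-side smp_mb(),
		 * but is legal only where RCU is watching. */
		idx = srcu_read_lock_lite(&my_srcu);
		p = srcu_dereference(gp_ptr, &my_srcu);
		if (p)
			do_something_with(p);
		srcu_read_unlock_lite(&my_srcu, idx);
	}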

> > I still have strong misgivings about radix trees and cache locality.
> > Poor cache locality on the other side of the grace period can result
> > in OOMs during callback-flooding events, hence the move of kfree_rcu()
> > and friends to pages of pointers.  And, as you note below, radix trees
> > don't merge nicely.
> 
> Cache locality where?
> 
> On the enqueue side, which is the fastpath, this uses a cursor - we're
> not walking the radix tree every time.
> 
> On the processing side, where we're kfree_bulk()ing entire radix tree
> nodes, the radix tree is going to have better cache locality than a list
> of pages.

Interesting.  What testing or analysis did you do in the course of
arriving at this conclusion?  In the meantime I will assume that the
analysis involves your radix-tree nodes being more than one page in size.
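
For reference, the enqueue fastpath being described amounts to
something like this sketch (the struct layout and field names are
illustrative, not the actual bcachefs code):

	struct rcu_pending_seq {
		void	**cursor;	/* next free slot in current node */
		void	**cursor_end;	/* one past that node's last slot */
		/* ... genradix of objects for this GP sequence ... */
	};

	static bool pending_enqueue_fast(struct rcu_pending_seq *p, void *obj)
	{
		/* Common case: bump the cursor, no radix-tree walk. */
		if (p->cursor < p->cursor_end) {
			*p->cursor++ = obj;
			return true;
		}
		return false;	/* slowpath: extend the radix tree */
	}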

> > The advantage of synchronize_rcu() is that it can proceed without
> > memory allocation.  If you block on allocation, even of a 16-byte
> > rcu_head structure, you can go into OOM-induced deadlock.
> > 
> > It might well make sense to do an rcu_head allocation with GFP flags
> > that try reasonably hard, but still allow failure before falling all
> > the way back to synchronize_rcu().  (And for all I know, this might have
> > been tested and found wanting, but Uladzislau Rezki (CCed) would know.)
> > But a hard wait on that allocation is asking for trouble.
> 
> That's reasonable - as long as we're trying the 16 byte allocation
> before doing a synchronize_rcu().

It might or might not be reasonable, depending on what is going on in the
rest of the system.  The current kfree_rcu() code can run the system out
of full pages, but it won't impede other code allocating smaller blocks
of memory.  We could easily change it to allocate individual rcu_head
structures, but doing so would not necessarily be a win in OOM conditions,
again, depending on what else is going on.

We are very likely to add this, but also very likely to have a "chicken
switch" to allow the system administrator to control it.

But I freely concede that if your radix tree is using multi-page nodes,
then your code might well have greater need to allocate individual
rcu_head structures than does the current kfree_rcu() implementation.
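
In code, the pattern under discussion is roughly the following
(a sketch, assuming a sleepable context; the wrapper type and
callback are made up for illustration):

	struct free_deferred {
		struct rcu_head	rcu;
		void		*obj;
	};

	static void free_deferred_cb(struct rcu_head *rcu)
	{
		struct free_deferred *w =
			container_of(rcu, struct free_deferred, rcu);

		kfree(w->obj);
		kfree(w);
	}

	static void free_obj_rcu(void *obj)
	{
		/* Try reasonably hard, but allow failure rather than
		 * blocking on reclaim. */
		struct free_deferred *w =
			kmalloc(sizeof(*w), GFP_NOWAIT | __GFP_NOWARN);

		if (w) {
			w->obj = obj;
			call_rcu(&w->rcu, free_deferred_cb);
		} else {
			/* Allocation-free fallback: just wait. */
			synchronize_rcu();
			kfree(obj);
		}
	}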

> > There is work underway to make the current callback lists take advantage
> > of expedited grace periods, transparent to the caller.  This allows
> > the shrinker (or whatever) to speed up everything by simply invoking
> > synchronize_rcu_expedited().  This speedup includes callback processing
> > because it avoids "bubbles" in the callback processing that can occur
> > when the next grace period has not yet completed, but all callbacks from
> > earlier grace periods have been invoked.
> 
> Glad to hear, that was first on my list when adding a shrinker to this
> code.

Glad you approve.

This has been on my list for quite some time, and we now have things in
place to make it easy.  Well, easier, anyway.
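
From the caller's side, the shrinker hook then needs little more than
this (a sketch; the shrinker wiring and rcu_pending_reclaim() are
illustrative):

	static unsigned long rcu_pending_shrink_scan(struct shrinker *shrink,
						     struct shrink_control *sc)
	{
		/* Speeds up both the grace period and, with the work
		 * described above, the subsequent callback processing. */
		synchronize_rcu_expedited();

		return rcu_pending_reclaim(sc->nr_to_scan);
	}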

> > > - RCU_PENDING_CALL_RCU_FN
> > > 
> > >   Accelerated backend for call_rcu() - pending callbacks are tracked in
> > >   a radix tree to eliminate linked list overhead.
> > 
> > But to add radix-tree overhead.  And to eliminate the ability to do O(1)
> > list merges.  Is this really a win?
> 
> As mentioned, there's a cursor so we're not adding radix-tree overhead
> to the fast path, and there's no need to merge lists for expired objects
> - that's all handled fine.

I took a quick look at __rcu_pending_enqueue() and my eyes are still
bleeding.  Concatenating linked lists of pages is way simpler, way faster,
and way more robust.
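
For contrast, page lists of the sort used by the current kfree_rcu()
machinery merge in O(1), independent of how many objects are queued
(the types here are illustrative):

	struct ptr_page {
		struct ptr_page	*next;
		unsigned int	nr;
		void		*ptrs[];
	};

	struct ptr_list {
		struct ptr_page	*head;
		struct ptr_page	**tailp;	/* -> final ->next */
	};

	static void ptr_list_concat(struct ptr_list *dst, struct ptr_list *src)
	{
		/* O(1): splice, no per-element work. */
		*dst->tailp = src->head;
		if (src->head)
			dst->tailp = src->tailp;
		src->head = NULL;
		src->tailp = &src->head;
	}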

> But yes, using it for call_rcu() would need more performance
> justification. I haven't seen workloads where call_rcu() performance
> particularly matters, so it's not something I'm pushing for - I included
> that because it's minimal code and other people might know of workloads
> where we do want it.

Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback
functions usually access the memory block passed to them, which means
that they are incurring that per-element cache miss in any case.

> > > Ideally we would be getting a callback every time a grace period
> > > completes for which we have objects, but that would require multiple
> > > rcu_heads in flight, and since the number of gp sequence numbers with
> > > uncompleted callbacks is not bounded, we can't do that yet.
> > 
> > Doesn't the call from __rcu_pending_enqueue() to process_finished_items()
> > serve this purpose?  After all, the case that causes trouble is the one
> > where lots of things are being very frequently queued.
> 
> No, because that's unpredictable, and we don't process pending items in
> enqueue() unless we know we can sleep (i.e., we don't do it if the
> caller is latency-sensitive).

It is possible to do this in an extremely lightweight fashion in the
common case.
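
One existing mechanism along these lines is the polled grace-period
API, which turns the check into a sequence-number comparison rather
than a blocking wait (a sketch, not the patch's code):

	unsigned long gp_seq;

	/* At enqueue time: snapshot (and start, if needed) a GP. */
	gp_seq = start_poll_synchronize_rcu();

	/* Later, on a fastpath: cheap, non-blocking, no rcu_head. */
	if (poll_state_synchronize_rcu(gp_seq))
		process_finished_items();	/* per the patch, args omitted */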

							Thanx, Paul



