> > OK, this does appear like each group of objs gets its own return value from
> > either __start_poll_synchronize_rcu() or __get_state_synchronize_rcu().  If
> > so, much improved!
> 
> To be honest I still prefer the old way.
> 
> This scheme requires multiple loops over the array, which leads to the
> kvfree_call_rcu() based on this being quite branchy; the generated code
> could be a fair bit slimmer with the old scheme.

The time that this matters is when someone is invoking this in a tight
loop, in which case the branch predictor will do quite well, given that
grace periods take some time to complete.

> > The purpose of this rcu_head is as before, to flag good times to invoke
> > poll_state_synchronize_rcu()?
> 
> Correct
> 
> > I need to check what the caller does with it, of course.  And one caller
> > appears to invoke it from a workqueue handler, the other from unknown
> > context.  But you said earlier that all were from process context,
> > so that should be fine.  Having the submitters do processing does have
> > nice overload-handling characteristics, though it might be more prone
> > to deadlock.  (As in you cannot acquire a mutex in the callback that is
> > held across a call to the bch2_pending_rcu_item_enqueue() function.)
> 
> That's true for my use in bcachefs, but thanks for pointing that out,
> that does look like a real problem for using this for a
> kvfree_call_rcu() replacement.
> 
> kvfree_call_rcu() can't necessarily process completions in the caller,
> since it can be called from interrupt context; that means an unbounded
> number of grace periods with uncompleted callbacks can build up
> unbounded work.
> 
> And, with these data structures that would be really painful.

Agreed, if you can keep your object queueing in preempt-enabled task
context, life is much easier.  But yes, kfree_rcu() doesn't get that
luxury, nor does call_rcu().

> Which leads me to wonder, how conceivable would it be to put a limit on
> the number of grace periods with uncompleted callbacks? Use of that
> interface would have to be limited to trusted code (because it could
> stall the system!), but I wonder if it might have positive effects on
> overload behaviour.
> 
> Hmm.

For a sufficiently constrained system (and your bcachefs use case might
qualify for all I know), this might work very well.  The problem is that
"trusted" would need to include "not running on a hypervisor" and "not
running with firmware having longer-than-average SMIs".  :-(

> > What provides synchronization for this function?  It looks like it can
> > be invoked from multiple contexts.  Ah, the pending_rcu_items_pcpu
> > structure's ->lock, which is dropped after invoking this but before
> > processing the objects.
> 
> Yes, and in my latest version that's stricter about avoiding preemption,
> the lock is only necessary for dequeue_from_all() - all other operations
> are careful to only work on the CPU-local version now.

Using migration_disable() or similar?

> > If both elements are ready to process, it looks like you only process
> > the first one that you come to.  Why not look at both and return both
> > if both are ready?
> > 
> > A linked list of pages of pointers would make this possible by simply
> > concatenating the lists, right?
> > 
> > Or am I missing something that forces the second element of the array
> > to be checked?
> 
> The elements of the array aren't ordered now (per your scheme). But
> perhaps they still could be.

Well, they are ordered - not in terms of the ->seq value, but in terms
of the index.

Suppose that there has been a huge pile of objects queued up over the past
while, so that both ->objs[] elements are quite full.  Suppose that,
just by chance, the ->objs[0]->seq is the first to have its grace
period complete.  Suppose further that just at that point in time, the
object-queuing rate suddenly decreases, so much so that at least one
full grace period elapses between the queuing of individual objects.

Ah, the loop in pending_rcu_items_work() should take care of things
here, maybe.

> > I might be missing something, but it looks like a linked list of pages of
> > pointers would do better here, especially if there are a lot of callbacks.
> > So what am I missing?
> 
> I'll be improving the genradix iterator code to avoid the radix tree
> walk except when crossing nodes.
> 
> I _much_ prefer using a standard data structure instead of hand rolling,
> and this code is both cleaner and more compact - and faster.

Perhaps there are things we could do to make the pages of pointers better.
Or maybe we should adopt the radix tree, though it still seems like a very
strange fit.  Some careful performance analysis will need to drive this
choice, while never forgetting the "status quo" option.

> > Also, I am not seeing a ->pending() field in the pending_rcu_items_seq
> > structure.  Again, what am I missing?
> 
> Not necessary, since it's pending iff it has items - except you indicate
> a callback, so perhaps you're thinking of something else?

My typo, ->process(), not ->pending().  Apologies for my confusion!

> > OK, processing this with interrupts enabled is a great improvement!
> > 
> > Also doing it in workqueue context is a good thing.  I don't yet have
> > an opinion about how this code handles overload conditions.
> 
> > > +void bch2_pending_rcu_item_enqueue(struct pending_rcu_items *pending, struct rcu_head *obj)
> > 
> > As noted earlier, I am assuming that this is invoked from task context
> > with preemption enabled.
> 
> In my usage, yes. I would prefer to not require that.
> 
> It's a bit unfortunate that we still don't have a way of checking
> (outside of debug code) "are we in sleepable process context?" - that'll
> have to be passed in.

That might arrive with lazy preemption.  It would untangle quite a few
knots in many parts of RCU, that is for sure.

> > It might be worth a comment stating that the ->process() function cannot
> > acquire any mutex held across a call to bch2_pending_rcu_item_enqueue().
> 
> Yes, and a note about it being called from enqueue()

Very good!

> > > +	struct pending_rcu_items_seq finished;
> > > +	while (get_finished_items(pending, p, &finished)) {
> > > +		spin_unlock_irqrestore(&p->lock, flags);
> > > +		process_finished_items(pending, &finished);
> > > +		goto retry;
> > 
> > This loop could be avoided by passing obj in to get_finished_items(),
> > and having that function enqueue obj just after emptying things.  Yes,
> > an infinite loop requires an improbably repeated sequence of preemptions
> > and migrations, but why not eliminate the possibility?
> 
> I've reworked this code somewhat to inline fastpaths and move slowpath
> to separate functions, have a look at the latest when I post it.

Looking forward to it.

> > > +	struct pending_rcu_items_seq *objs;
> > > +
> > > +	unsigned long seq = __get_state_synchronize_rcu(pending->srcu);
> > 
> > Invoking __get_state_synchronize_rcu() before the call to get_finished_items()
> > could save you some latency.
> 
> ??

It is always the same object, and it has the same history being
reader-visible.  So get the cookie as early as possible and keep that
cookie.  Just because you took time out to process some objects does
not mean you need a newer cookie, after all.

My advice is as follows:

1.	Pick up the cookie, whether from get_synchronize_srcu() or
	poll_state_synchronize_rcu().

2.	If that cookie matches one of the ->objs[]->seq, add it to
	that ->objs[].  Null the pointer or whatever to record that
	it is already taken care of.  Or just take an early exit.

	If under heavy load, this will be the common-case code path.

3.	Only then try processing the ->objs[] elements.

4.	Yes, then you need to queue the thing, but you can make
	#2 above be an inline function to avoid duplicating code.
	And you can check to see if the cookie already expired,
	in which case you can just process it immediately.

	You only need to retry in case of memory-allocation failure.

Does that make sense?
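In case it helps, here is a minimal userspace sketch of steps 1-4.  All
names (pending_enqueue() and friends) are illustrative stand-ins rather
than the actual bcachefs code, and the real version of course needs the
locking/migration story discussed above:

```c
#include <assert.h>
#include <stdbool.h>

/* Two ->objs[] buckets, each tagged with a grace-period cookie (->seq).
 * A real implementation would also carry the queued objects themselves. */
struct pending_seq {
	unsigned long	seq;	/* grace-period cookie */
	unsigned int	nr;	/* number of queued objects, 0 == bucket empty */
};

struct pending {
	struct pending_seq objs[2];
};

/* Step 2: add to a bucket whose cookie matches, taking the early exit.
 * Under heavy load this is the common-case path. */
static bool pending_enqueue_fast(struct pending *p, unsigned long cookie)
{
	for (int i = 0; i < 2; i++) {
		if (p->objs[i].nr && p->objs[i].seq == cookie) {
			p->objs[i].nr++;
			return true;
		}
	}
	return false;
}

/* Step 4: slow path -- claim an empty bucket for a new cookie.  This is
 * where the real code would also try processing expired buckets. */
static bool pending_enqueue(struct pending *p, unsigned long cookie)
{
	if (pending_enqueue_fast(p, cookie))
		return true;
	for (int i = 0; i < 2; i++) {
		if (!p->objs[i].nr) {
			p->objs[i].seq = cookie;
			p->objs[i].nr = 1;
			return true;
		}
	}
	return false;	/* both buckets busy with older cookies */
}
```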

> > > +	for (objs = p->objs; objs < p->objs + ARRAY_SIZE(p->objs); objs++)
> > > +		if (objs->nr && objs->seq == seq)
> > 
> > OK, the purpose of same_state_synchronize_srcu() is to make that
> > equality comparison work no matter what debugging information we might
> > need to put in the cookie returned from get_state_synchronize_srcu().
> > Or, for that matter, poll_state_synchronize_srcu().
> > 
> > So please use same_state_synchronize_srcu(objs->seq, seq) here.
> 
> Done

Thank you!
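To illustrate the point for the archives, here is a userspace toy with a
purely hypothetical cookie layout (the real encoding is an implementation
detail of SRCU, which is exactly why raw == comparison is wrong):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cookie layout: sequence state in the high bits, debug
 * payload in the low two bits.  Invented for illustration only. */
#define DEBUG_MASK 0x3UL

static unsigned long make_cookie(unsigned long seq, unsigned long dbg)
{
	return (seq << 2) | (dbg & DEBUG_MASK);
}

/* Stand-in for same_state_synchronize_srcu(): compare only the
 * grace-period state, ignoring any debug payload. */
static bool same_state(unsigned long a, unsigned long b)
{
	return (a >> 2) == (b >> 2);
}
```

Two cookies for the same grace-period state then compare unequal with ==,
but equal with same_state().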

> > I wasn't kidding.  That call to __start_poll_synchronize_rcu() absolutely
> > must precede that call to get_finished_items().  ;-)
> 
> Except start_poll_synchronize_rcu() is higher overhead than
> get_state_synchronize_rcu() - the idea is to avoid calling it when we
> know it's not necessary because we already have items for that sequence
> number.

You can probably instead call get_state_synchronize_srcu() to begin
with, and let the later call_srcu() start the grace period if needed.
The only exception I can imagine to this right off-hand is under heavy
load, where you might need to start the grace period sooner rather than
later.  But that is easy -- if you have too many objects piled up, call
poll_state_synchronize_srcu() instead of get_state_synchronize_srcu().

And there are ways of avoiding even that get_state_synchronize_srcu()
in the common case, but at the cost of additional complexity.  So later,
when and if.
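A sketch of that heuristic, with get_cookie() and start_gp() as userspace
stand-ins for get_state_synchronize_srcu() and
poll_state_synchronize_srcu(), and a made-up backlog threshold:

```c
#include <assert.h>
#include <stdbool.h>

#define BACKLOG_LIMIT 128	/* invented threshold for illustration */

static unsigned long gp_seq = 100;	/* stand-in for the SRCU gp counter */
static bool gp_started;

/* ~get_state_synchronize_srcu(): cheap snapshot, no grace period started. */
static unsigned long get_cookie(void)
{
	return gp_seq + 4;
}

/* ~poll_state_synchronize_srcu(): same cookie, but kicks the grace period. */
static unsigned long start_gp(void)
{
	gp_started = true;
	return gp_seq + 4;
}

/* Normally take the cheap snapshot and let the later call_srcu() start
 * the grace period; under heavy load, start it sooner rather than later. */
static unsigned long grab_cookie(unsigned int backlog)
{
	return backlog > BACKLOG_LIMIT ? start_gp() : get_cookie();
}
```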

> But I think there must be a better way to do this, and we shouldn't be
> calling start_poll() multiple times when we're looping, yeah.

Whew!!!  ;-)

> > > +	struct rcu_head **entry;
> > > +add:
> > > +	entry = genradix_ptr_alloc(&objs->objs, objs->nr, GFP_ATOMIC|__GFP_NOWARN);
> > > +	if (likely(entry)) {
> > > +		*entry = obj;
> > > +		objs->nr++;
> > > +	} else if (likely(!alloc_failed)) {
> > > +		spin_unlock_irqrestore(&p->lock, flags);
> > > +		alloc_failed = !genradix_ptr_alloc(&objs->objs, objs->nr, GFP_KERNEL);
> > 
> > Just so you know, the corresponding part of kfree_rcu() was the subject
> > of much churn as the mm system evolved.  But you can only go through this
> > retry loop twice, so I am OK, with it.
> > 
> > But you really want to keep this structure, you also really want to call
> > __start_poll_synchronize_rcu() before the retry: label.
> > 
> > If nothing else, you can check the cookie on allocation failure, and if it
> > has expired, just immediately free the object.
> 
> Yeah, that's a good thought.
> 
> > > +	if (!p->rcu_armed) {
> > > +		call_srcu(pending->srcu, &p->rcu, pending_rcu_items_rcu_cb);
> > > +		p->rcu_armed = true;
> > 
> > In the other order, please!
> > 
> > Yes, you have interrupts disabled here, but you are not in an SRCU
> > read-side critical section, and a hypervisor isn't necessarily going to
> > know or care about your interrupt-disabled state.  So the way that the
> > code is written, you really can have the SRCU callback invoked before
> > setting ->rcu_armed to true, which might not be a good thing.
> > 
> > Especially given that it won't happen very often.  Except on large
> > fleets.  ;-)
> 
> Noted :)
> 
> > > +struct rcu_head *bch2_pending_rcu_item_dequeue_from_all(struct pending_rcu_items *pending)
> > > +{
> > > +	struct rcu_head *ret = NULL;
> > > +	int cpu;
> > > +	for_each_possible_cpu(cpu) {
> > > +		ret = pending_rcu_item_pcpu_dequeue(per_cpu_ptr(pending->p, cpu));
> > > +		if (ret)
> > > +			break;
> > > +	}
> > > +	return ret;
> > > +}
> > 
> > I assume that the point of these guys is to recycle objects that are
> > waiting for their RCU grace period, similar to SLAB_TYPESAFE_BY_RCU.
> > Readers would of course need the same sort of checks as used by readers
> > traversing memory from SLAB_TYPESAFE_BY_RCU slabs.
> > 
> > If my assumption is correct, handing out the oldest items minimizes the
> > amount of retrying these readers need to do.
> 
> I hand out the newest to minimize stranding - if we're trying to return
> memory to the rest of the system, we don't want to reuse the objects
> we're about to be able to free.

Fair enough, and I agree that it is not a trivial tradeoff.

							Thanx, Paul



