Re: srcu_cleanup warning

On Fri, Jun 14, 2024 at 04:14:29PM -0400, Kent Overstreet wrote:
> On Fri, Jun 14, 2024 at 07:11:50AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 13, 2024 at 08:22:19PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 13, 2024 at 02:32:27PM -0400, Kent Overstreet wrote:
> > 
> > [ . . . ]
> > 
> > After sleeping on it, backing up a level of abstraction, and thus perhaps
> > responding a bit more constructively...
> > 
> > And also, thank you very much for looking into this!
> > 
> > > > commit 3bb5d34d6d6e2d5433cce5d248fd81211afedec0
> > > > Author: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > > > Date:   Mon Jun 10 20:47:03 2024 -0400
> > > > 
> > > >     bcachefs: pending_rcu_items
> > > >     
> > > >     Generic (?) data structure for explicitly tracking pending SRCU items
> > > >     
> > > >     - Can we make this generic across other RCU flavors?
> > 
> > It should be possible, at least for current RCU and SRCU.  I think I owe
> > you SRCU variants of a couple of existing RCU functions.  Later today!
> 
> I was wondering how much commonality there is between the core data
> structures of RCU and SRCU. If RCU internally has something analogous to
> srcu_struct, and we could refer to that object without caring if it's
> for RCU or SRCU, that would be cool... otherwise, we can just do it a
> level up with switch statements.

There is a lot of commonality at the conceptual level, but lots of
differences due to radically different design tradeoffs.  And don't
forget about the Tiny variants of both RCU and SRCU.  (With the current
API, you shouldn't need to care.)  Things will get more fun if any of
the three RCU Tasks variants need polling interfaces.  Not a big deal,
though, either a pointer for SRCU or a small-integer enum for the others.

And I haven't heard any hint of a need for polled interfaces for any of
the Tasks RCU variants.
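For what it's worth, doing it "a level up with switch statements", as Kent
suggests, might look roughly like the userspace sketch below.  The flavor
tag carries the srcu_struct pointer for SRCU, per the pointer-vs-enum note
above; the kernel function names appear only in comments, and the bodies
are stub counters so the model is self-contained (none of this is the
kernel implementation).

```c
#include <assert.h>
#include <stdbool.h>

/* Stub stand-ins for the real grace-period machinery. */
struct srcu_struct { unsigned long gp_seq; };
static unsigned long rcu_gp_seq;

struct gp_flavor {
	enum { GP_RCU, GP_SRCU } kind;
	struct srcu_struct *srcu;	/* used only for GP_SRCU */
};

/* Grab a cookie that expires after one more grace period. */
static unsigned long gp_get_cookie(struct gp_flavor *f)
{
	switch (f->kind) {
	case GP_RCU:  return rcu_gp_seq + 1;	  /* ~ get_state_synchronize_rcu() */
	case GP_SRCU: return f->srcu->gp_seq + 1; /* ~ get_state_synchronize_srcu() */
	}
	return 0;
}

/* Has the cookie's grace period completed? */
static bool gp_poll(struct gp_flavor *f, unsigned long cookie)
{
	switch (f->kind) {
	case GP_RCU:  return rcu_gp_seq >= cookie;	/* ~ poll_state_synchronize_rcu() */
	case GP_SRCU: return f->srcu->gp_seq >= cookie;	/* ~ poll_state_synchronize_srcu() */
	}
	return false;
}
```

Callers then carry one struct gp_flavor per pending_rcu_items instance and
never need to care which flavor they are dealing with.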

> > > >     - Do we need to add a fallback to linked lists when darray allocation
> > > >       fails?
> > 
> > In general, as in kfree_rcu(), yes.  In your specific case here, maybe not.
> > 
> > > >     - We wish to avoid the overhead of start_poll_synchronize_srcu() if we
> > > >       already have items for a given sequence number, but it appears
> > > >       get_poll_synchronize_srcu(); start_poll_synchronize_srcu() can
> > > >       actually return different sequence numbers within the same read-side
> > > >       critical section. Is this the correct way of doing this?
> > 
> > You can always buffer up items that have not yet been assigned a sequence
> > number, then after you have enough or after enough time has passed, do a
> > single get_poll_synchronize_srcu() for the group.
> 
> I don't think we'd want to do that, since that would be delaying the
> point at which objects get reclaimed. Have a look at my latest version
> and tell me if you think that works? It seems to work in my testing.

Ah, it did arrive.  I will take a look by the end of Monday (maybe
sooner, depending).

If a grace period is currently in progress, there is no penalty for
batching.  Huh.  If the overhead of that get_poll_synchronize_srcu()
becomes a problem, there are some tricks we can play.  But let's get
the basics working well first.
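To make the no-penalty-for-batching point concrete, here is a minimal
userspace model of the polling pattern; the function names echo the kernel
API but the grace-period machinery is just a stub counter, purely for
illustration.  Items that arrive while a cookie's grace period is still in
flight can share that cookie, so a batch costs only one get.

```c
#include <assert.h>
#include <stdbool.h>

/* Stub: count of completed grace periods. */
static unsigned long gp_seq;

/* ~ get_poll_synchronize_srcu(): cookie expires after one more GP. */
static unsigned long get_poll_cookie(void)
{
	return gp_seq + 1;
}

/* ~ poll_state_synchronize_srcu(): has the cookie's GP completed? */
static bool poll_cookie(unsigned long cookie)
{
	return gp_seq >= cookie;
}

/* Stand-in for the grace-period machinery making progress. */
static void run_grace_period(void)
{
	gp_seq++;
}
```

While poll_cookie() on the current batch's cookie is still false, new
objects can simply be appended to that batch without another get.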

> > > >     - Do we need to ensure the rearming RCU callback is correctly flushed
> > > >       when exiting? (Probably)
> > 
> > I believe so, and setting a flag then doing an srcu_barrier() should suffice.
> 
> Is the only difference between srcu_barrier() and synchronize_srcu()
> that srcu_barrier() is a noop if callbacks weren't enqueued?

They are orthogonal.  Keep in mind that synchronize_srcu() does not
necessarily wait on any callbacks, just like srcu_barrier() does not
necessarily wait on any grace period.

Now, as you might have noticed, as an accident of implementation on
systems with fewer than 128 CPUs, if a given srcu_struct never has
had significant grace-period activity, there will be a single list
of callbacks, so synchronize_srcu() will wait for all pre-existing
callbacks.  But if there are more CPUs or if there is heavy update-side
load, then there will be one callback queue per CPU, and in that case
synchronize_srcu() will wait only for pre-existing callbacks on its CPU.

And recent work means that synchronize_rcu() does not necessarily ever
wait on callbacks, and perhaps some day that same optimization will be
applied to SRCU.
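The "set a flag, then srcu_barrier()" shutdown suggested above can be
modeled as follows; 'queued' stands in for the callback queue and the
drain loop plays the role of srcu_barrier() running out already-queued
callbacks.  Everything here is an illustrative stub, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

static bool exiting;
static int queued;	/* callbacks currently queued (model state) */

/* The rearming callback re-queues itself only while 'exiting' is clear. */
static void rearming_cb(void)
{
	queued--;
	if (!exiting)
		queued++;	/* re-arm: queue ourselves again */
}

/* ~ "set a flag, then do an srcu_barrier()" */
static void shutdown_flush(void)
{
	exiting = true;
	while (queued)		/* barrier analogue: drain queued callbacks */
		rearming_cb();
}
```

Without the flag, the rearming callback could re-queue itself behind the
barrier forever; with it, one barrier pass suffices.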

> > > > +static void pending_rcu_items_process(struct pending_rcu_items *pending)
> > > >  {
> > > > -	struct bkey_cached *ck = container_of(rcu, struct bkey_cached, rcu);
> > > > +	while (pending_rcu_has_items(pending) &&
> > > > +	       poll_state_synchronize_srcu(pending->srcu, pending->seq)) {
> > > > +		darray_for_each(pending->objs[0], i)
> > > > +			pending->process(pending, *i);
> > 
> > Given that you have interrupts and/or bh disabled, there needs to be a
> > third set of callbacks, those whose grace periods are already elapsed.
> 
> Not following? The pending->process() is exactly that callback.
> 
> Also, the new version does processing in process context, without the
> lock.

OK, I will look at the new version.

In the meantime, I am guessing that there is some delay between the end
of a grace period and the processing of that grace period's callbacks,
and if that delay is perceptible, those callbacks will need to be gotten
out of the way of new callbacks.

> > Unioning one of the ->objs[] sets onto this third set, leaving the source
> > ->objs[] set empty, should strongly be O(1) without the need for memory
> > allocation.  I might be missing something, but at first glance darray
> > does not support this.
> > 
> > The pages-of-pointers structure named kvfree_rcu_bulk_data, along with
> > the associated structures in kernel/rcu/tree.c does this work, but (1)
> > is open-coded and (2) has to deal with the full range of what the kernel
> > can throw at it.  It falls back to an embedded rcu_head structure, in
> > response to your second question in the commit log.
> > 
> > But perhaps this code could be generalized?  Or maybe darray can be
> > appropriately extended?
> 
> Not sure what you mean by "unioning onto a third set"?

Huge piles of objects are coming into this system.  As grace periods
expire, a goodly chunk of them need to be processed.  This processing
can take significant time, quite possibly multiple SRCU grace periods'
worth.  Unless the arrival of new callbacks can be blocked while these
callbacks are being processed, you need a place to accumulate the
callbacks as their grace periods expire.
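Concretely, "unioning one ->objs[] set onto the third set" is just an O(1)
list splice when the sets are linked lists with a tail pointer: no
traversal, no allocation.  The structure names below are illustrative
stand-ins, not bcachefs's actual types.

```c
#include <assert.h>
#include <stddef.h>

struct obj {
	struct obj *next;
};

struct obj_list {
	struct obj *head;
	struct obj **tail;	/* points at the final ->next, or at head */
};

static void list_init(struct obj_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

static void list_add(struct obj_list *l, struct obj *o)
{
	o->next = NULL;
	*l->tail = o;
	l->tail = &o->next;
}

/* Splice all of @src onto the end of @dst, leaving @src empty: O(1). */
static void list_splice(struct obj_list *dst, struct obj_list *src)
{
	if (src->head) {
		*dst->tail = src->head;
		dst->tail = src->tail;
		list_init(src);
	}
}
```

When a set's grace period expires, splicing it onto the "expired" list
gets it out of the way of newly arriving callbacks immediately, and the
slow per-object processing can then proceed at its leisure.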

> > And one thing that I missed the first time around is that you are using
> > a single ->seq for two sets of objects (->objs[0] and ->objs[1]).
> > Each set needs its own sequence number.
> 
> I've got this changed in my latest version, but - I did see one BUG_ON()
> where we seem to have three sequence numbers in flight. Any idea what
> might be up with that?

Although at any given time there will be at most two unexpired cookie
values, time delays can make it appear as if there are more.  If you
see three, one of them has expired, but you might have checked it
before it got around to expiring, and thus still count it as pending.

That explanation might or might not make sense, but I will look at your
new code and see what is up.

							Thanx, Paul
