Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Tue, 28 Jul 2020 17:31:24 -0700

On Tue, Jul 28, 2020 at 04:46:23PM -0700, Sagi Grimberg wrote:
> Hey Paul,
> 
> > Indeed you cannot.  And if you build with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
> > it will yell at you when you try.
> > 
> > You -can- pass on-stack rcu_head structures to call_srcu(), though,
> > if that helps.  You of course must have some way of waiting for the
> > callback to be invoked before exiting that function.  This should be
> > easy for me to package into an API, maybe using one of the existing
> > reference-counting APIs.
> > 
> > So, do you have a separate stack frame for each of the desired call_srcu()
> > invocations?  If not, do you know at build time how many rcu_head
> > structures you need?  If the answer to both of these is "no", then
> > it is likely that there needs to be an rcu_head in each of the relevant
> > data structures, as was noted earlier in this thread.
> > 
> > Yeah, I should go read the code.  But I would need to know where it is
> > and it is still early in the morning over here!  ;-)
> > 
> > I probably should also have read the remainder of the thread before
> > replying, as well.  But what is the fun in that?
> 
> The use-case is to quiesce submissions to queues. This flow is where we
> want to teardown stuff, and we can potentially have 1000's of queues
> that we need to quiesce each one.
> 
> each queue (hctx) has either rcu or srcu depending if it may sleep
> during submission.
> 
> The goal is that the overall quiesce should be fast, so we want
> to wait for all of these queues elapsed period ~once, in parallel,
> instead of synchronizing each serially as done today.
> 
> The guys here are resisting to add a rcu_synchronize to each and
> every hctx because it will take 32 bytes more or less from 1000's
> of hctxs.
> 
> Dynamically allocating each one is possible but not very scalable.
> 
> The question is if there is some way, we can do this with on-stack
> or a single on-heap rcu_head or equivalent that can achieve the same
> effect.

If the hctx structures are guaranteed to stay put, you could count
them and then do a single allocation of an array of rcu_head structures
(or some larger structure containing an rcu_head structure, if needed).
You could then sequence through this array, consuming one rcu_head per
hctx as you processed it.  Once all the callbacks had been invoked,
it would be safe to free the array.

Sounds too simple, though.  So what am I missing?

							Thanx, Paul