On Tue, Jul 28, 2020 at 04:46:23PM -0700, Sagi Grimberg wrote:
> Hey Paul,
>
> > Indeed you cannot. And if you build with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
> > it will yell at you when you try.
> >
> > You -can- pass on-stack rcu_head structures to call_srcu(), though,
> > if that helps. You of course must have some way of waiting for the
> > callback to be invoked before exiting that function. This should be
> > easy for me to package into an API, maybe using one of the existing
> > reference-counting APIs.
> >
> > So, do you have a separate stack frame for each of the desired call_srcu()
> > invocations? If not, do you know at build time how many rcu_head
> > structures you need? If the answer to both of these is "no", then
> > it is likely that there needs to be an rcu_head in each of the relevant
> > data structures, as was noted earlier in this thread.
> >
> > Yeah, I should go read the code. But I would need to know where it is
> > and it is still early in the morning over here! ;-)
> >
> > I probably should also have read the remainder of the thread before
> > replying, as well. But what is the fun in that?
>
> The use-case is to quiesce submissions to queues. This flow is where we
> want to teardown stuff, and we can potentially have 1000's of queues
> that we need to quiesce each one.
>
> each queue (hctx) has either rcu or srcu depending if it may sleep
> during submission.
>
> The goal is that the overall quiesce should be fast, so we want
> to wait for all of these queues elapsed period ~once, in parallel,
> instead of synchronizing each serially as done today.
>
> The guys here are resisting to add a rcu_synchronize to each and
> every hctx because it will take 32 bytes more or less from 1000's
> of hctxs.
>
> Dynamically allocating each one is possible but not very scalable.
>
> The question is if there is some way, we can do this with on-stack
> or a single on-heap rcu_head or equivalent that can achieve the same
> effect.
If the hctx structures are guaranteed to stay put, you could count them
and then do a single allocation of an array of rcu_head structures (or
some larger structure containing an rcu_head structure, if needed).
You could then sequence through this array, consuming one rcu_head per
hctx as you processed it. Once all the callbacks had been invoked, it
would be safe to free the array.

Sounds too simple, though. So what am I missing?

							Thanx, Paul
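A minimal userspace sketch of the single-allocation scheme, under stated assumptions: struct rcu_head, mock_call_rcu() and mock_end_grace_period() below are mock stand-ins, NOT the kernel API, and quiesce_slot/quiesce_all() are hypothetical names. It shows the shape of the idea: one array with one rcu_head per hctx, all callbacks posted before any waiting so the grace periods overlap, and the array freed only after every callback has run. In the kernel, callbacks run asynchronously, so the plain counter would presumably become an atomic_t paired with a completion, and each hctx would post via call_rcu() or call_srcu() depending on whether it may sleep during submission.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Mock stand-ins for illustration only; NOT the kernel's struct rcu_head
 * or call_rcu(). A "grace period" here simply runs every queued callback.
 */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

static struct rcu_head *pending;	/* callbacks awaiting a grace period */

static void mock_call_rcu(struct rcu_head *head,
			  void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

static void mock_end_grace_period(void)
{
	while (pending) {
		struct rcu_head *h = pending;

		pending = h->next;
		h->func(h);
	}
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical per-hctx slot: one rcu_head plus a pointer to a shared count. */
struct quiesce_slot {
	struct rcu_head head;
	int *done;
};

static void quiesce_cb(struct rcu_head *head)
{
	struct quiesce_slot *slot = container_of(head, struct quiesce_slot, head);

	(*slot->done)++;
}

/*
 * Hypothetical helper: quiesce nr_hctx queues with a single array
 * allocation, consuming one rcu_head per hctx. Returns the number of
 * callbacks that ran, or -1 if the allocation failed.
 */
static int quiesce_all(int nr_hctx)
{
	struct quiesce_slot *slots;
	int done = 0;

	if (nr_hctx <= 0)
		return 0;
	slots = calloc((size_t)nr_hctx, sizeof(*slots));
	if (!slots)
		return -1;

	/* Post every callback first so all the waits overlap. */
	for (int i = 0; i < nr_hctx; i++) {
		slots[i].done = &done;
		mock_call_rcu(&slots[i].head, quiesce_cb);
	}

	/* Wait for all callbacks; a kernel version would block on a completion. */
	while (done < nr_hctx)
		mock_end_grace_period();

	/* Every callback has run, so nothing still references the array. */
	assert(pending == NULL);
	free(slots);
	return done;
}
```

For example, quiesce_all(1000) posts 1000 callbacks, waits once for them all, and returns 1000. The counter can live on the stack because quiesce_all() does not return until every callback that references it has been invoked.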