I like the tagset-based interface. But the idea of doing a per-hctx allocation and wait doesn't seem very scalable. Paul, do you have any good ideas for an interface that waits on multiple srcu heads? As far as I can tell we could just have a single global completion and counter, and each call_srcu callback would just decrement it, and then the final one would do the wakeup. It would just be great to figure out a way to keep the struct rcu_synchronize and counter on the stack to avoid an allocation. But if we can't do it with an on-stack object I'd much rather just embed the rcu_head in the hw_ctx.
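To make the counter-plus-completion idea concrete, here is a rough userspace sketch of the pattern, not kernel code. Everything in it is a stand-in I made up for illustration: struct multi_wait plays the role of the on-stack completion and counter, struct fake_head stands in for an rcu_head embedded in each hw_ctx, and the threads stand in for call_srcu callbacks firing after their grace periods. In the kernel this would presumably use an atomic_t and a struct completion instead of the pthread primitives.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_QUEUES 64   /* arbitrary limit for this demo only */

/* Hypothetical analog of an on-stack completion plus counter. */
struct multi_wait {
    atomic_int pending;         /* callbacks that have not run yet */
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    bool done;
};

/* Hypothetical stand-in for an rcu_head embedded in each hw_ctx. */
struct fake_head {
    struct multi_wait *mw;
};

static void multi_wait_init(struct multi_wait *mw, int n)
{
    atomic_init(&mw->pending, n);
    pthread_mutex_init(&mw->lock, NULL);
    pthread_cond_init(&mw->done_cv, NULL);
    mw->done = (n == 0);        /* nothing to wait for */
}

/* What each callback does: decrement, and the final one wakes the waiter. */
static void multi_wait_cb(struct fake_head *head)
{
    struct multi_wait *mw = head->mw;

    if (atomic_fetch_sub(&mw->pending, 1) == 1) {
        pthread_mutex_lock(&mw->lock);
        mw->done = true;
        pthread_cond_signal(&mw->done_cv);
        pthread_mutex_unlock(&mw->lock);
    }
}

static void multi_wait_wait(struct multi_wait *mw)
{
    pthread_mutex_lock(&mw->lock);
    while (!mw->done)
        pthread_cond_wait(&mw->done_cv, &mw->lock);
    pthread_mutex_unlock(&mw->lock);
}

/* Each thread simulates one callback being invoked asynchronously. */
static void *fake_grace_period(void *arg)
{
    multi_wait_cb(arg);
    return NULL;
}

/* Returns the leftover count after waiting (0 on success), -1 on bad n. */
static int demo_wait_all(int n)
{
    struct multi_wait mw;               /* lives on the caller's stack */
    struct fake_head heads[MAX_QUEUES];
    pthread_t threads[MAX_QUEUES];
    int i, rc, left;

    if (n < 0 || n > MAX_QUEUES)
        return -1;

    multi_wait_init(&mw, n);
    for (i = 0; i < n; i++) {
        heads[i].mw = &mw;
        rc = pthread_create(&threads[i], NULL, fake_grace_period, &heads[i]);
        assert(rc == 0);
        (void)rc;
    }

    multi_wait_wait(&mw);               /* one sleep for all n callbacks */

    /* Join before mw goes out of scope so no callback can touch it late. */
    for (i = 0; i < n; i++)
        pthread_join(threads[i], NULL);

    left = atomic_load(&mw.pending);
    pthread_cond_destroy(&mw.done_cv);
    pthread_mutex_destroy(&mw.lock);
    return left;
}
```

Calling demo_wait_all(8) starts eight simulated callbacks and sleeps once until the last one has run. The join before returning matters for the on-stack variant: the waiter must not let the object go out of scope while a callback might still reference it, which is the same lifetime question the kernel version would have to answer.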