Re: Improving RCU wait in mini_qdisc_pair_swap()

On Thu, Oct 21, 2021 at 09:15:22AM -0500, Seth Forshee wrote:
> On Wed, Oct 20, 2021 at 03:44:30PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> > > Hi Paul,
> > 
> > Hello, Seth!
> > 
> > Adding the rcu list on CC in case others are seeing similar issues.
> > 
> > > You got some questions the other day from (I believe) Mathieu Desnoyers,
> > > whom a colleague of mine had reached out to about this problem. Thanks a
> > > lot for your suggestions! I was headed in the right direction, but I
> > > wouldn't have got there as quickly, and I'm not sure I would have known
> > > to use start_poll_synchronize_rcu(). I do have a question though, so I
> > > thought I'd reach out to you about it while I continue to test the
> > > patch.
> > > 
> > > To refresh your memory, or fill in bits that might not have made it to
> > > you: we're seeing a lot of time spent in rcu_barrier() from
> > > mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> > > of network connections for VMs. I'm trying to make this go faster.
> > > 
> > > My patch is pasted below; it speeds things up drastically, by a factor
> > > of 10,000 or so in most cases. My question is about initializing
> > > mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> > > initialized it from get_state_synchronize_rcu() when the structure is
> > > first initialized. That patch didn't really help. I suspected this was
> > > because the time between init and calling mini_qdisc_pair_swap() was
> > > short enough that a grace period had not elapsed, so in the patch below
> > > rcu_state is initialized to 0 (due to the fact that the memory is
> > > kzalloc-ed). I think this should be safe, based on my understanding of
> > > how the cookie works and the fact that no other task yet has a reference
> > > to either buffer. Is this correct?
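> > > 
> > > For reference, the reworked swap path looks roughly like this (a
> > > simplified sketch of the idea rather than the full patch):
> > > 
> > >         /* Wait only if the grace period recorded when this buffer was
> > >          * last retired has not yet elapsed.  rcu_state is 0 from
> > >          * kzalloc for a buffer that has never been exposed to readers.
> > >          */
> > >         cond_synchronize_rcu(miniq->rcu_state);
> > > 
> > >         miniq->filter_list = tp_head;
> > >         rcu_assign_pointer(*miniqp->p_miniq, miniq);
> > > 
> > >         if (miniq_old)
> > >                 /* Record a cookie now; any waiting happens the next
> > >                  * time this buffer is about to be reused.
> > >                  */
> > >                 miniq_old->rcu_state = start_poll_synchronize_rcu();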
> > 
> > Let me see if I understand your reasoning.  Right after creating the
> > mini Qdisc, someone might be accessing the current buffer, but no one
> > ever was accessing the non-current buffer.  So on the initial swap
> > (and -only- on the initial swap), there are no readers to wait for.
> > Is that your line of reasoning?  If so, sort-of OK, but let's talk about
> > more organized ways of handling this.
> 
> That's almost right, just missing a couple of details. Initially neither
> buffer (let's call them miniq1 and miniq2 for clarity) is active, and
> *p_miniq is NULL until the first mini_qdisc_pair_swap() call, so there
> can be no readers for either. On the first swap miniq1 becomes active,
> and there are no readers to wait for. On the second swap there are no
> readers of miniq2 to wait for, and at this time miniq1->rcu_state will
> be set. On the third swap we might finally have readers of miniq1 to
> wait for.
> 
> Also, this function is a little different from the way your examples are
> structured in that oldstate is saved at the end of the function, and the
> wait on oldstate happens in a subsequent call to the function. So it's
> more like:
> 
>         oldstate = start_poll_synchronize_rcu();
>         return;
> 
>         // Some unknown amount of time passes ...
> 
>         cond_synchronize_rcu(oldstate);
> 
> So I suspect that it wouldn't be uncommon for enough time to have passed
> that there are no readers to wait for.
> 
> But that brings us to an important point. I'm not a networking expert, I
> don't know what common patterns are for these swaps, so I'm probably
> ill-equipped to say what makes the most sense beyond the specific
> scenario I'm trying to improve. These swaps will happen when setting up
> QoS on a network interface, so if I had to guess I'd think the most
> common pattern would be short bursts of activity when interfaces are
> set up. I know the pattern in the use case I've been debugging, which
> sees a lot of "first swaps" for different instances of mini_Qdisc_pair,
> and a smaller number of "second swaps" about 200 ms after the first,
> which means we're paying a high price for the rcu_barrier() when there
> can be no readers. Other scenarios might result in multiple swaps on the
> same instance within a short time period, but I can't say for sure.

Very good!  We can pool our ignorance of networking!  ;-)

If the common case really is mostly just the one swap, I can easily give
you an API that hands you an already-expired "oldstate".  Full disclosure:
On 32-bit systems, after 2^31 grace periods (maybe as quickly as three
months of runtime), the "oldstate" would no longer appear to be at
all old.  And would take another 2^31 grace periods to look old again.

However, that is an unusual API, so I would want to know that it was
really useful before creating it.

Plus if you start early enough at boot, the value of zero will not look
old, just so you know.  ;-)
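
If it helps to picture it, the usage might look something like this
(the API name here is purely illustrative; no such API exists today):

        /* Hypothetical: return a cookie that poll_state_synchronize_rcu()
         * and cond_synchronize_rcu() always treat as an already-expired
         * grace period (wraparound caveats aside).
         */
        unsigned long get_completed_synchronize_rcu(void);

        /* Then mini_qdisc_pair_init() could seed both buffers: */
        miniqp->miniq1.rcu_state = get_completed_synchronize_rcu();
        miniqp->miniq2.rcu_state = get_completed_synchronize_rcu();

That way the first swap of each buffer never waits, without relying on
the kzalloc-ed zero happening to look old.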

> > However, another approach is something like this:
> > 
> > 	oldstate = start_poll_synchronize_rcu();
> > 
> > 	// do something.
> > 
> > 	if (!poll_state_synchronize_rcu(oldstate))
> > 		synchronize_rcu_expedited();
> > 
> > This will be roughly an order of magnitude faster than you would see with
> > your current cond_synchronize_rcu().  And it will speed up all swaps,
> > not just the first one.  On the other hand, synchronize_rcu_expedited()
> > sends IPIs to all non-idle CPUs, which is often OK, but not so much for
> > heavy real-time workloads.
> > 
> > So you can let the heavy real-time workloads
> > have slowish Qdisc switching and IPI the IPI-insensitive workloads:
> > 
> > 	oldstate = start_poll_synchronize_rcu();
> > 
> > 	// do something.
> > 
> > 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > 		cond_synchronize_rcu(oldstate);
> > 	else if (!poll_state_synchronize_rcu(oldstate))
> > 		synchronize_rcu_expedited();
> 
> I was concerned about the system-wide impact synchronize_rcu_expedited()
> has, but this at least avoids doing it if it's unnecessary. If it's bad
> form to leave mini_Qdisc.rcu_state initialized to 0, this option is
> preferable from my perspective.

This might well be the best approach if multiple swaps are at all common.

But if needed, we really could do both:  Make initial swaps really fast,
and improve the performance of later swaps.
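
Roughly, combining the two (again only a sketch, and again assuming the
hypothetical pre-expired-cookie API above):

        /* At init time: cookies that already look expired. */
        miniqp->miniq1.rcu_state = get_completed_synchronize_rcu();
        miniqp->miniq2.rcu_state = get_completed_synchronize_rcu();

        /* At swap time: skip the wait entirely when the recorded grace
         * period has already elapsed, otherwise bound it.
         */
        if (!poll_state_synchronize_rcu(miniq->rcu_state))
                synchronize_rcu_expedited();

        /* ... update and publish the new miniq, then record a fresh
         * cookie for the buffer being retired:
         */
        miniq_old->rcu_state = start_poll_synchronize_rcu();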

> > Yet another approach is this:
> > 
> > 	oldstate = start_poll_synchronize_rcu();
> > 
> > 	// do something.
> > 
> > 	while (!poll_state_synchronize_rcu(oldstate))
> > 		schedule_timeout_idle(1);
> > 
> > The advantage of this one is that you get credit for partial progress
> > through grace periods, but it won't be anywhere near as fast as
> > synchronize_rcu_expedited().
> 
> If we need to initialize mini_Qdisc.rcu_state up front this approach
> probably won't improve things much for my problem.

It all depends on how long between swaps.  ;-)

							Thanx, Paul


