Re: Improving RCU wait in mini_qdisc_pair_swap()

On Wed, Oct 20, 2021 at 03:44:30PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> > Hi Paul,
> 
> Hello, Seth!
> 
> Adding the rcu list on CC in case others are seeing similar issues.
> 
> > You got some questions the other day from (I believe) Mathieu Desnoyers,
> > whom a colleague of mine had reached out to about this problem. Thanks a
> > lot for your suggestions! I was headed in the right direction, but I
> > wouldn't have got there as quickly, and I'm not sure I would have known
> > to use start_poll_synchronize_rcu(). I do have a question though, so I
> > thought I'd reach out to you about it while I continue to test the
> > patch.
> > 
> > To refresh your memory, or fill in bits that might not have made it to
> > you: we're seeing a lot of time spent in rcu_barrier() from
> > mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> > of network connections for VMs. I'm trying to make this go faster.
> > 
> > My patch is pasted below, which speeds things up drastically, by a
> > factor of 10,000 or so in most cases. My question is about initializing
> > mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> > initialized it from get_state_synchronize_rcu() when the structure is
> > first initialized. That patch didn't really help. I suspected this was
> > because the time between init and calling mini_qdisc_pair_swap() was
> > short enough that a grace period had not elapsed, so in the patch below
> > rcu_state is initialized to 0 (because the memory is kzalloc-ed). I
> > think this should be safe, based on my understanding of how the cookie
> > works and the fact that no other task yet has a reference to either
> > buffer. Is this correct?
> 
> Let me see if I understand your reasoning.  Right after creating the
> mini Qdisc, someone might be accessing the current buffer, but no one
> ever was accessing the non-current buffer.  So on the initial swap
> (and -only- on the initial swap), there are no readers to wait for.
> Is that your line of reasoning?  If so, sort-of OK, but let's talk about
> more organized ways of handling this.

That's almost right, just missing a couple of details. Initially neither
buffer (let's call them miniq1 and miniq2 for clarity) is active, and
*p_miniq is NULL until the first mini_qdisc_pair_swap() call, so there
can be no readers for either. On the first swap miniq1 becomes active,
and there are no readers to wait for. On the second swap there are no
readers of miniq2 to wait for, and at this time miniq1->rcu_state will
be set. On the third swap we might finally have readers of miniq1 to
wait for.
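
In case it's useful, the shape of what my patch does on each swap is
roughly this (a simplified sketch rather than the actual diff; miniq_new
here stands for whichever of miniq1/miniq2 is about to become active,
miniq_old for the one being replaced):

        /* Wait only if the grace period that started when this buffer
         * was last retired hasn't completed yet.  rcu_state is still 0
         * from kzalloc until the buffer has been retired once, which
         * (as I understand the cookie) counts as already expired, so
         * the first use of each buffer doesn't wait at all.
         */
        cond_synchronize_rcu(miniq_new->rcu_state);

        rcu_assign_pointer(*p_miniq, miniq_new);

        if (miniq_old)
                /* Record a cookie for the grace period that must end
                 * before miniq_old can safely be reused.
                 */
                miniq_old->rcu_state = start_poll_synchronize_rcu();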

Also, this function is a little different from the way your examples are
structured in that oldstate is saved at the end of the function, and the
wait on oldstate happens in a subsequent call to the function. So it's
more like:

        oldstate = start_poll_synchronize_rcu();
        return;

        // Some unknown amount of time passes ...

        cond_synchronize_rcu(oldstate);

So I suspect that it wouldn't be uncommon for enough time to have passed
that there are no readers to wait for.

But that brings us to an important point. I'm not a networking expert and
I don't know what common patterns are for these swaps, so I'm probably
ill-equipped to say what makes the most sense beyond the specific
scenario I'm trying to improve. These swaps will happen when setting up
QoS on a network interface, so if I had to guess I'd think the most
common pattern would be short bursts of activity when interfaces are
set up. I know the pattern in the use case I've been debugging, which
sees a lot of "first swaps" for different instances of mini_Qdisc_pair,
and a smaller number of "second swaps" about 200 ms after the first,
which means we're paying a high price for the rcu_barrier() when there
can be no readers. Other scenarios might result in multiple swaps on the
same instance within a short time period, but I can't say for sure.

> However, another approach is something like this:
> 
> 	oldstate = start_poll_synchronize_rcu();
> 
> 	// do something.
> 
> 	if (!poll_state_synchronize_rcu(oldstate))
> 		synchronize_rcu_expedited();
> 
> This will be roughly an order of magnitude faster than you would see with
> your current cond_synchronize_rcu().  And it will speed up all swaps,
> not just the first one.  On the other hand, synchronize_rcu_expedited()
> sends IPIs to all non-idle CPUs, which is often OK, but not so much for
> heavy real-time workloads.
> 
> So you can let the heavy real-time workloads
> have slowish Qdisc switching and IPI the IPI-insensitive workloads:
> 
> 	oldstate = start_poll_synchronize_rcu();
> 
> 	// do something.
> 
> 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> 		cond_synchronize_rcu(oldstate);
> 	else if (!poll_state_synchronize_rcu(oldstate))
> 		synchronize_rcu_expedited();

I was concerned about the system-wide impact of synchronize_rcu_expedited(),
but this at least avoids the expedited grace period when it isn't needed.
If it's bad form to leave mini_Qdisc.rcu_state initialized to 0, this
option is preferable from my perspective.
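
Applied to the swap path, I think that would look something like this
(again just a sketch, with miniq_new->rcu_state being the cookie recorded
by start_poll_synchronize_rcu() when that buffer was last retired):

        if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
                /* Avoid IPIing heavy real-time workloads; only wait if
                 * the grace period hasn't completed on its own.
                 */
                cond_synchronize_rcu(miniq_new->rcu_state);
        } else if (!poll_state_synchronize_rcu(miniq_new->rcu_state)) {
                /* Grace period still in flight, so expedite it. */
                synchronize_rcu_expedited();
        }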

> 
> Yet another approach is this:
> 
> 	oldstate = start_poll_synchronize_rcu();
> 
> 	// do something.
> 
> 	while (!poll_state_synchronize_rcu(oldstate))
> 		schedule_timeout_idle(1);
> 
> The advantage of this one is that you get credit for partial progress
> through grace periods, but it won't be anywhere near as fast as
> synchronize_rcu_expedited().

If we need to initialize mini_Qdisc.rcu_state up front, this approach
probably won't improve things much for my problem.

Thanks,
Seth
