Re: Improving RCU wait in mini_qdisc_pair_swap()

On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> Hi Paul,

Hello, Seth!

Adding the rcu list on CC in case others are seeing similar issues.

> You got some questions the other day from (I believe) Mathieu Desnoyers,
> whom a colleague of mine had reached out to about this problem. Thanks a
> lot for your suggestions! I was headed in the right direction, but I
> wouldn't have got there as quickly, and I'm not sure I would have known
> to use start_poll_synchronize_rcu(). I do have a question, though, so I
> thought I'd reach out to you about it while I continue to test the
> patch.
> 
> To refresh your memory, or fill in bits that might not have made it to
> you: we're seeing a lot of time spent in rcu_barrier() from
> mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> of network connections for VMs. I'm trying to make this go faster.
> 
> My patch is pasted below, which speeds things up drastically, by a
> factor of 10,000 or so in most cases. My question is about initializing
> mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> set it from get_state_synchronize_rcu() when the structure is first
> initialized. That patch didn't really help. I suspected this was
> because the time between init and calling mini_qdisc_pair_swap() was
> short enough that a grace period had not elapsed, so in the patch below
> rcu_state is left at 0 (because the memory is kzalloc-ed). I think
> this should be safe, based on my understanding of
> how the cookie works and the fact that no other task yet has a reference
> to either buffer. Is this correct?

Let me see if I understand your reasoning.  Right after creating the
mini Qdisc, someone might be accessing the current buffer, but no one
ever was accessing the non-current buffer.  So on the initial swap
(and -only- on the initial swap), there are no readers to wait for.
Is that your line of reasoning?  If so, sort-of OK, but let's talk about
more organized ways of handling this.
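
Just to make sure we are talking about the same pattern, the zero-cookie
scheme would be something like this (a rough sketch, with names patterned
on mini_qdisc_pair_swap() rather than copied from your actual patch):

	/* Reuse the inactive buffer only once readers from its previous
	 * tenure as the active buffer are done.  A kzalloc()ed mini_Qdisc
	 * has rcu_state == 0, which reads as an already-expired cookie,
	 * so the very first swap does not wait at all.
	 */
	cond_synchronize_rcu(miniq->rcu_state);

	rcu_assign_pointer(*miniqp->p_miniq, miniq);

	/* Record when the old buffer was unpublished (and kick off a
	 * grace period) so that the next swap knows when reuse is safe.
	 */
	if (miniq_old)
		miniq_old->rcu_state = start_poll_synchronize_rcu();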

However, another approach is something like this:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu_expedited();

This will be roughly an order of magnitude faster than you would see with
your current cond_synchronize_rcu().  And it will speed up all swaps,
not just the first one.  On the other hand, synchronize_rcu_expedited()
sends IPIs to all non-idle CPUs, which is often OK, but not so much for
heavy real-time workloads.  So you can let the heavy real-time workloads
have slowish Qdisc switching and IPI the IPI-insensitive workloads:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		cond_synchronize_rcu(oldstate);
	else if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu_expedited();

Yet another approach is this:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	while (!poll_state_synchronize_rcu(oldstate))
		schedule_timeout_idle(1);

The advantage of this one is that you get credit for partial progress
through grace periods, but it won't be anywhere near as fast as
synchronize_rcu_expedited().
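
In mini_qdisc_pair_swap() terms, the "// do something" slot in each of
the variants above is the window between unpublishing the old buffer and
needing to reuse it, roughly (again a sketch under the same assumed names):

	/* Unpublish the old buffer, then start a grace period covering
	 * any readers that may still be using it.
	 */
	rcu_assign_pointer(*miniqp->p_miniq, miniq_new);
	oldstate = start_poll_synchronize_rcu();

	/* Other restore work can overlap the grace period here. */

	/* Just before the old buffer must be reused: */
	if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu_expedited();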

							Thanx, Paul


