On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> Hi Paul,

Hello, Seth! Adding the rcu list on CC in case others are seeing
similar issues.

> You got some questions the other day from (I believe) Mathieu
> Desnoyers, whom a colleague of mine had reached out to about this
> problem. Thanks a lot for your suggestions! I was headed in the
> right direction, but I wouldn't have gotten there as quickly, and
> I'm not sure I would have known to use start_poll_synchronize_rcu().
> I do have a question though, so I thought I'd reach out to you about
> it while I continue to test the patch.
>
> To refresh your memory, or fill in bits that might not have made it
> to you: we're seeing a lot of time spent in rcu_barrier() from
> mini_qdisc_pair_swap() when we are trying to (quickly) restore a
> bunch of network connections for VMs. I'm trying to make this go
> faster.
>
> My patch is pasted below, which speeds things up drastically, by a
> factor of 10,000 or so in most cases. My question is about
> initializing mini_Qdisc.rcu_state. On my first pass I took a
> cautious approach and initialized it from get_state_synchronize_rcu()
> when the structure is first initialized. That patch didn't really
> help. I suspected this was because the time between init and calling
> mini_qdisc_pair_swap() was short enough that a grace period had not
> elapsed, so in the patch below rcu_state is initialized to 0 (due to
> the fact that the memory is kzalloc-ed). I think this should be
> safe, based on my understanding of how the cookie works and the fact
> that no other task yet has a reference to either buffer. Is this
> correct?

Let me see if I understand your reasoning. Right after creating the
mini Qdisc, someone might be accessing the current buffer, but no one
ever was accessing the non-current buffer. So on the initial swap
(and -only- on the initial swap), there are no readers to wait for.

Is that your line of reasoning? If so, sort-of OK, but let's talk
about more organized ways of handling this.

However, another approach is something like this:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu_expedited();

This will be roughly an order of magnitude faster than you would see
with your current cond_synchronize_rcu(). And it will speed up all
swaps, not just the first one.

On the other hand, synchronize_rcu_expedited() sends IPIs to all
non-idle CPUs, which is often OK, but not so much for heavy real-time
workloads. So you can let the heavy real-time workloads have slowish
Qdisc switching and IPI the IPI-insensitive workloads:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		cond_synchronize_rcu(oldstate);
	else if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu_expedited();

Yet another approach is this:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	while (!poll_state_synchronize_rcu(oldstate))
		schedule_timeout_idle(1);

The advantage of this one is that you get credit for partial progress
through grace periods, but it won't be anywhere near as fast as
synchronize_rcu_expedited().

							Thanx, Paul
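
P.S. For concreteness, a minimal sketch of the cookie-based swap
pattern discussed above. This is illustrative only, not the actual
patch: the structures below are simplified stand-ins for the real ones
in include/net/sch_generic.h, and the rcu_state field is assumed from
the description above.

	/* Sketch only: simplified stand-ins for the real structures. */
	struct mini_Qdisc {
		struct tcf_proto *filter_list;
		unsigned long rcu_state; /* poll cookie; 0 reads as expired */
	};

	struct mini_Qdisc_pair {
		struct mini_Qdisc miniq1;
		struct mini_Qdisc miniq2;
		struct mini_Qdisc __rcu **p_miniq;
	};

	static void miniq_swap_sketch(struct mini_Qdisc_pair *miniqp,
				      struct tcf_proto *tp_head)
	{
		struct mini_Qdisc *miniq_old =
			rcu_dereference_protected(*miniqp->p_miniq, 1);
		struct mini_Qdisc *miniq;

		miniq = (!miniq_old || miniq_old == &miniqp->miniq2) ?
			&miniqp->miniq1 : &miniqp->miniq2;

		/* Wait only if readers might still reference the buffer
		 * we are about to reuse.  Because the pair is kzalloc-ed,
		 * a zero cookie is treated as an already-expired grace
		 * period, so the very first swap does not block.
		 */
		cond_synchronize_rcu(miniq->rcu_state);

		miniq->filter_list = tp_head;
		rcu_assign_pointer(*miniqp->p_miniq, miniq);

		if (miniq_old)
			/* Start a grace period now and record its cookie,
			 * so that the next swap back to miniq_old waits
			 * only for whatever portion of that grace period
			 * has not already elapsed.
			 */
			miniq_old->rcu_state = start_poll_synchronize_rcu();
	}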