On Thu, Oct 21, 2021 at 10:55:37AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 21, 2021 at 09:15:22AM -0500, Seth Forshee wrote:
> > On Wed, Oct 20, 2021 at 03:44:30PM -0700, Paul E. McKenney wrote:
> > > On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> > > > Hi Paul,
> > >
> > > Hello, Seth!
> > >
> > > Adding the rcu list on CC in case others are seeing similar issues.
> > >
> > > > You got some questions the other day from (I believe) Mathieu
> > > > Desnoyers, whom a colleague of mine had reached out to about this
> > > > problem. Thanks a lot for your suggestions! I was headed in the
> > > > right direction, but I wouldn't have got there as quickly, and I'm
> > > > not sure I would have known to use start_poll_synchronize_rcu().
> > > > I do have a question though, so I thought I'd reach out to you
> > > > about it while I continue to test the patch.
> > > >
> > > > To refresh your memory, or fill in bits that might not have made
> > > > it to you: we're seeing a lot of time spent in rcu_barrier() from
> > > > mini_qdisc_pair_swap() when we are trying to (quickly) restore a
> > > > bunch of network connections for VMs. I'm trying to make this go
> > > > faster.
> > > >
> > > > My patch is pasted below, which speeds things up drastically, by
> > > > a factor of 10,000 or so in most cases. My question is about
> > > > initializing mini_Qdisc.rcu_state. On my first pass I took a
> > > > cautious approach and initialized it from
> > > > get_state_synchronize_rcu() when the structure is first
> > > > initialized. That patch didn't really help. I suspected this was
> > > > because the time between init and calling mini_qdisc_pair_swap()
> > > > was short enough that a grace period had not elapsed, so in the
> > > > patch below rcu_state is initialized to 0 (due to the fact that
> > > > the memory is kzalloc-ed). I think this should be safe, based on
> > > > my understanding of how the cookie works and the fact that no
> > > > other task yet has a reference to either buffer. Is this correct?
> > >
> > > Let me see if I understand your reasoning. Right after creating the
> > > mini Qdisc, someone might be accessing the current buffer, but no
> > > one ever was accessing the non-current buffer. So on the initial
> > > swap (and -only- on the initial swap), there are no readers to wait
> > > for. Is that your line of reasoning? If so, sort-of OK, but let's
> > > talk about more organized ways of handling this.
> >
> > That's almost right, just missing a couple of details. Initially
> > neither buffer (let's call them miniq1 and miniq2 for clarity) is
> > active, and *p_miniq is NULL until the first mini_qdisc_pair_swap()
> > call, so there can be no readers for either. On the first swap miniq1
> > becomes active, and there are no readers to wait for. On the second
> > swap there are no readers of miniq2 to wait for, and at this time
> > miniq1->rcu_state will be set. On the third swap we might finally
> > have readers of miniq1 to wait for.
> >
> > Also, this function is a little different from the way your examples
> > are structured in that oldstate is saved at the end of the function,
> > and the wait on oldstate happens in a subsequent call to the
> > function. So it's more like:
> >
> > oldstate = start_poll_synchronize_rcu();
> > return;
> >
> > // Some unknown amount of time passes ...
> >
> > cond_synchronize_rcu(oldstate);
> >
> > So I suspect that it wouldn't be uncommon for enough time to have
> > passed that there are no readers to wait for.
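To make the shape of that concrete, the swap path in my patch ends up
looking roughly like the sketch below. I've simplified it by hand for
this mail (in particular the !tp_head teardown path and the stats
pointers are elided), so please treat it as an illustration of the
pattern rather than the exact diff:

/* Sketch of mini_qdisc_pair_swap() (net/sched/sch_generic.c) using the
 * deferred-cookie approach described above; simplified for this mail,
 * not the exact patch.  Types come from include/net/sch_generic.h and
 * the polling APIs from include/linux/rcupdate.h.
 */
void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
                          struct tcf_proto *tp_head)
{
        struct mini_Qdisc *miniq_old =
                rcu_dereference_protected(*miniqp->p_miniq, 1);
        struct mini_Qdisc *miniq;

        if (!tp_head) {
                RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
        } else {
                /* Pick the buffer readers cannot currently see. */
                miniq = miniq_old != &miniqp->miniq1 ?
                        &miniqp->miniq1 : &miniqp->miniq2;

                /* rcu_state is 0 from kzalloc() until this buffer has
                 * been retired at least once, so the first swaps onto
                 * each buffer don't wait; later swaps wait only if the
                 * grace period recorded below hasn't elapsed yet.
                 */
                cond_synchronize_rcu(miniq->rcu_state);

                miniq->tp_head = tp_head;
                rcu_assign_pointer(*miniqp->p_miniq, miniq);
        }

        if (miniq_old)
                /* Retire the old buffer: record a cookie that the next
                 * swap back onto this buffer will check before reuse.
                 */
                miniq_old->rcu_state = start_poll_synchronize_rcu();
}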
> >
> > But that brings us to an important point. I'm not a networking
> > expert and I don't know what common patterns are for these swaps, so
> > I'm probably ill-equipped to say what makes the most sense beyond
> > the specific scenario I'm trying to improve. These swaps will happen
> > when setting up QoS on a network interface, so if I had to guess I'd
> > think the most common pattern would be short bursts of activity when
> > interfaces are set up. I know the pattern in the use case I've been
> > debugging, which sees a lot of "first swaps" for different instances
> > of mini_Qdisc_pair, and a smaller number of "second swaps" about
> > 200 ms after the first, which means we're paying a high price for
> > the rcu_barrier() when there can be no readers. Other scenarios
> > might result in multiple swaps on the same instance within a short
> > time period, but I can't say for sure.
>
> Very good! We can pool our ignorance of networking! ;-)
>
> If the common case really is mostly just the one swap, I can easily
> give you an API that hands you an already-expired "oldstate". Full
> disclosure: On 32-bit systems, after 2^31 grace periods (maybe as
> quickly as three months of runtime), the "oldstate" would no longer
> appear to be at all old. And would take another 2^31 grace periods to
> look old again.
>
> However, that is an unusual API, so I would want to know that it was
> really useful before creating it.
>
> Plus if you start early enough at boot, the value of zero will not
> look old, just so you know. ;-)

Of course. However, when I was initializing mini_Qdisc.rcu_state from
get_state_synchronize_rcu() along with cond_synchronize_rcu(), I was
still seeing many calls (50% or more) to mini_qdisc_pair_swap() take
tens of milliseconds in my use case. When I initialized rcu_state to 0,
I saw virtually no times over 1 ms. Trading a high probability of a
long delay for a low one seems like an improvement to me. :-)

I think the new API is probably unnecessary though. I have some notes
from testing the other two approaches below.

> > > However, another approach is something like this:
> > >
> > > oldstate = start_poll_synchronize_rcu();
> > >
> > > // do something.
> > >
> > > if (!poll_state_synchronize_rcu(oldstate))
> > >         synchronize_rcu_expedited();
> > >
> > > This will be roughly an order of magnitude faster than you would
> > > see with your current cond_synchronize_rcu(). And it will speed up
> > > all swaps, not just the first one. On the other hand,
> > > synchronize_rcu_expedited() sends IPIs to all non-idle CPUs, which
> > > is often OK, but not so much for heavy real-time workloads.
> > >
> > > So you can let the heavy real-time workloads have slowish Qdisc
> > > switching and IPI the IPI-insensitive workloads:
> > >
> > > oldstate = start_poll_synchronize_rcu();
> > >
> > > // do something.
> > >
> > > if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > >         cond_synchronize_rcu(oldstate);
> > > else if (!poll_state_synchronize_rcu(oldstate))
> > >         synchronize_rcu_expedited();
> >
> > I was concerned about the system-wide impact synchronize_rcu_expedited()
> > has, but this at least avoids doing it if it's unnecessary. If it's
> > bad form to leave mini_Qdisc.rcu_state initialized to 0, this option
> > is preferable from my perspective.
>
> This might well be the best approach if multiple swaps are at all
> common.
>
> But if needed, we really could do both: Make initial swaps really
> fast, and improve the performance of later swaps.
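For reference, this is roughly the shape of what I tested: your
poll-then-expedite suggestion dropped in where the cond_synchronize_rcu()
call sits in the sketch earlier in this mail. Hand-written here rather
than pasted from the actual diff, so again just a sketch:

/* Inside mini_qdisc_pair_swap(), replacing the bare
 * cond_synchronize_rcu() call from the earlier sketch; the
 * start_poll_synchronize_rcu() cookie is still recorded when the old
 * buffer is retired.
 */
if (!poll_state_synchronize_rcu(miniq->rcu_state)) {
        /* The grace period recorded when this buffer was last retired
         * has not completed.  Expedite it rather than waiting out a
         * full normal grace period, except on PREEMPT_RT where the
         * IPIs sent by synchronize_rcu_expedited() are unwelcome.
         */
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
                cond_synchronize_rcu(miniq->rcu_state);
        else
                synchronize_rcu_expedited();
}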
Using this, the time for mini_qdisc_pair_swap() to complete was similar
to my earlier patch with mini_Qdisc.rcu_state initialized to 0. So this
seems acceptable for my purposes.

> > > Yet another approach is this:
> > >
> > > oldstate = start_poll_synchronize_rcu();
> > >
> > > // do something.
> > >
> > > while (!poll_state_synchronize_rcu(oldstate))
> > >         schedule_timeout_idle(1);
> > >
> > > The advantage of this one is that you get credit for partial
> > > progress through grace periods, but it won't be anywhere near as
> > > fast as synchronize_rcu_expedited().
> >
> > If we need to initialize mini_Qdisc.rcu_state up front, this
> > approach probably won't improve things much for my problem.
>
> It all depends on how long between swaps. ;-)

This approach helps, but I still see about 50% of the
mini_qdisc_pair_swap() calls taking tens of milliseconds. This is
similar to what I saw using cond_synchronize_rcu() with rcu_state
initialized from get_state_synchronize_rcu(). Wouldn't this also behave
very poorly in the wrapping case you mentioned above?

In the end, I'm happy to move forward using synchronize_rcu_expedited().
I'm going to test a bit more, and I should have a patch out sometime
tomorrow.

Thanks,
Seth