On Thu, Aug 01, 2019 at 03:50:20PM -0700, Paul E. McKenney wrote:
> Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
> are divided into groups.  The first rcuo kthread to come online in a
> given group is that group's leader, and the leader both waits for grace
> periods and invokes its CPU's callbacks.  The non-leader rcuo kthreads
> only invoke callbacks.
>
> This works well in the real-time/embedded environments for which it was
> intended because such environments tend not to generate all that many
> callbacks.  However, given huge floods of callbacks, it is possible for
> the leader kthread to be stuck invoking callbacks while its followers
> wait helplessly while their callbacks pile up.  This is a good recipe
> for an OOM, and rcutorture's new callback-flood capability does generate
> such OOMs.
>
> One strategy would be to wait until such OOMs start happening in
> production, but similar OOMs have in fact happened starting in 2018.
> It would therefore be wise to take a more proactive approach.

I haven't looked much into the nocbs/nohz_full stuff (yet). In particular,
I did not even know that the rcuo threads do grace-period life-cycle
management and waiting; I thought only the RCU GP thread did :-/. However,
it seems this is a completely separate grace-period management state
machine outside of the RCU GP thread, right?

I was wondering, for this patch, could we also just have the rcuo leader
handle both callback execution and waking the other non-leader threads at
the same time? So like: execute a few callbacks, then do the wakeup of the
non-leaders so they can execute their callbacks, then get back to executing
its own callbacks, and so on. That way we would not need a separate rcuog
thread to wait for grace periods. Would that not work? (Rough, untested
sketch of what I mean at [1] at the end of this mail.)

If you don't mind, could you share with me a kvm.sh command (with the
config, boot parameters, etc.) that can produce the OOM without this
patch? I'd like to take a closer look at it.

Is there also a short answer for why the RCU GP thread cannot do the job
of these new rcuog threads?

thanks a lot,

 - Joel

> This commit therefore features per-CPU rcuo kthreads that do nothing
> but invoke callbacks.  Instead of having one of these kthreads act as
> leader, each group has a separate rcuog kthread that handles grace
> periods for its group.  Because these rcuog kthreads do not invoke
> callbacks, callback floods on one CPU no longer block callbacks from
> reaching the rcuc callback-invocation kthreads on other CPUs.
>
> This change does introduce additional kthreads, however:
>
> 1.	The number of additional kthreads is about the square root of
>	the number of CPUs, so that a 4096-CPU system would have only
>	about 64 additional kthreads.  Note that recent changes
>	decreased the number of rcuo kthreads by a factor of two
>	(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
>	this still represents a significant improvement on most systems.
>
> 2.	The leading "rcuo" of the rcuog kthreads should allow existing
>	scripting to affinity these additional kthreads as needed, the
>	same as for the rcuop and rcuos kthreads.  (There are no longer
>	any rcuob kthreads.)
>
> 3.	A state-machine approach was considered and rejected.  Although
>	this would allow the rcuo kthreads to continue their dual
>	leader/follower roles, it complicates callback invocation
>	and makes it more difficult to consolidate rcuo callback
>	invocation with existing softirq callback invocation.
>
> The introduction of rcuog kthreads should thus be acceptable.
>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> ---
>  kernel/rcu/tree.h        |   6 +-
>  kernel/rcu/tree_plugin.h | 115 +++++++++++++++++++--------------------
>  2 files changed, 61 insertions(+), 60 deletions(-)
>
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 32b3348d3a4d..dc3c53cb9608 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -200,8 +200,8 @@ struct rcu_data {
>  	atomic_long_t nocb_q_count_lazy; /* invocation (all stages). */
>  	struct rcu_head *nocb_cb_head;	/* CBs ready to invoke. */
>  	struct rcu_head **nocb_cb_tail;
> -	struct swait_queue_head nocb_wq; /* For nocb kthreads to sleep on. */
> -	struct task_struct *nocb_cb_kthread;
> +	struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */
> +	struct task_struct *nocb_gp_kthread;
>  	raw_spinlock_t nocb_lock;	/* Guard following pair of fields. */
>  	int nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
>  	struct timer_list nocb_timer;	/* Enforce finite deferral. */
> @@ -211,6 +211,8 @@ struct rcu_data {
>  					/* CBs waiting for GP. */
>  	struct rcu_head **nocb_gp_tail;
>  	bool nocb_gp_sleep;		/* Is the nocb GP thread asleep? */
> +	struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
> +	struct task_struct *nocb_cb_kthread;
>  	struct rcu_data *nocb_next_cb_rdp;
>  					/* Next rcu_data in wakeup chain. */
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 5a72700c3a32..c3b6493313ab 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1531,7 +1531,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force,
>  	struct rcu_data *rdp_leader = rdp->nocb_gp_rdp;
>
>  	lockdep_assert_held(&rdp->nocb_lock);
> -	if (!READ_ONCE(rdp_leader->nocb_cb_kthread)) {
> +	if (!READ_ONCE(rdp_leader->nocb_gp_kthread)) {
>  		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
>  		return;
>  	}
> @@ -1541,7 +1541,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force,
>  		del_timer(&rdp->nocb_timer);
>  		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
>  		smp_mb(); /* ->nocb_gp_sleep before swake_up_one(). */
> -		swake_up_one(&rdp_leader->nocb_wq);
> +		swake_up_one(&rdp_leader->nocb_gp_wq);
>  	} else {
>  		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
>  	}
> @@ -1646,7 +1646,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
>  	smp_mb__after_atomic(); /* Store *old_rhpp before _wake test. */
>
>  	/* If we are not being polled and there is a kthread, awaken it ... */
> -	t = READ_ONCE(rdp->nocb_cb_kthread);
> +	t = READ_ONCE(rdp->nocb_gp_kthread);
>  	if (rcu_nocb_poll || !t) {
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  				    TPS("WakeNotPoll"));
> @@ -1786,7 +1786,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
>   * No-CBs GP kthreads come here to wait for additional callbacks to show up.
>   * This function does not return until callbacks appear.
>   */
> -static void nocb_leader_wait(struct rcu_data *my_rdp)
> +static void nocb_gp_wait(struct rcu_data *my_rdp)
>  {
>  	bool firsttime = true;
>  	unsigned long flags;
> @@ -1794,12 +1794,10 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>  	struct rcu_data *rdp;
>  	struct rcu_head **tail;
>
> -wait_again:
> -
>  	/* Wait for callbacks to appear. */
>  	if (!rcu_nocb_poll) {
>  		trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, TPS("Sleep"));
> -		swait_event_interruptible_exclusive(my_rdp->nocb_wq,
> +		swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq,
>  				!READ_ONCE(my_rdp->nocb_gp_sleep));
>  		raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags);
>  		my_rdp->nocb_gp_sleep = true;
> @@ -1838,7 +1836,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>  			trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu,
>  					    TPS("WokeEmpty"));
>  		}
> -		goto wait_again;
> +		return;
>  	}
>
>  	/* Wait for one grace period. */
> @@ -1862,34 +1860,47 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>  		rdp->nocb_cb_tail = rdp->nocb_gp_tail;
>  		*tail = rdp->nocb_gp_head;
>  		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
> -		if (rdp != my_rdp && tail == &rdp->nocb_cb_head) {
> +		if (tail == &rdp->nocb_cb_head) {
>  			/* List was empty, so wake up the kthread. */
> -			swake_up_one(&rdp->nocb_wq);
> +			swake_up_one(&rdp->nocb_cb_wq);
>  		}
>  	}
> +}
>
> -	/* If we (the GP kthreads) don't have CBs, go wait some more. */
> -	if (!my_rdp->nocb_cb_head)
> -		goto wait_again;
> +/*
> + * No-CBs grace-period-wait kthread.  There is one of these per group
> + * of CPUs, but only once at least one CPU in that group has come online
> + * at least once since boot.  This kthread checks for newly posted
> + * callbacks from any of the CPUs it is responsible for, waits for a
> + * grace period, then awakens all of the rcu_nocb_cb_kthread() instances
> + * that then have callback-invocation work to do.
> + */
> +static int rcu_nocb_gp_kthread(void *arg)
> +{
> +	struct rcu_data *rdp = arg;
> +
> +	for (;;)
> +		nocb_gp_wait(rdp);
> +	return 0;
>  }
>
>  /*
>   * No-CBs CB kthreads come here to wait for additional callbacks to show up.
> - * This function does not return until callbacks appear.
> + * This function returns true ("keep waiting") until callbacks appear and
> + * then false ("stop waiting") when callbacks finally do appear.
>   */
> -static void nocb_follower_wait(struct rcu_data *rdp)
> +static bool nocb_follower_wait(struct rcu_data *rdp)
>  {
> -	for (;;) {
> -		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep"));
> -		swait_event_interruptible_exclusive(rdp->nocb_wq,
> -					 READ_ONCE(rdp->nocb_cb_head));
> -		if (smp_load_acquire(&rdp->nocb_cb_head)) {
> -			/* ^^^ Ensure CB invocation follows _head test. */
> -			return;
> -		}
> -		WARN_ON(signal_pending(current));
> -		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty"));
> +	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep"));
> +	swait_event_interruptible_exclusive(rdp->nocb_cb_wq,
> +					    READ_ONCE(rdp->nocb_cb_head));
> +	if (smp_load_acquire(&rdp->nocb_cb_head)) { /* VVV */
> +		/* ^^^ Ensure CB invocation follows _head test. */
> +		return false;
>  	}
> +	WARN_ON(signal_pending(current));
> +	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty"));
> +	return true;
>  }
>
>  /*
> @@ -1899,7 +1910,7 @@ static void nocb_follower_wait(struct rcu_data *rdp)
>   * have to do quite so many wakeups (as in they only need to wake the
>   * no-CBs GP kthreads, not the CB kthreads).
>   */
> -static int rcu_nocb_kthread(void *arg)
> +static int rcu_nocb_cb_kthread(void *arg)
>  {
>  	int c, cl;
>  	unsigned long flags;
> @@ -1911,10 +1922,8 @@ static int rcu_nocb_kthread(void *arg)
>  	/* Each pass through this loop invokes one batch of callbacks */
>  	for (;;) {
>  		/* Wait for callbacks. */
> -		if (rdp->nocb_gp_rdp == rdp)
> -			nocb_leader_wait(rdp);
> -		else
> -			nocb_follower_wait(rdp);
> +		while (nocb_follower_wait(rdp))
> +			continue;
>
>  		/* Pull the ready-to-invoke callbacks onto local list. */
>  		raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
> @@ -2048,7 +2057,8 @@ void __init rcu_init_nohz(void)
>  static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>  {
>  	rdp->nocb_tail = &rdp->nocb_head;
> -	init_swait_queue_head(&rdp->nocb_wq);
> +	init_swait_queue_head(&rdp->nocb_cb_wq);
> +	init_swait_queue_head(&rdp->nocb_gp_wq);
>  	rdp->nocb_cb_tail = &rdp->nocb_cb_head;
>  	raw_spin_lock_init(&rdp->nocb_lock);
>  	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
> @@ -2056,50 +2066,39 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>
>  /*
>   * If the specified CPU is a no-CBs CPU that does not already have its
> - * rcuo kthread, spawn it.  If the CPUs are brought online out of order,
> - * this can require re-organizing the GP-CB relationships.
> + * rcuo CB kthread, spawn it.  Additionally, if the rcuo GP kthread
> + * for this CPU's group has not yet been created, spawn it as well.
>   */
>  static void rcu_spawn_one_nocb_kthread(int cpu)
>  {
> -	struct rcu_data *rdp;
> -	struct rcu_data *rdp_last;
> -	struct rcu_data *rdp_old_leader;
> -	struct rcu_data *rdp_spawn = per_cpu_ptr(&rcu_data, cpu);
> +	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> +	struct rcu_data *rdp_gp;
>  	struct task_struct *t;
>
>  	/*
>  	 * If this isn't a no-CBs CPU or if it already has an rcuo kthread,
>  	 * then nothing to do.
>  	 */
> -	if (!rcu_is_nocb_cpu(cpu) || rdp_spawn->nocb_cb_kthread)
> +	if (!rcu_is_nocb_cpu(cpu) || rdp->nocb_cb_kthread)
>  		return;
>
>  	/* If we didn't spawn the GP kthread first, reorganize! */
> -	rdp_old_leader = rdp_spawn->nocb_gp_rdp;
> -	if (rdp_old_leader != rdp_spawn && !rdp_old_leader->nocb_cb_kthread) {
> -		rdp_last = NULL;
> -		rdp = rdp_old_leader;
> -		do {
> -			rdp->nocb_gp_rdp = rdp_spawn;
> -			if (rdp_last && rdp != rdp_spawn)
> -				rdp_last->nocb_next_cb_rdp = rdp;
> -			if (rdp == rdp_spawn) {
> -				rdp = rdp->nocb_next_cb_rdp;
> -			} else {
> -				rdp_last = rdp;
> -				rdp = rdp->nocb_next_cb_rdp;
> -				rdp_last->nocb_next_cb_rdp = NULL;
> -			}
> -		} while (rdp);
> -		rdp_spawn->nocb_next_cb_rdp = rdp_old_leader;
> +	rdp_gp = rdp->nocb_gp_rdp;
> +	if (!rdp_gp->nocb_gp_kthread) {
> +		t = kthread_run(rcu_nocb_gp_kthread, rdp_gp,
> +				"rcuog/%d", rdp_gp->cpu);
> +		if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo GP kthread, OOM is now expected behavior\n", __func__))
> +			return;
> +		WRITE_ONCE(rdp_gp->nocb_gp_kthread, t);
>  	}
>
>  	/* Spawn the kthread for this CPU. */
> -	t = kthread_run(rcu_nocb_kthread, rdp_spawn,
> +	t = kthread_run(rcu_nocb_cb_kthread, rdp,
>  			"rcuo%c/%d", rcu_state.abbr, cpu);
> -	if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo kthread, OOM is now expected behavior\n", __func__))
> +	if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo CB kthread, OOM is now expected behavior\n", __func__))
>  		return;
> -	WRITE_ONCE(rdp_spawn->nocb_cb_kthread, t);
> +	WRITE_ONCE(rdp->nocb_cb_kthread, t);
> +	WRITE_ONCE(rdp->nocb_gp_kthread, rdp_gp->nocb_gp_kthread);
>  }
>
>  /*
> --
> 2.17.1
>
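
[1] Rough, completely untested sketch of the interleaving I was asking
about above, just so the question is concrete. rcu_nocb_invoke_batch() is
a made-up helper standing in for "invoke a bounded batch of this CPU's
ready callbacks"; the other names are the fields and helpers already used
in this patch:

static void nocb_leader_interleave(struct rcu_data *my_rdp)
{
	struct rcu_data *rdp;

	do {
		/* Invoke a bounded batch of the leader's own callbacks. */
		if (READ_ONCE(my_rdp->nocb_cb_head))
			rcu_nocb_invoke_batch(my_rdp);

		/* Then poke any non-leaders that have callbacks queued. */
		for (rdp = my_rdp->nocb_next_cb_rdp; rdp;
		     rdp = rdp->nocb_next_cb_rdp)
			if (READ_ONCE(rdp->nocb_cb_head))
				swake_up_one(&rdp->nocb_cb_wq);

		/* ... then back to the leader's own callbacks, and so on. */
	} while (READ_ONCE(my_rdp->nocb_cb_head));
}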