Hi Paul,

On Thu, May 25, 2017 at 02:59:18PM -0700, Paul E. McKenney wrote:
> Wait/wakeup operations do not guarantee ordering on their own. Instead,
> either locking or memory barriers are required. This commit therefore
> adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> ---
>  kernel/rcu/tree_plugin.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0b1042545116..573fbe9640a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
>  	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
>  		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
>  		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> +		smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
>  		swake_up(&rdp_leader->nocb_wq);
>  	}
>  }
> @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>  	 * nocb_gp_head, where they await a grace period.
>  	 */
>  	gotcbs = false;
> +	smp_mb(); /* wakeup before ->nocb_head reads. */
>  	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
>  		rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
>  		if (!rdp->nocb_gp_head)

May I impose upon you to CC this patch to stable and tag it as fixing
abedf8e241?  I ran into this on a production 4.9 branch.  When I debugged
it, I discovered that it goes all the way back to 4.6.

The tl;dr is that, at least in some environments, the missed wakeup
manifests itself as a series of hung-task warnings on the console, and if
I'm unlucky it can also generate a hang that blocks interactive logins
via ssh.

In case it's useful, I'm attaching the RCA that I sent out before I
circulated a similar patch for internal review.  You beat me to the punch
in terms of getting a patch out to a public mailing list.

--- >8 cut here 8< ---

Production boxes go through fits where they generate spurts of hung-task
warnings.  Sometimes the slowdown is enough to ensnare a process holding
the rtnl_lock, or to prevent systemd from letting ssh logins on to the
box.  Below is a splat from a relatively benign occurrence:

INFO: task uc-spawn:34443 blocked for more than 120 seconds.
      Not tainted 4.9.4-1.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
uc-spawn        D    0 34443      1 0x00000082
 ffff883fec88e4c0 0000000000000000 ffff883ff1b4ac00 ffff883ffddd9540
 ffff883fee999600 ffffc9001c907c30 ffffffff8175e3f6 0000000000000202
 ffff883fee999600 7fffffffffffffff ffff883fee999600 7fffffffffffffff
Call Trace:
 [<ffffffff8175e3f6>] ? __schedule+0x1a6/0x650
 [<ffffffff8175e8d6>] schedule+0x36/0x80
 [<ffffffff81761aec>] schedule_timeout+0x21c/0x3a0
 [<ffffffff8175f3b2>] wait_for_completion+0xf2/0x130
 [<ffffffff810b6710>] ? wake_up_q+0x80/0x80
 [<ffffffff810eec27>] __wait_rcu_gp+0xd7/0x100
 [<ffffffff810f264e>] synchronize_sched+0x5e/0x80
 [<ffffffff810f1da0>] ? __call_rcu+0x320/0x320
 [<ffffffff810eeb30>] ? trace_raw_output_rcu_utilization+0x60/0x60
 [<ffffffff8125673b>] kern_unmount+0x2b/0x40
 [<ffffffff812d6f85>] mq_put_mnt+0x15/0x20
 [<ffffffff812d76de>] put_ipc_ns+0x3e/0xa0
 [<ffffffff810ab768>] free_nsproxy+0x38/0x90
 [<ffffffff810ab8c9>] switch_task_namespaces+0x59/0x60
 [<ffffffff810ab8e0>] exit_task_namespaces+0x10/0x20
 [<ffffffff8108d787>] do_exit+0x2b7/0xac0
 [<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
 [<ffffffff8108e00f>] do_group_exit+0x3f/0xb0
 [<ffffffff8108e094>] SyS_exit_group+0x14/0x20
 [<ffffffff81003a47>] do_syscall_64+0x67/0x180
 [<ffffffff8176306b>] entry_SYSCALL64_slow_path+0x25/0x25

In order to debug this further, a kdump was obtained.

crash> ps 34443
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
 34443       1  33  ffff883fee999600  UN   0.0       0      0  uc-spawn

crash> bt 34443
PID: 34443  TASK: ffff883fee999600  CPU: 33  COMMAND: "uc-spawn"
 #0 [ffffc9001c907bb8] __schedule at ffffffff8175e3f6
 #1 [ffffc9001c907c38] schedule at ffffffff8175e8d6
 #2 [ffffc9001c907c50] schedule_timeout at ffffffff81761aec
 #3 [ffffc9001c907cf0] wait_for_completion at ffffffff8175f3b2
 #4 [ffffc9001c907d50] __wait_rcu_gp at ffffffff810eec27
 #5 [ffffc9001c907da0] synchronize_sched at ffffffff810f264e
 #6 [ffffc9001c907df0] kern_unmount at ffffffff8125673b
 #7 [ffffc9001c907e08] mq_put_mnt at ffffffff812d6f85
 #8 [ffffc9001c907e18] put_ipc_ns at ffffffff812d76de
 #9 [ffffc9001c907e30] free_nsproxy at ffffffff810ab768
#10 [ffffc9001c907e48] switch_task_namespaces at ffffffff810ab8c9
#11 [ffffc9001c907e70] exit_task_namespaces at ffffffff810ab8e0
#12 [ffffc9001c907e80] do_exit at ffffffff8108d787
#13 [ffffc9001c907ee8] do_group_exit at ffffffff8108e00f
#14 [ffffc9001c907f18] sys_exit_group at ffffffff8108e094
#15 [ffffc9001c907f28] do_syscall_64 at ffffffff81003a47
    RIP: 00007f6b6e1f4209  RSP: 00007ffc488028b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000001  RCX: 00007f6b6e1f4209
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000001
    RBP: 00007f6b6e4ea840   R8: 000000000000003c   R9: 00000000000000e7
    R10: fffffffffffffe90  R11: 0000000000000246  R12: 00007f6b6e4ea840
    R13: 00007f6b6e4efe80  R14: 00007ffc48802918  R15: 00007ffc48802998
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

This particular process is blocked in its exit path, waiting for the
synchronize_sched() call it invoked to complete.  This function queues a
call_rcu_sched() invocation with a callback that triggers a wakeup of the
invoking process.  In order for this process to become unblocked, a grace
period must elapse and the callback must be called.

Typically, bugs like this imply that some processor failed to yield, or
that some thread is holding rcu_read_lock_sched() and hence preventing a
grace period from occurring.  However, after diligent inspection I can
find no evidence of such a thing.  After dumping all stacks, no threads
are in call frames that hold this type of read-side critical section.
All CPUs appear to be mostly idle and responding to interrupt events.
The RCU threads themselves are not blocked waiting for completion of a
grace period either.

A quick look at rcu_sched_state shows nothing obviously amiss:

crash> p rcu_sched_state
rcu_sched_state = $1 = {
  node = {{
      lock = { raw_lock = { val = { counter = 0 } } },
      gpnum = 22469944,
      completed = 22469944,
      qsmask = 0,
      qsmaskinit = 3,
      qsmaskinitnext = 0,
      expmask = 0,
      expmaskinit = 3,
      expmaskinitnext = 0,
      grpmask = 0,
      grplo = 0,
      grphi = 71,
      grpnum = 0 '\000',
      level = 0 '\000',
      wait_blkd_tasks = false,
      parent = 0x0,
      blkd_tasks = {
        next = 0xffffffff81c70428 <rcu_sched_state+104>,
        prev = 0xffffffff81c70428 <rcu_sched_state+104>
      },
      gp_tasks = 0x0,
      exp_tasks = 0x0,
      boost_tasks = 0x0,
<...>

Note that according to this, the most recent grace period to complete was
numbered 22469944.  Now here's the rcu_data for CPU 33, where the process
got stuck.  It's at least 15 grace periods behind.
crash> struct rcu_data 0xffff883ffddda380
struct rcu_data {
  completed = 22469929,
  gpnum = 22469930,
  rcu_qs_ctr_snap = 18203094,
  cpu_no_qs = {
    b = {
      norm = 0 '\000',
      exp = 0 '\000'
    },
    s = 0
  },
  core_needs_qs = true,
  beenonline = true,
  gpwrap = false,
  mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
  grpmask = 8589934592,
  ticks_this_gp = 0,
  nxtlist = 0x0,
  nxttail = {0x0, 0x0, 0x0, 0x0},
  nxtcompleted = {0, 0, 0, 0},
  qlen_lazy = 0,
  qlen = 0,
  qlen_last_fqs_check = 0,
  n_cbs_invoked = 0,
  n_nocbs_invoked = 10626703,
  n_cbs_orphaned = 0,
  n_cbs_adopted = 0,
  n_force_qs_snap = 0,
  blimit = 10,
  dynticks = 0xffff883ffddcffd0,
  dynticks_snap = 28482762,
  dynticks_fqs = 19371,
  offline_fqs = 0,
  cond_resched_completed = 22421984,
  n_rcu_pending = 2265627,
  n_rp_core_needs_qs = 31314,
  n_rp_report_qs = 595411,
  n_rp_cb_ready = 0,
  n_rp_cpu_needs_gp = 757,
  n_rp_gp_completed = 367886,
  n_rp_gp_started = 38620,
  n_rp_nocb_defer_wakeup = 126,
  n_rp_need_nothing = 1262827,
  barrier_head = {
    next = 0x0,
    func = 0xffffffff810f0090 <rcu_barrier_callback>
  },
  exp_workdone0 = { counter = 21444 },
  exp_workdone1 = { counter = 0 },
  exp_workdone2 = { counter = 0 },
  exp_workdone3 = { counter = 0 },
  nocb_head = 0xffff881f8444a138,
  nocb_tail = 0xffff881f9f8ee8d0,
  nocb_q_count = { counter = 500 },
  nocb_q_count_lazy = { counter = 18 },
  nocb_follower_head = 0x0,
  nocb_follower_tail = 0xffff883ffddda508,
  nocb_wq = {
    lock = { raw_lock = { val = { counter = 0 } } },
    task_list = {
      next = 0xffffc900193e3e80,
      prev = 0xffffc900193e3e80
    }
  },
  nocb_kthread = 0xffff883ff0861600,
  nocb_defer_wakeup = 0,
  nocb_gp_head = 0xffff883fe18c71a0,
  nocb_gp_tail = 0xffffc9001c7a3da0,
  nocb_leader_sleep = false,
  nocb_next_follower = 0xffff883ffde1a380,
  nocb_leader = 0xffff883ffdd9a380,
  softirq_snap = 1002642,
  cpu = 33,
  rsp = 0xffffffff81c703c0 <rcu_sched_state>
}

Notice also that the nocb_q_count is 500.

With a bit of looking we can see that the blocked process has its
callback queued in both the nocb_gp_head list and the nocb_head list.
I'll spare readers all of the output, but the correct linkage was located
by walking the list entries and then grepping for the correct pointer.

crash> struct completion 0xffffc9001c907dc0
struct completion {
  done = 0,
  wait = {
    lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
    task_list = {
      next = 0xffffc9001c907d10,
      prev = 0xffffc9001c907d10
    }
  }
}

crash> struct rcu_synchronize 0xffffc9001c907db0
struct rcu_synchronize {
  head = {
    next = 0xffff883fe4d18398,
    func = 0xffffffff810eeb30 <wakeme_after_rcu>
  },
  completion = {
    done = 0,
    wait = {
      lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
      task_list = {
        next = 0xffffc9001c907d10,
        prev = 0xffffc9001c907d10
      }
    }
  }
}

This is the stack for CPU 33's nocb kthread.  It's sleeping.

crash> bt 0xffff883ff0861600
PID: 279  TASK: ffff883ff0861600  CPU: 39  COMMAND: "rcuos/33"
 #0 [ffffc900193e3db0] __schedule at ffffffff8175e3f6
 #1 [ffffc900193e3e30] schedule at ffffffff8175e8d6
 #2 [ffffc900193e3e48] rcu_nocb_kthread at ffffffff810f2a67
 #3 [ffffc900193e3ec8] kthread at ffffffff810a9e19
 #4 [ffffc900193e3f50] ret_from_fork at ffffffff81763215

Since CPU 33 is a follower in this configuration, it needs to be awoken
by a leader.  In this case, CPU 32 is the leader assigned to wake CPU
33's nocb_kthread.  It's also sleeping.
crash> bt 0xffff883ff0805800
PID: 271  TASK: ffff883ff0805800  CPU: 38  COMMAND: "rcuos/32"
 #0 [ffffc900193a3db0] __schedule at ffffffff8175e3f6
 #1 [ffffc900193a3e30] schedule at ffffffff8175e8d6
 #2 [ffffc900193a3e48] rcu_nocb_kthread at ffffffff810f2ad5
 #3 [ffffc900193a3ec8] kthread at ffffffff810a9e19
 #4 [ffffc900193a3f50] ret_from_fork at ffffffff81763215

Here's a red flag: the leader is asleep despite nocb_leader_sleep being
false.  The swait_event_interruptible() in nocb_leader_wait() is supposed
to let this thread sleep only while nocb_leader_sleep is true.

crash> struct rcu_data.nocb_leader_sleep 0xffff883ffdd9a380
  nocb_leader_sleep = false

The leader's wq contains active entries, so something is sleeping here:

crash> struct rcu_data.nocb_wq 0xffff883ffdd9a380
  nocb_wq = {
    lock = { raw_lock = { val = { counter = 0 } } },
    task_list = {
      next = 0xffffc900193a3e80,
      prev = 0xffffc900193a3e80
    }
  }
crash> struct -o rcu_data.nocb_wq 0xffff883ffdd9a380
struct rcu_data {
  [ffff883ffdd9a518] struct swait_queue_head nocb_wq;
}
crash> struct list_head 0xffffc900193a3e80
struct list_head {
  next = 0xffff883ffdd9a520,
  prev = 0xffff883ffdd9a520
}

This is the wait queue entry parked by the process that was stranded on
CPU 33 while waiting for its callback's completion.  It's also correctly
configured.

crash> struct wait_queue_t 0xffffc9001c907cf8
struct wait_queue_t {
  flags = 1,
  private = 0xffff883fee999600,
  func = 0xffffffff810b6710 <default_wake_function>,
  task_list = {
    next = 0xffffc9001c907dd0,
    prev = 0xffffc9001c907dd0
  }
}

crash> ps -m 34443
[  0 00:11:07.318] [UN]  PID: 34443  TASK: ffff883fee999600  CPU: 33  COMMAND: "uc-spawn"
crash> ps -m 0xffff883ff0861600
[  0 00:11:44.271] [IN]  PID: 279    TASK: ffff883ff0861600  CPU: 39  COMMAND: "rcuos/33"
crash> ps -m ffff883ff0805800
[  0 00:11:44.208] [IN]  PID: 271    TASK: ffff883ff0805800  CPU: 38  COMMAND: "rcuos/32"

Looking at the ps data, we can see that our processes have been stranded
for more than 11 minutes, and that the rcuos threads have been asleep for
about 37 seconds longer than pid 34443.

The reason this appears to be a transient hang is the code in
__call_rcu_nocb_enqueue().  We seem to be missing the initial wakeup, and
once that happens, we'll only send another wakeup if nocb_q_count grows
larger than qlen_last_fqs_check + qhimark.  In our case, that means we
need to accumulate 10,000 callbacks before we get another bite at this.
Since these CPUs are relatively idle, we've only accrued 500 on the
follower and ~2000 on the leader.
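For reference, here is a stripped-down sketch of that wakeup decision,
paraphrased from the 4.9-era __call_rcu_nocb_enqueue() as I recall it;
the irqs-disabled deferred-wakeup paths and tracing are omitted, so treat
it as an illustration rather than a verbatim copy of the kernel source:

static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, struct rcu_head *rhp,
				    struct rcu_head **rhtp, int rhcount,
				    int rhcount_lazy, unsigned long flags)
{
	struct rcu_head **old_rhpp;
	long len;

	/* Enqueue the callback and update the counts. */
	atomic_long_add(rhcount, &rdp->nocb_q_count);
	old_rhpp = xchg(&rdp->nocb_tail, rhtp);
	WRITE_ONCE(*old_rhpp, rhp);
	atomic_long_add(rhcount_lazy, &rdp->nocb_q_count_lazy);
	smp_mb__after_atomic();	/* Store *old_rhpp before wakeup test. */

	if (rcu_nocb_poll || !READ_ONCE(rdp->nocb_kthread))
		return;		/* Polling, or no kthread to wake. */

	len = atomic_long_read(&rdp->nocb_q_count);
	if (old_rhpp == &rdp->nocb_head) {
		/* List was empty: this is the one "initial" wakeup. */
		wake_nocb_leader(rdp, false);
		rdp->qlen_last_fqs_check = 0;
	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
		/* Otherwise, only wake again after qhimark more callbacks. */
		wake_nocb_leader(rdp, true);
		rdp->qlen_last_fqs_check = LONG_MAX / 2;
	}
	/* All other enqueues are silent. */
}

If that initial wakeup is lost, the list never drains, so the first
branch can't fire again and only the qhimark branch remains, which
matches the counts we see above.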
The leader's full state is below:

crash> struct rcu_data 0xffff883ffdd9a380
struct rcu_data {
  completed = 22469935,
  gpnum = 22469935,
  rcu_qs_ctr_snap = 17067139,
  cpu_no_qs = {
    b = {
      norm = 0 '\000',
      exp = 0 '\000'
    },
    s = 0
  },
  core_needs_qs = false,
  beenonline = true,
  gpwrap = false,
  mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
  grpmask = 4294967296,
  ticks_this_gp = 1,
  nxtlist = 0x0,
  nxttail = {0x0, 0x0, 0x0, 0x0},
  nxtcompleted = {18446744073709551526, 18446744073709551526, 18446744073709551526, 0},
  qlen_lazy = 0,
  qlen = 0,
  qlen_last_fqs_check = 0,
  n_cbs_invoked = 0,
  n_nocbs_invoked = 10730040,
  n_cbs_orphaned = 0,
  n_cbs_adopted = 0,
  n_force_qs_snap = 0,
  blimit = 10,
  dynticks = 0xffff883ffdd8ffd0,
  dynticks_snap = 29638122,
  dynticks_fqs = 19616,
  offline_fqs = 0,
  cond_resched_completed = 22467956,
  n_rcu_pending = 2314079,
  n_rp_core_needs_qs = 30976,
  n_rp_report_qs = 654184,
  n_rp_cb_ready = 0,
  n_rp_cpu_needs_gp = 813,
  n_rp_gp_completed = 369917,
  n_rp_gp_started = 37049,
  n_rp_nocb_defer_wakeup = 135,
  n_rp_need_nothing = 1251981,
  barrier_head = {
    next = 0x0,
    func = 0xffffffff810f0090 <rcu_barrier_callback>
  },
  exp_workdone0 = { counter = 20577 },
  exp_workdone1 = { counter = 0 },
  exp_workdone2 = { counter = 0 },
  exp_workdone3 = { counter = 0 },
  nocb_head = 0xffff883ff0f6f700,
  nocb_tail = 0xffff881f9f8ed2d0,
  nocb_q_count = { counter = 2029 },
  nocb_q_count_lazy = { counter = 71 },
  nocb_follower_head = 0x0,
  nocb_follower_tail = 0xffff883ffdd9a508,
  nocb_wq = {
    lock = { raw_lock = { val = { counter = 0 } } },
    task_list = {
      next = 0xffffc900193a3e80,
      prev = 0xffffc900193a3e80
    }
  },
  nocb_kthread = 0xffff883ff0805800,
  nocb_defer_wakeup = 0,
  nocb_gp_head = 0x0,
  nocb_gp_tail = 0xffff883feb6c4d00,
  nocb_leader_sleep = false,
  nocb_next_follower = 0xffff883ffddda380,
  nocb_leader = 0xffff883ffdd9a380,
  softirq_snap = 1061978,
  cpu = 32,
  rsp = 0xffffffff81c703c0 <rcu_sched_state>
}

So the nocb path can miss a wakeup sent to the leader, and if the system
is fairly idle it can take a substantial amount of time for qhimark to be
reached so that we kick the leader again.  It might be possible to tune
qhimark, but that is intellectually unsatisfying, especially since that
tunable has uses outside of the nocb path.

Instead, it looks like, in the conversion to simple wait queues, the
semantics of swake_up() differ slightly from wake_up() and the rcu_nocb
code failed to make a necessary adjustment.  Here's the wakeup path:

static void wake_nocb_leader(struct rcu_data *rdp, bool force)
{
	struct rcu_data *rdp_leader = rdp->nocb_leader;

	if (!READ_ONCE(rdp_leader->nocb_kthread))
		return;
	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
		swake_up(&rdp_leader->nocb_wq);
	}
}

And here's swake_up() itself:

void swake_up(struct swait_queue_head *q)
{
	unsigned long flags;

	if (!swait_active(q))
		return;

	raw_spin_lock_irqsave(&q->lock, flags);
	swake_up_locked(q);
	raw_spin_unlock_irqrestore(&q->lock, flags);
}

It's important to note that swait_active() is called without any locks
held, so we're in a situation where the write to nocb_leader_sleep and,
potentially, a store to q->task_list are in flight on different CPUs at
the same time.  The comment in wake_nocb_leader() notes that the prior
smp_mb__after_atomic() gives us ordering against the prior enqueue, but
that's a red herring here: in order to wake the leader, the waker's store
to nocb_leader_sleep must be visible on the CPU where the to-be-awakened
leader can observe it.
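Concretely, here is my reading of the interleaving we hit, written in the
same two-column style as the wait.h comment quoted further below.  The
sleeper column is swait_event_interruptible() from nocb_leader_wait()
expanded into its prepare/check/schedule steps; this is my
reconstruction, not a trace:

/*
 *   CPU 32 (leader, about to sleep)        CPU enqueuing a callback (waker)
 *
 *   add itself to nocb_wq->task_list
 *   set_current_state(TASK_INTERRUPTIBLE)
 *                                          WRITE_ONCE(->nocb_leader_sleep, false)
 *                                          swake_up():
 *                                            swait_active() runs with no barrier
 *                                            after the store and sees an empty
 *                                            task_list -> returns, no wakeup
 *   if (!READ_ONCE(->nocb_leader_sleep))
 *     the waker's store is not yet visible
 *     here, so we still read "true"
 *   schedule()  <- sleeps with no wakeup pending
 *
 * With an smp_mb() (or smp_store_mb()) between the nocb_leader_sleep
 * store and the swait_active() test, at least one side must observe the
 * other: either the waker sees the waiter on the queue, or the waiter
 * sees nocb_leader_sleep == false and skips the sleep.
 */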
The documentation in memory-barriers.txt cautions us in this way:

[!] Note that the memory barriers implied by the sleeper and the waker do
_not_ order multiple stores before the wake-up with respect to loads of
those stored values after the sleeper has called set_current_state().
For instance, if the sleeper does:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated)
		break;
	__set_current_state(TASK_RUNNING);
	do_something(my_data);

and the waker does:

	my_data = value;
	event_indicated = 1;
	wake_up(&event_wait_queue);

there's no guarantee that the change to event_indicated will be perceived
by the sleeper as coming after the change to my_data.  In such a
circumstance, the code on both sides must interpolate its own memory
barriers between the separate data accesses.  Thus the above sleeper
ought to do:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated) {
		smp_rmb();
		do_something(my_data);
	}

and the waker should do:

	my_data = value;
	smp_wmb();
	event_indicated = 1;
	wake_up(&event_wait_queue);

[https://www.kernel.org/doc/Documentation/memory-barriers.txt]

The RT-linux tree had a patch to its swait_head_has_waiters() function to
ensure that wakeups were not missed:

https://www.spinics.net/lists/linux-rt-users/msg10340.html

That barrier was dropped from the patch as part of getting it ready for
mainline:

http://www.spinics.net/lists/linux-rt-users/msg13691.html

The patch in the prior message seemed to retain a comment about it,
though it's possible that was subsequently dropped as well.

Also note that waitqueue_active() in the current code base carries the
following comment:

/**
 * waitqueue_active -- locklessly test for waiters on the queue
 * @q: the waitqueue to test for waiters
 *
 * returns true if the wait list is not empty
 *
 * NOTE: this function is lockless and requires care, incorrect usage _will_
 * lead to sporadic and non-obvious failure.
 *
 * Use either while holding wait_queue_head_t::lock or when used for wakeups
 * with an extra smp_mb() like:
 *
 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq))         if (@cond)
 *        wake_up(wq);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq, &wait);
 *
 * Because without the explicit smp_mb() it's possible for the
 * waitqueue_active() load to get hoisted over the @cond store such that we'll
 * observe an empty wait list while the waiter might not observe @cond.
 *
 * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
 * which (when the lock is uncontended) are of roughly equal cost.
 */
[include/linux/wait.h]

That's exactly what we're seeing go wrong here.  The waiter didn't see
the condition become true as it was going to sleep, and the waker didn't
see the waiter as being asleep.

Based upon all of this, it's pretty clear that we need a memory barrier
added to wake_nocb_leader().  The good news is that most other users of
swake_up() are already using other atomic operations that cause updates
to their values to become visible.  The fix should be something as simple
as this:

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 56583e7..1322d0b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1765,8 +1765,8 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
 	if (!READ_ONCE(rdp_leader->nocb_kthread))
 		return;
 	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
-		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
-		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
+		/* mb ensures nocb_leader_sleep is observed prior to wakeup */
+		smp_store_mb(rdp_leader->nocb_leader_sleep, false);
 		swake_up(&rdp_leader->nocb_wq);
 	}
 }

--- >8 cut here 8< ---

Thanks,
-K