On 11/22/2019 12:25 AM, Paul E. McKenney wrote:
Neeraj Upadhyay (original patch description, Fri, Nov 15, 2019):

For the tasks waiting in exp_wq inside exp_funnel_lock(), there is a chance that they might be blocked indefinitely in the following scenario:

1. A task is waiting on exp sequence 0b'100 inside exp_funnel_lock():

       _synchronize_rcu_expedited()
         s = 0b'100
         exp_funnel_lock()
           wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3], ...)

Paul E. McKenney: This symbol went away a few versions back, but let's see how this plays out in current -rcu.

Neeraj Upadhyay: Sorry; for us this problem is observed on the 4.19 stable version. I had checked against the -rcu code, and the relevant portions were present there.

Paul E. McKenney: All of the above could still happen if the expedited grace-period number was zero (or a bit less) when that task invoked synchronize_rcu_expedited(). What is the relation, if any, between this task and "task1" below? Seems like you want them to be different tasks.

Neeraj Upadhyay: Yes. This task is the one waiting for the expedited sequence that "task1" completes ("task1" holds the exp_mutex for it). "task1" would wake up this task on exp GP completion.

Paul E. McKenney: Does this task actually block, or is it just getting ready to block? Seems like you need it to have actually blocked.

Neeraj Upadhyay: Yes, it actually blocked in the wait queue.

2. The exp GP completes, and the task (task1) holding exp_mutex queues a worker and schedules out:

       _synchronize_rcu_expedited()
         s = 0b'100
         queue_work(rcu_gp_wq, &rew.rew_work)
           wake_up_worker()
             schedule()

Paul E. McKenney: "The exp GP" being the one that was initiated when the .expedited_sequence counter was zero, correct? (Looks that way below.)

Neeraj Upadhyay: Yes, correct.

3. kworker A picks up the queued work and completes the exp GP sequence:

       rcu_exp_wait_wake()
         rcu_exp_gp_seq_end(rsp)
         // rsp->expedited_sequence is incremented to 0b'100

4. task1 does not enter the wait queue, as sync_exp_work_done() returns true, and releases exp_mutex:

       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
                  sync_exp_work_done(rsp, s));
       mutex_unlock(&rsp->exp_mutex);

Paul E. McKenney: So task1 is the one that initiated the expedited grace period that started when .expedited_sequence was zero, right?

Neeraj Upadhyay: Yes, right.

5. The next exp GP completes, and the sequence number is incremented again:

       rcu_exp_wait_wake()
         rcu_exp_gp_seq_end(rsp)
         // rsp->expedited_sequence = 0b'1000 (counter = 2)

6. As kworker A uses the current expedited_sequence, it wakes up waiters from the wrong wait-queue index: it should have woken the wait queue corresponding to the 0b'100 sequence (index 1), but instead wakes up the one for the 0b'1000 sequence (index 2). This leaves the task from step 1 blocked indefinitely:

       rcu_exp_wait_wake()
         wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);

Paul E. McKenney: So the issue is that the next expedited RCU grace period might have completed before the completion of the wakeups for the previous expedited RCU grace period, correct?
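(To make the wait-queue indexing in steps 1-6 concrete, here is a minimal user-space sketch - not kernel code. It assumes the kernel's convention that the low RCU_SEQ_CTR_SHIFT (= 2) bits of ->expedited_sequence hold grace-period state and the upper bits hold the GP counter, with each rcu_node carrying a four-entry exp_wq[] array indexed by (counter & 0x3), as the snippets above show.)

    #include <stdio.h>

    #define RCU_SEQ_CTR_SHIFT 2

    /* Extract the GP counter from a sequence value, as rcu_seq_ctr() does. */
    static unsigned long rcu_seq_ctr(unsigned long s)
    {
        return s >> RCU_SEQ_CTR_SHIFT;
    }

    /* Map a sequence value to its exp_wq[] slot. */
    static unsigned long exp_wq_index(unsigned long s)
    {
        return rcu_seq_ctr(s) & 0x3;
    }

    int main(void)
    {
        unsigned long s = 0x4;    /* waiter's snapshot, 0b'100 */
        unsigned long cur = 0x8;  /* ->expedited_sequence after the next
                                   * GP also ends, 0b'1000 */

        /* The task from step 1 sleeps on index 1 ... */
        printf("waiter sleeps on exp_wq[%lu]\n", exp_wq_index(s));

        /* ... but a wakeup keyed off the current sequence hits index 2,
         * which is exactly the wrong-index wakeup of step 6. */
        printf("late wakeup hits exp_wq[%lu]\n", exp_wq_index(cur));
        return 0;
    }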
Paul E. McKenney: Then expedited grace periods have to have stopped to prevent any future wakeup from happening, correct? (Which would make it harder for rcutorture to trigger this, though it really does have code that attempts to trigger this sort of thing.) Is this theoretical in nature, or have you actually triggered it? If actually triggered, what did you do to make this happen?

Neeraj Upadhyay: Yes. Actually, from the ftraces, I saw that the next expedited RCU grace period completed while kworker A was in D state, waiting for exp_wake_mutex. This led to kworker A using sequence counter 2 (instead of 1) for its wake_up_all() call; so the task from step 1 was never woken up, as it was waiting on wq index 1.

We had seen this issue previously: one instance in May 2018 (on a 4.9 kernel) and another in Nov 2018 (on a 4.14 kernel), both in customer-reported issues. Both instances were in downstream drivers, and we didn't have RCU traces. Now, two days back, it was reported on a 4.19 kernel with RCU traces enabled, in a suspend scenario where we observe a "DPM device timeout" [1], as a scsi device is stuck in _synchronize_rcu_expedited():

    schedule+0x70/0x90
    _synchronize_rcu_expedited+0x590/0x5f8
    synchronize_rcu+0x50/0xa0
    scsi_device_quiesce+0x50/0x120
    scsi_bus_suspend+0x70/0xe8
    dpm_run_callback+0x148/0x388
    __device_suspend+0x430/0x8a8

[1] https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489

Paul E. McKenney: What have you done to test the change?

Neeraj Upadhyay: I have given this for testing and will share the results. The current analysis and patch are based on going through ftrace and code review.

Paul E. McKenney: OK, very good. Please include the failure information in the changelog of the next version of this patch.

Neeraj Upadhyay: Done.

Paul E. McKenney: I prefer your original patch, the one that just uses "s", over the one below that moves the rcu_exp_gp_seq_end(). The big advantage of your original patch is that it allows more concurrency between a consecutive pair of expedited RCU grace periods. Plus it would not be easy to convince myself that moving rcu_exp_gp_seq_end() down is safe, so your original is also conceptually simpler, with a more manageable state space.

Neeraj Upadhyay: The reason for highlighting the alternate approach of doing GP end inside exp_wake_mutex is the requirement for three wait queues. Now, this is a theoretical case; please correct me if I am wrong here:

1. task0 holds exp_wake_mutex and is preempted.

Paul E. McKenney: Presumably after it has awakened the kthread that initiated the prior expedited grace period (the one with seq number = -4).

2. task1 initiates a new GP (current seq number = 0).

Paul E. McKenney: Yes, this can happen.

3. task1 queues worker kworker1 and schedules out.

Paul E. McKenney: And thus still holds .exp_mutex, but yes.

4. kworker1 sets the exp GP counter to 1 and waits on exp_wake_mutex.

Paul E. McKenney: And thus cannot yet have awakened task1.

5. task1 releases exp_mutex without entering the wait queue.

Paul E. McKenney: So I do not believe that we can get to #5. What am I missing here?

Neeraj Upadhyay: As mentioned in this patch, task1 could have scheduled out after queuing the work:

       queue_work(rcu_gp_wq, &rew.rew_work)
         wake_up_worker()
           schedule()

kworker1 runs, picks up the queued work, sets the exp GP counter to 1, and waits on exp_wake_mutex. task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which returns true, so it does not enter the wait queue and releases exp_mutex:

       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
                  sync_exp_work_done(rsp, s));

Paul E. McKenney: Well, I have certainly given enough people a hard time about missing the didn't-actually-sleep case, so good show on finding one in my code! ;-)
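(The "didn't-actually-sleep" case hinges on wait_event()'s fast path: the condition is evaluated before the caller is ever put to sleep, so a waiter whose grace period has already ended returns immediately. A rough user-space model follows; sync_exp_work_done() is simplified here to a plain comparison rather than the kernel's wraparound-safe one.)

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long expedited_sequence;

    /* Simplified stand-in for sync_exp_work_done(): has sequence s ended?
     * The kernel uses a wraparound-safe comparison instead of ">=". */
    static bool sync_exp_work_done(unsigned long s)
    {
        return expedited_sequence >= s;
    }

    int main(void)
    {
        unsigned long s = 0x4;     /* task1's snapshot, 0b'100 */

        expedited_sequence = 0x4;  /* kworker1 already ran
                                    * rcu_exp_gp_seq_end() */

        /* wait_event(wq, cond) checks cond first: it is already true,
         * so task1 never sleeps and goes straight to releasing
         * exp_mutex, exactly as in step 5 above. */
        if (sync_exp_work_done(s))
            printf("condition already true: task1 skips sleeping\n");
        return 0;
    }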
Paul E. McKenney: Which also explains why deferring the rcu_exp_gp_seq_end() is safe: the .exp_mutex won't be released until after it happens, and the next manipulation of the sequence number cannot happen until after .exp_mutex is next acquired. Good catch! And keep up the good work!!!

And here is the commit corresponding to your earlier patch. Please let me know of any needed adjustments.

							Thanx, Paul

------------------------------------------------------------------------

commit 3ec440b52831eea172061c5db3d2990b22904863
Author: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
Date:   Tue Nov 19 11:50:52 2019 -0800

    rcu: Allow only one expedited GP to run concurrently with wakeups

    The current expedited RCU grace-period code expects that a task
    requesting an expedited grace period cannot awaken until that grace
    period has reached the wakeup phase.  However, it is possible for a
    long preemption to result in the waiting task never sleeping.  For
    example, consider the following sequence of events:

    1.  Task A starts an expedited grace period by invoking
        synchronize_rcu_expedited().  It proceeds normally up to the
        wait_event() near the end of that function, and is then
        preempted (or interrupted or whatever).

    2.  The expedited grace period completes, and a kworker task starts
        the awaken phase, having incremented the counter and acquired
        the rcu_state structure's .exp_wake_mutex.  This kworker task
        is then preempted or interrupted or whatever.

    3.  Task A resumes and enters wait_event(), which notes that the
        expedited grace period has completed, and thus doesn't sleep.

    4.  Task B starts an expedited grace period exactly as did Task A,
        complete with the preemption (or whatever delay) just before
        the call to wait_event().

    5.  The expedited grace period completes, and another kworker task
        starts the awaken phase, having incremented the counter.
        However, it blocks when attempting to acquire the rcu_state
        structure's .exp_wake_mutex because step 2's kworker task has
        not yet released it.

    6.  Steps 4 and 5 repeat, resulting in overflow of the rcu_node
        structure's ->exp_wq[] array.

    In theory, this is harmless.  Tasks waiting on the various
    ->exp_wq[] entries will just be spuriously awakened, but they will
    simply sleep again upon noting that the rcu_state structure's
    ->expedited_sequence value has not advanced far enough.

    In practice, this wastes CPU time and is an accident waiting to
    happen.  This commit therefore moves the rcu_exp_gp_seq_end() call
    that officially ends the expedited grace period (along with the
    associated tracing) until after the ->exp_wake_mutex has been
    acquired.  This prevents Task A from awakening prematurely, thus
    preventing more than one expedited grace period from being in
    flight during a previous expedited grace period's wakeup phase.

Neeraj Upadhyay: I am not sure if a "Fixes:" tag is required for it.

Paul E. McKenney: If you have a suggested commit, I would be happy to add it.

Neeraj Upadhyay: I think either of the two below; the first one is on the tree_exp.h file, the second one looks to be the original commit.

1. https://github.com/torvalds/linux/commit/3549c2bc2c4ea8ecfeb9d21cb81cb00c6002b011
2. https://github.com/torvalds/linux/commit/3b5f668e715bc19610ad967ef97a7e8c55a186ec

Paul E. McKenney: Agreed, the second commit is the one that introduced the bug. I placed "Fixes:" tags on both of your commits for this one. And thank you for digging them both up!

							Thanx, Paul
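(Step 6 of the commit message above describes the four-entry ->exp_wq[] array overflowing. The toy loop below, using the same assumed sequence arithmetic as the earlier sketch, shows the fifth grace period lapping back onto the slot of the first while the first's wakeups may still be pending.)

    #include <stdio.h>

    #define RCU_SEQ_CTR_SHIFT 2

    int main(void)
    {
        unsigned long seq = 0;

        /* GPs keep ending while step 2's kworker holds .exp_wake_mutex. */
        for (int gp = 1; gp <= 5; gp++) {
            seq += 1UL << RCU_SEQ_CTR_SHIFT;  /* rcu_exp_gp_seq_end() */
            printf("GP %d ends: waiters parked on exp_wq[%lu]\n",
                   gp, (seq >> RCU_SEQ_CTR_SHIFT) & 0x3);
        }
        /* Output shows GP 5 reusing exp_wq[1], overlapping GP 1's
         * still-pending wakeups. */
        return 0;
    }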
Neeraj Upadhyay: No problem.
Thanks
Neeraj

(Paul's commit continues with the sign-offs and diff:)

Signed-off-by: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
[ paulmck: Added updated comment. ]
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 4433d00a..8840729 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
 	struct rcu_node *rnp;
 
 	synchronize_sched_expedited_wait();
-	rcu_exp_gp_seq_end();
-	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
-
-	/*
-	 * Switch over to wakeup mode, allowing the next GP, but -only- the
-	 * next GP, to proceed.
-	 */
+	// Switch over to wakeup mode, allowing the next GP to proceed.
+	// End the previous grace period only after acquiring the mutex
+	// to ensure that only one GP runs concurrently with wakeups.

Neeraj Upadhyay: Should the comment style be changed to the below?

	/* Switch over to wakeup mode, allowing the next GP to proceed.
	 * End the previous grace period only after acquiring the mutex
	 * to ensure that only one GP runs concurrently with wakeups.
	 */

Paul E. McKenney: No. "//" is an acceptable comment format, aside from docbook headers. The "//" approach saves three characters per line compared with "/* ... */" single-line comments, and a line compared with the style you show above. So yes, some maintainers prefer the style you show, but not me.

							Thanx, Paul
Neeraj Upadhyay: Got it.

Thanks
Neeraj
(The diff continues:)

 	mutex_lock(&rcu_state.exp_wake_mutex);
+	rcu_exp_gp_seq_end();
+	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
 	rcu_for_each_node_breadth_first(rnp) {
 		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
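(One last aside: the loop in this final hunk compares sequence numbers with ULONG_CMP_LT() rather than a plain "<". The sketch below assumes the kernel's definition of that macro, ULONG_MAX / 2 < (a) - (b), and demonstrates that it stays correct even when the sequence counter wraps around.)

    #include <limits.h>
    #include <stdio.h>

    /* Assumed to match the kernel's wraparound-safe definition. */
    #define ULONG_CMP_LT(a, b)  (ULONG_MAX / 2 < (a) - (b))

    int main(void)
    {
        unsigned long a = ULONG_MAX - 3;  /* sequence just before wrap */
        unsigned long b = a + 0x8;        /* two GPs later, wrapped past 0 */

        /* Despite b being numerically tiny, it is "after" a. */
        printf("ULONG_CMP_LT(a, b) = %d\n", ULONG_CMP_LT(a, b));  /* 1 */
        printf("ULONG_CMP_LT(b, a) = %d\n", ULONG_CMP_LT(b, a));  /* 0 */
        return 0;
    }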
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation