Hi Paul,

On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
>> Hi Paul,
>>
>> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>>>> Hi Paul,
>>>>
>>>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(), there is
>>>>>> a chance that they might be indefinitely blocked in the below
>>>>>> scenario:
>>>>>>
>>>>>> 1. There is a task waiting on exp sequence 0b'100 inside
>>>>>>    exp_funnel_lock().
>>>>>>
>>>>>>      _synchronize_rcu_expedited()
>>>>>
>>>>> This symbol went away a few versions back, but let's see how this
>>>>> plays out in current -rcu.
>>>>
>>>> Sorry; for us this problem is observed on the 4.19 stable version; I
>>>> had checked against the -rcu code, and the relevant portions were
>>>> present there.
>>>>
>>>>>>        s = 0b'100
>>>>>>        exp_funnel_lock()
>>>>>>          wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>
>>>>> All of the above could still happen if the expedited grace period
>>>>> number was zero (or a bit less) when that task invoked
>>>>
>>>> Yes.
>>>>
>>>>> synchronize_rcu_expedited(). What is the relation, if any, between
>>>>> this task and "task1" below? Seems like you want them to be
>>>>> different tasks.
>>>>
>>>> This task is the one which is waiting for the expedited sequence
>>>> which "task1" completes ("task1" holds the exp_mutex for it).
>>>> "task1" would wake up this task on exp GP completion.
>>>
>>> Does this task actually block, or is it just getting ready to block?
>>> Seems like you need it to have actually blocked.
>>
>> Yes, it actually blocked in the wait queue.
>>
>>>>>> 2. The exp GP completes, and the task (task1) holding exp_mutex
>>>>>>    queues the worker and schedules out.
>>>>>
>>>>> "The Exp GP" being the one that was initiated when the
>>>>> .expedited_sequence counter was zero, correct? (Looks that way
>>>>> below.)
>>>>
>>>> Yes, correct.
>>>>
>>>>>>      _synchronize_rcu_expedited()
>>>>>>        s = 0b'100
>>>>>>        queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>          wake_up_worker()
>>>>>>        schedule()
>>>>>>
>>>>>> 3. kworker A picks up the queued work and completes the exp GP
>>>>>>    sequence.
>>>>>>
>>>>>>      rcu_exp_wait_wake()
>>>>>>        rcu_exp_gp_seq_end(rsp)
>>>>>>        // rsp->expedited_sequence is incremented to 0b'100
>>>>>>
>>>>>> 4. task1 does not enter the wait queue, as sync_exp_work_done()
>>>>>>    returns true, and releases exp_mutex.
>>>>>>
>>>>>>      wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>                 sync_exp_work_done(rsp, s));
>>>>>>      mutex_unlock(&rsp->exp_mutex);
>>>>>
>>>>> So task1 is the one that initiated the expedited grace period that
>>>>> started when .expedited_sequence was zero, right?
>>>>
>>>> Yes, right.
>>>>
>>>>>> 5. The next exp GP completes, and the sequence number is
>>>>>>    incremented:
>>>>>>
>>>>>>      rcu_exp_wait_wake()
>>>>>>        rcu_exp_gp_seq_end(rsp)
>>>>>>        // rsp->expedited_sequence = 0b'200
>>>>>>
>>>>>> 6. As kworker A uses the current expedited_sequence, it wakes up
>>>>>>    waiters from the wrong wait queue index - it should have woken
>>>>>>    the wait queue corresponding to the 0b'100 sequence, but wakes
>>>>>>    up the ones for the 0b'200 sequence. This leaves the task at
>>>>>>    step 1 indefinitely blocked.
>>>>>>
>>>>>>      rcu_exp_wait_wake()
>>>>>>        wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>
>>>>> So the issue is that the next expedited RCU grace period might have
>>>>> completed before the completion of the wakeups for the previous
>>>>> expedited RCU grace period, correct? Then expedited grace periods have
>>>>
>>>> Yes. Actually, from the ftraces, I saw that the next expedited RCU
>>>> grace period completed while kworker A was in D state, waiting for
>>>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of
>>>> 1) for its wake_up_all() call; so the task (point 1) was never woken
>>>> up, as it was waiting on wq index 1.
>>>>
>>>>> to have stopped to prevent any future wakeup from happening,
>>>>> correct? (Which would make it harder for rcutorture to trigger
>>>>> this, though it really does have code that attempts to trigger this
>>>>> sort of thing.)
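To make the wait-queue indexing concrete, here is a minimal userspace
sketch (mine, for illustration only - it just mirrors the kernel's
rcu_seq_ctr() and the "& 0x3" slot selection; 0x4 and 0x8 stand for the
0b'100 and 0b'200 sequences above):

  #include <stdio.h>

  #define RCU_SEQ_CTR_SHIFT 2  /* low two bits of the sequence hold GP state */

  /* Mirrors the kernel's rcu_seq_ctr(): strip the state bits. */
  static unsigned long rcu_seq_ctr(unsigned long s)
  {
  	return s >> RCU_SEQ_CTR_SHIFT;
  }

  /* The exp_wq[] slot chosen for a given sequence value. */
  static unsigned long exp_wq_slot(unsigned long s)
  {
  	return rcu_seq_ctr(s) & 0x3;
  }

  int main(void)
  {
  	unsigned long s = 0x4;                  /* waiter's snapshot: 0b'100, ctr 1 */
  	unsigned long expedited_sequence = 0x8; /* after the next GP ends: 0b'200, ctr 2 */

  	/* The step-1 task sleeps on slot 1, computed from its snapshot s... */
  	printf("waiter sleeps on exp_wq[%lu]\n", exp_wq_slot(s));
  	/* ...but a wakeup keyed on the live counter hits slot 2 instead. */
  	printf("wakeup hits     exp_wq[%lu]\n", exp_wq_slot(expedited_sequence));
  	/* Keying the wakeup on s (the proposed fix) always matches the waiter. */
  	return 0;
  }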
>>>>> Is this theoretical in nature, or have you actually triggered it?
>>>>> If actually triggered, what did you do to make this happen?
>>>>
>>>> This issue we had seen previously - one instance in May 2018 (on a
>>>> 4.9 kernel) and another in Nov 2018 (on a 4.14 kernel), in our
>>>> customer-reported issues. Both instances were in downstream drivers,
>>>> and we didn't have RCU traces. Now, 2 days back, it was reported on
>>>> a 4.19 kernel with RCU traces enabled; it was observed in a suspend
>>>> scenario, where we see a "DPM device timeout" [1], as the scsi
>>>> device is stuck in _synchronize_rcu_expedited():
>>>>
>>>>   schedule+0x70/0x90
>>>>   _synchronize_rcu_expedited+0x590/0x5f8
>>>>   synchronize_rcu+0x50/0xa0
>>>>   scsi_device_quiesce+0x50/0x120
>>>>   scsi_bus_suspend+0x70/0xe8
>>>>   dpm_run_callback+0x148/0x388
>>>>   __device_suspend+0x430/0x8a8
>>>>
>>>> [1] https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>>>
>>>>> What have you done to test the change?
>>>>
>>>> I have given this for testing; will share the results. The current
>>>> analysis and patch are based on going through the ftrace output and
>>>> code review.
>>>
>>> OK, very good. Please include the failure information in the
>>> changelog of the next version of this patch.
Done.
>>> I prefer your original patch, the one that just uses "s", over the
>>> one below that moves the rcu_exp_gp_seq_end(). The big advantage of
>>> your original patch is that it allows more concurrency between a
>>> consecutive pair of expedited RCU grace periods. Plus it would not
>>> be easy to convince myself that moving rcu_exp_gp_seq_end() down is
>>> safe, so your original is also conceptually simpler, with a more
>>> manageable state space.
The reason for highlighting the alternate approach of doing the GP end
inside exp_wake_mutex is the requirement of 3 wqs. Now, this is a
theoretical case; please correct me if I am wrong here:

1. task0 holds exp_wake_mutex and is preempted.
2. task1 initiates a new GP (current seq number = 0).
3. task1 queues worker kworker1 and schedules out.
4. kworker1 sets the exp GP to 1 and waits on exp_wake_mutex.
5. task1 releases exp_mutex, w/o entering the waitq.
6. task2 initiates a new GP (current seq number = 1).
7. task2 queues worker kworker2 and schedules out.
8. kworker2 sets the exp GP to 2 and waits on exp_wake_mutex.
...

This sequence would break the requirement of max 3 wqs. If we hold
exp_wake_mutex before the exp seq end, there will be at most three
pending GPs (as sketched in code below):

1. the GP which the current owner is doing wakeups for,
2. the GP which the waiter of exp_wake_mutex would complete,
3. the next GP, whose waiters started after the GP at point 2 had
   started.

This also is in line with the statement in [1]:

"The key point is that the ->exp_mutex is not released until the first
wakeup is complete, which means that the ->exp_wake_mutex has already
been acquired at that point"

[1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
>>> Please also add the WARN_ON(), though at first glance your change
>>> seems to have lost the wakeup. (But it is early, so maybe it is just
>>> that I am not yet fully awake.)
>>
>> My bad, I posted an incomplete diff in the previous mail:
>>
>>  static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>  {
>>  	struct rcu_node *rnp;
>> +	unsigned long exp_low;
>> +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>
>>  	synchronize_sched_expedited_wait(rsp);
>>  	rcu_exp_gp_seq_end(rsp);
>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>  			spin_unlock(&rnp->exp_lock);
>>  		}
>>  		smp_mb(); /* All above changes before wakeup. */
>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>> +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>> +		WARN_ON(s_low != exp_low);
>> +		wake_up_all(&rnp->exp_wq[exp_low]);
>
> Much better! But I suggest using s_low in the wake_up_all(). This
> hunk is of course only for testing purposes, not for actual
> inclusion. (My earlier email didn't make that clear.)
>
> 							Thanx, Paul
Ok sure, got it. For reference, the testing hunk with s_low in the
wake_up_all() is sketched just below. I will share the results, once
the issue is reproduced.

Thanks,
Neeraj
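That testing-only hunk, with the suggestion applied (my reconstruction,
combining the two hunks above; for testing, not for inclusion):

 		smp_mb(); /* All above changes before wakeup. */
-		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+		/* Slot from the live counter, kept only for the sanity check. */
+		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
+		WARN_ON(s_low != exp_low);
+		/* Wake the slot computed from snapshot s, where waiters sleep. */
+		wake_up_all(&rnp->exp_wq[s_low]);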
>>  	}
>>
>> Thanks,
>> Neeraj
>
>>> 							Thanx, Paul
>>>
>>>> I was thinking of another way of addressing this problem: doing the
>>>> exp seq end inside exp_wake_mutex. This will also ensure that, if we
>>>> extend the current scenario to one where multiple expedited GP
>>>> sequences have completed before exp_wake_mutex is held, we preserve
>>>> the requirement of 3 wq entries [2].
>>>>
>>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
>>>>
>>>> @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>  	struct rcu_node *rnp;
>>>>
>>>>  	synchronize_sched_expedited_wait(rsp);
>>>> -	rcu_exp_gp_seq_end(rsp);
>>>> -	trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>>
>>>>  	/*
>>>>  	 * Switch over to wakeup mode, allowing the next GP, but -only- the
>>>> @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>  	 */
>>>>  	mutex_lock(&rsp->exp_wake_mutex);
>>>>
>>>> +	rcu_exp_gp_seq_end(rsp);
>>>> +	trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>> +
>>>>>
>>>>> (Using a WARN_ON() to check for the lower bits of the counter
>>>>> portion of rcu_state.expedited_sequence differing from the same
>>>>> bits of s would be one way to detect this problem.)
>>>>>
>>>>> 							Thanx, Paul
>>>>
>>>> I have also given the patch for this, for testing:
>>>>
>>>>  static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>  {
>>>>  	struct rcu_node *rnp;
>>>> +	unsigned long exp_low;
>>>> +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>>>
>>>>  	synchronize_sched_expedited_wait(rsp);
>>>>  	rcu_exp_gp_seq_end(rsp);
>>>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>  			spin_unlock(&rnp->exp_lock);
>>>>  		}
>>>>  		smp_mb(); /* All above changes before wakeup. */
>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>> +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>>>> +		WARN_ON(s_low != exp_low);
>>>> +
>>>>
>>>> Thanks,
>>>> Neeraj
>>>>
>>>>>> Fix this by using the correct sequence number for wake_up_all()
>>>>>> inside rcu_exp_wait_wake().
>>>>>>
>>>>>> Signed-off-by: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
>>>>>> ---
>>>>>>  kernel/rcu/tree_exp.h | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>>> index e4b77d3..28979d3 100644
>>>>>> --- a/kernel/rcu/tree_exp.h
>>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>>>>  			spin_unlock(&rnp->exp_lock);
>>>>>>  		}
>>>>>>  		smp_mb(); /* All above changes before wakeup. */
>>>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>>>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>>>>  	}
>>>>>>  	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>>>>  	mutex_unlock(&rcu_state.exp_wake_mutex);
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation