On 11/22/2019 12:25 AM, Paul E. McKenney wrote:
Neeraj Upadhyay (original patch description, Fri, Nov 15, 2019):

For the tasks waiting in exp_wq inside exp_funnel_lock(), there is a chance that they might be blocked indefinitely in the following scenario:

1. A task is waiting on exp sequence 0b'100 inside exp_funnel_lock():

       _synchronize_rcu_expedited()
         s = 0b'100
         exp_funnel_lock()
           wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3], ...)

Paul E. McKenney: This symbol went away a few versions back, but let's see how this plays out in current -rcu.

Neeraj Upadhyay: Sorry; for us this problem is observed on the 4.19 stable version. I had checked against the -rcu code, and the relevant portions were present there.

Paul E. McKenney: All of the above could still happen if the expedited grace-period number was zero (or a bit less) when that task invoked synchronize_rcu_expedited(). What is the relation, if any, between this task and "task1" below? Seems like you want them to be different tasks.

Neeraj Upadhyay: Yes. This task is the one waiting for the expedited sequence that "task1" completes ("task1" holds the exp_mutex for it). "task1" would wake up this task on exp GP completion.

Paul E. McKenney: Does this task actually block, or is it just getting ready to block? Seems like you need it to have actually blocked.

Neeraj Upadhyay: Yes, it actually blocked in the wait queue.

2. The exp GP completes, and the task (task1) holding exp_mutex queues a worker and schedules out:

       _synchronize_rcu_expedited()
         s = 0b'100
         queue_work(rcu_gp_wq, &rew.rew_work)
           wake_up_worker()
             schedule()

Paul E. McKenney: "The exp GP" being the one that was initiated when the .expedited_sequence counter was zero, correct? (Looks that way below.)

Neeraj Upadhyay: Yes, correct.

3. kworker A picks up the queued work and completes the exp GP sequence:

       rcu_exp_wait_wake()
         rcu_exp_gp_seq_end(rsp)
         // rsp->expedited_sequence is incremented to 0b'100

4. task1 does not enter the wait queue, as sync_exp_work_done() returns true, and releases exp_mutex:

       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
                  sync_exp_work_done(rsp, s));
       mutex_unlock(&rsp->exp_mutex);

Paul E. McKenney: So task1 is the one that initiated the expedited grace period that started when .expedited_sequence was zero, right?

Neeraj Upadhyay: Yes, right.

5. The next exp GP completes, and the sequence number is incremented again:

       rcu_exp_wait_wake()
         rcu_exp_gp_seq_end(rsp)
         // rsp->expedited_sequence = 0b'1000 (counter = 2)

6. As kworker A uses the current expedited_sequence, it wakes up waiters from the wrong wait-queue index: it should have woken the wait queue corresponding to the 0b'100 sequence (index 1), but instead wakes up the one for the 0b'1000 sequence (index 2). This leaves the task from step 1 blocked indefinitely:

       rcu_exp_wait_wake()
         wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);

Paul E. McKenney: So the issue is that the next expedited RCU grace period might have completed before the completion of the wakeups for the previous expedited RCU grace period, correct?
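(To make the wait-queue indexing in steps 1-6 concrete, here is a minimal user-space sketch - not kernel code. It assumes the kernel's convention that the low RCU_SEQ_CTR_SHIFT (= 2) bits of ->expedited_sequence hold grace-period state and the upper bits hold the GP counter, with each rcu_node carrying a four-entry exp_wq[] array indexed by (counter & 0x3), as the snippets above show.)

    #include <stdio.h>

    #define RCU_SEQ_CTR_SHIFT 2

    /* Extract the GP counter from a sequence value, as rcu_seq_ctr() does. */
    static unsigned long rcu_seq_ctr(unsigned long s)
    {
        return s >> RCU_SEQ_CTR_SHIFT;
    }

    /* Map a sequence value to its exp_wq[] slot. */
    static unsigned long exp_wq_index(unsigned long s)
    {
        return rcu_seq_ctr(s) & 0x3;
    }

    int main(void)
    {
        unsigned long s = 0x4;    /* waiter's snapshot, 0b'100 */
        unsigned long cur = 0x8;  /* ->expedited_sequence after the next
                                   * GP also ends, 0b'1000 */

        /* The task from step 1 sleeps on index 1 ... */
        printf("waiter sleeps on exp_wq[%lu]\n", exp_wq_index(s));

        /* ... but a wakeup keyed off the current sequence hits index 2,
         * which is exactly the wrong-index wakeup of step 6. */
        printf("late wakeup hits exp_wq[%lu]\n", exp_wq_index(cur));
        return 0;
    }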
Paul E. McKenney: Then expedited grace periods have to have stopped to prevent any future wakeup from happening, correct? (Which would make it harder for rcutorture to trigger this, though it really does have code that attempts to trigger this sort of thing.) Is this theoretical in nature, or have you actually triggered it? If actually triggered, what did you do to make this happen?

Neeraj Upadhyay: Yes. Actually, from the ftraces, I saw that the next expedited RCU grace period completed while kworker A was in D state, waiting for exp_wake_mutex. This led to kworker A using sequence counter 2 (instead of 1) for its wake_up_all() call; so the task from step 1 was never woken up, as it was waiting on wq index 1.

We had seen this issue previously: one instance in May 2018 (on a 4.9 kernel) and another in Nov 2018 (on a 4.14 kernel), both in customer-reported issues. Both instances were in downstream drivers, and we didn't have RCU traces. Now, two days back, it was reported on a 4.19 kernel with RCU traces enabled, in a suspend scenario where we observe a "DPM device timeout" [1], as a scsi device is stuck in _synchronize_rcu_expedited():

    schedule+0x70/0x90
    _synchronize_rcu_expedited+0x590/0x5f8
    synchronize_rcu+0x50/0xa0
    scsi_device_quiesce+0x50/0x120
    scsi_bus_suspend+0x70/0xe8
    dpm_run_callback+0x148/0x388
    __device_suspend+0x430/0x8a8

[1] https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489

Paul E. McKenney: What have you done to test the change?

Neeraj Upadhyay: I have given this for testing and will share the results. The current analysis and patch are based on going through ftrace and code review.

Paul E. McKenney: OK, very good. Please include the failure information in the changelog of the next version of this patch.

Neeraj Upadhyay: Done.

Paul E. McKenney: I prefer your original patch, the one that just uses "s", over the one below that moves the rcu_exp_gp_seq_end(). The big advantage of your original patch is that it allows more concurrency between a consecutive pair of expedited RCU grace periods. Plus it would not be easy to convince myself that moving rcu_exp_gp_seq_end() down is safe, so your original is also conceptually simpler, with a more manageable state space.

Neeraj Upadhyay: The reason for highlighting the alternate approach of doing GP end inside exp_wake_mutex is the requirement for three wait queues. Now, this is a theoretical case; please correct me if I am wrong here:

1. task0 holds exp_wake_mutex and is preempted.

Paul E. McKenney: Presumably after it has awakened the kthread that initiated the prior expedited grace period (the one with seq number = -4).

2. task1 initiates a new GP (current seq number = 0).

Paul E. McKenney: Yes, this can happen.

3. task1 queues worker kworker1 and schedules out.

Paul E. McKenney: And thus still holds .exp_mutex, but yes.

4. kworker1 sets the exp GP counter to 1 and waits on exp_wake_mutex.

Paul E. McKenney: And thus cannot yet have awakened task1.

5. task1 releases exp_mutex without entering the wait queue.

Paul E. McKenney: So I do not believe that we can get to #5. What am I missing here?

Neeraj Upadhyay: As mentioned in this patch, task1 could have scheduled out after queuing the work:

       queue_work(rcu_gp_wq, &rew.rew_work)
         wake_up_worker()
           schedule()

kworker1 runs, picks up the queued work, sets the exp GP counter to 1, and waits on exp_wake_mutex. task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which returns true, so it does not enter the wait queue and releases exp_mutex:

       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
                  sync_exp_work_done(rsp, s));

Paul E. McKenney: Well, I have certainly given enough people a hard time about missing the didn't-actually-sleep case, so good show on finding one in my code! ;-)
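(The "didn't-actually-sleep" case hinges on wait_event()'s fast path: the condition is evaluated before the caller is ever put to sleep, so a waiter whose grace period has already ended returns immediately. A rough user-space model follows; sync_exp_work_done() is simplified here to a plain comparison rather than the kernel's wraparound-safe one.)

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long expedited_sequence;

    /* Simplified stand-in for sync_exp_work_done(): has sequence s ended?
     * The kernel uses a wraparound-safe comparison instead of ">=". */
    static bool sync_exp_work_done(unsigned long s)
    {
        return expedited_sequence >= s;
    }

    int main(void)
    {
        unsigned long s = 0x4;     /* task1's snapshot, 0b'100 */

        expedited_sequence = 0x4;  /* kworker1 already ran
                                    * rcu_exp_gp_seq_end() */

        /* wait_event(wq, cond) checks cond first: it is already true,
         * so task1 never sleeps and goes straight to releasing
         * exp_mutex, exactly as in step 5 above. */
        if (sync_exp_work_done(s))
            printf("condition already true: task1 skips sleeping\n");
        return 0;
    }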
Paul E. McKenney: Which also explains why deferring the rcu_exp_gp_seq_end() is safe: the .exp_mutex won't be released until after it happens, and the next manipulation of the sequence number cannot happen until after .exp_mutex is next acquired. Good catch! And keep up the good work!!!

And here is the commit corresponding to your earlier patch. Please let me know of any needed adjustments.

							Thanx, Paul

------------------------------------------------------------------------

commit 3ec440b52831eea172061c5db3d2990b22904863
Author: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
Date:   Tue Nov 19 11:50:52 2019 -0800

    rcu: Allow only one expedited GP to run concurrently with wakeups

    The current expedited RCU grace-period code expects that a task
    requesting an expedited grace period cannot awaken until that grace
    period has reached the wakeup phase.  However, it is possible for a
    long preemption to result in the waiting task never sleeping.  For
    example, consider the following sequence of events:

    1.  Task A starts an expedited grace period by invoking
        synchronize_rcu_expedited().  It proceeds normally up to the
        wait_event() near the end of that function, and is then
        preempted (or interrupted or whatever).

    2.  The expedited grace period completes, and a kworker task starts
        the awaken phase, having incremented the counter and acquired
        the rcu_state structure's .exp_wake_mutex.  This kworker task
        is then preempted or interrupted or whatever.

    3.  Task A resumes and enters wait_event(), which notes that the
        expedited grace period has completed, and thus doesn't sleep.

    4.  Task B starts an expedited grace period exactly as did Task A,
        complete with the preemption (or whatever delay) just before
        the call to wait_event().

    5.  The expedited grace period completes, and another kworker task
        starts the awaken phase, having incremented the counter.
        However, it blocks when attempting to acquire the rcu_state
        structure's .exp_wake_mutex because step 2's kworker task has
        not yet released it.

    6.  Steps 4 and 5 repeat, resulting in overflow of the rcu_node
        structure's ->exp_wq[] array.

    In theory, this is harmless.  Tasks waiting on the various
    ->exp_wq[] entries will just be spuriously awakened, but they will
    simply sleep again upon noting that the rcu_state structure's
    ->expedited_sequence value has not advanced far enough.

    In practice, this wastes CPU time and is an accident waiting to
    happen.  This commit therefore moves the rcu_exp_gp_seq_end() call
    that officially ends the expedited grace period (along with the
    associated tracing) until after the ->exp_wake_mutex has been
    acquired.  This prevents Task A from awakening prematurely, thus
    preventing more than one expedited grace period from being in
    flight during a previous expedited grace period's wakeup phase.

Neeraj Upadhyay: I am not sure if a "Fixes:" tag is required for it.

Paul E. McKenney: If you have a suggested commit, I would be happy to add it.

Neeraj Upadhyay: I think either of the two below; the first one is on the tree_exp.h file, the second one looks to be the original commit.

1. https://github.com/torvalds/linux/commit/3549c2bc2c4ea8ecfeb9d21cb81cb00c6002b011
2. https://github.com/torvalds/linux/commit/3b5f668e715bc19610ad967ef97a7e8c55a186ec

Paul E. McKenney: Agreed, the second commit is the one that introduced the bug. I placed "Fixes:" tags on both of your commits for this one. And thank you for digging them both up!

							Thanx, Paul
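(Step 6 of the commit message above describes the four-entry ->exp_wq[] array overflowing. The toy loop below, using the same assumed sequence arithmetic as the earlier sketch, shows the fifth grace period lapping back onto the slot of the first while the first's wakeups may still be pending.)

    #include <stdio.h>

    #define RCU_SEQ_CTR_SHIFT 2

    int main(void)
    {
        unsigned long seq = 0;

        /* GPs keep ending while step 2's kworker holds .exp_wake_mutex. */
        for (int gp = 1; gp <= 5; gp++) {
            seq += 1UL << RCU_SEQ_CTR_SHIFT;  /* rcu_exp_gp_seq_end() */
            printf("GP %d ends: waiters parked on exp_wq[%lu]\n",
                   gp, (seq >> RCU_SEQ_CTR_SHIFT) & 0x3);
        }
        /* Output shows GP 5 reusing exp_wq[1], overlapping GP 1's
         * still-pending wakeups. */
        return 0;
    }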
Neeraj Upadhyay: No problem.
Thanks
Neeraj

(Paul's commit continues with the sign-offs and diff:)

Signed-off-by: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
[ paulmck: Added updated comment. ]
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 4433d00a..8840729 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
 	struct rcu_node *rnp;
 
 	synchronize_sched_expedited_wait();
-	rcu_exp_gp_seq_end();
-	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
-
-	/*
-	 * Switch over to wakeup mode, allowing the next GP, but -only- the
-	 * next GP, to proceed.
-	 */
+	// Switch over to wakeup mode, allowing the next GP to proceed.
+	// End the previous grace period only after acquiring the mutex
+	// to ensure that only one GP runs concurrently with wakeups.

Neeraj Upadhyay: Should the comment style be changed to the below?

	/* Switch over to wakeup mode, allowing the next GP to proceed.
	 * End the previous grace period only after acquiring the mutex
	 * to ensure that only one GP runs concurrently with wakeups.
	 */

Paul E. McKenney: No. "//" is an acceptable comment format, aside from docbook headers. The "//" approach saves three characters per line compared with "/* ... */" single-line comments, and a line compared with the style you show above. So yes, some maintainers prefer the style you show, but not me.

							Thanx, Paul
Neeraj Upadhyay: Got it.

Thanks
Neeraj
(The diff continues:)

 	mutex_lock(&rcu_state.exp_wake_mutex);
+	rcu_exp_gp_seq_end();
+	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
 	rcu_for_each_node_breadth_first(rnp) {
 		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
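(One last aside: the loop in this final hunk compares sequence numbers with ULONG_CMP_LT() rather than a plain "<". The sketch below assumes the kernel's definition of that macro, ULONG_MAX / 2 < (a) - (b), and demonstrates that it stays correct even when the sequence counter wraps around.)

    #include <limits.h>
    #include <stdio.h>

    /* Assumed to match the kernel's wraparound-safe definition. */
    #define ULONG_CMP_LT(a, b)  (ULONG_MAX / 2 < (a) - (b))

    int main(void)
    {
        unsigned long a = ULONG_MAX - 3;  /* sequence just before wrap */
        unsigned long b = a + 0x8;        /* two GPs later, wrapped past 0 */

        /* Despite b being numerically tiny, it is "after" a. */
        printf("ULONG_CMP_LT(a, b) = %d\n", ULONG_CMP_LT(a, b));  /* 1 */
        printf("ULONG_CMP_LT(b, a) = %d\n", ULONG_CMP_LT(b, a));  /* 0 */
        return 0;
    }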
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation