Re: Something is leaking RCU holds from interrupt context

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Sun, 4 Apr 2021 19:24:53 +0100




On Sun, Apr 04, 2021 at 09:48:08AM -0700, Paul E. McKenney wrote:
> On Sun, Apr 04, 2021 at 11:24:57AM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 03, 2021 at 09:15:17PM -0700, syzbot wrote:
> > > HEAD commit:    2bb25b3a Merge tag 'mips-fixes_5.12_3' of git://git.kernel..
> > > git tree:       upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=1284cc31d00000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=78ef1d159159890
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=dde0cc33951735441301
> > > 
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+dde0cc33951735441301@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 
> > > WARNING: suspicious RCU usage
> > > 5.12.0-rc5-syzkaller #0 Not tainted
> > > -----------------------------
> > > kernel/sched/core.c:8294 Illegal context switch in RCU-bh read-side critical section!
> > > 
> > > other info that might help us debug this:
> > > 
> > > 
> > > rcu_scheduler_active = 2, debug_locks = 0
> > > no locks held by systemd-udevd/4825.
> > 
> > I think we have something that's taking the RCU read lock in
> > (soft?) interrupt context and not releasing it properly in all
> > situations.  This thread doesn't have any locks recorded, but
> > lock_is_held(&rcu_bh_lock_map) is true.
> > 
> > Is there some debugging code that could find this?  eg should
> > lockdep_softirq_end() check that rcu_bh_lock_map is not held?
> > (if it's taken in process context, then BHs can't run, so if it's
> > held at softirq exit, then there's definitely a problem).
> 
> Something like the (untested) patch below?

Maybe?  Will this tell us who took the lock?  I was really trying to
throw out a suggestion in the hope that somebody who knows this area
better than I do would tell me I was wrong.

> Please note that it does not make sense to also check for
> either rcu_lock_map or rcu_sched_lock_map because either of
> these might be held by the interrupted code.

Yes!  Although if we do it somewhere like tasklet_action_common(),
we could do something like:

+++ b/kernel/softirq.c
@@ -774,6 +774,7 @@ static void tasklet_action_common(struct softirq_action *a,
 
        while (list) {
                struct tasklet_struct *t = list;
+               unsigned long rcu_lockdep = rcu_get_lockdep_state();
 
                list = list->next;
 
@@ -790,6 +791,10 @@ static void tasklet_action_common(struct softirq_action *a,
                        }
                        tasklet_unlock(t);
                }
+               if (rcu_lockdep != rcu_get_lockdep_state()) {
+                       printk(something useful about t);
+                       RCU_LOCKDEP_WARN(... something else useful ...);
+               }
 
                local_irq_disable();

where rcu_get_lockdep_state() returns a bitmap of whether the four rcu
lockdep maps are held.

We might also need something similar in __do_softirq(), in case it's
not a tasklet that's the problem.