Hi Naresh, Boqun,

On Thu, Dec 03, 2020 at 09:49:22AM +0800, Boqun Feng wrote:
> On Wed, Dec 02, 2020 at 10:15:44AM +0530, Naresh Kamboju wrote:
> > While running kselftests on the arm64 db410c platform, a "BUG: Invalid
> > wait context" splat was noticed across different runs on this specific
> > platform running stable-rc 5.9.12-rc1.
> >
> > We noticed this BUG while running these two test cases; it is not
> > easily reproducible.
> >
> > # selftests: bpf: test_xdp_redirect.sh
> > # selftests: net: ip6_gre_headroom.sh
> >
> > [ 245.694901] kauditd_printk_skb: 100 callbacks suppressed
> > [ 245.694913] audit: type=1334 audit(251.699:25757): prog-id=12883 op=LOAD
> > [ 245.735658] audit: type=1334 audit(251.743:25758): prog-id=12884 op=LOAD
> > [ 245.801299] audit: type=1334 audit(251.807:25759): prog-id=12885 op=LOAD
> > [ 245.832034] audit: type=1334 audit(251.839:25760): prog-id=12886 op=LOAD
> > [ 245.888601]
> > [ 245.888631] =============================
> > [ 245.889156] [ BUG: Invalid wait context ]
> > [ 245.893071] 5.9.12-rc1 #1 Tainted: G W
> > [ 245.897056] -----------------------------
> > [ 245.902091] pool/1279 is trying to lock:
> > [ 245.906083] ffff000032fc1218 (&child->perf_event_mutex){+.+.}-{3:3}, at: perf_event_exit_task+0x34/0x3a8
> > [ 245.910085] other info that might help us debug this:
> > [ 245.919539] context-{4:4}
> > [ 245.924484] 1 lock held by pool/1279:
> > [ 245.927087] #0: ffff8000127819b8 (rcu_read_lock){....}-{1:2}, at: dput+0x54/0x460
> > [ 245.930739] stack backtrace:
> > [ 245.938203] CPU: 1 PID: 1279 Comm: pool Tainted: G W 5.9.12-rc1 #1
> > [ 245.941243] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> > [ 245.948621] Call trace:
> > [ 245.955390] dump_backtrace+0x0/0x1f8
> > [ 245.957560] show_stack+0x2c/0x38
> > [ 245.961382] dump_stack+0xec/0x158
> > [ 245.964679] __lock_acquire+0x59c/0x15c8
> > [ 245.967978] lock_acquire+0x124/0x4d0
> > [ 245.972058] __mutex_lock+0xa4/0x970
> > [ 245.975615] mutex_lock_nested+0x54/0x70
> > [ 245.979261] perf_event_exit_task+0x34/0x3a8
> > [ 245.983168] do_exit+0x394/0xad8
> > [ 245.987420] do_group_exit+0x4c/0xa8
> > [ 245.990633] get_signal+0x16c/0xb40
> > [ 245.994193] do_notify_resume+0x2ec/0x678
> > [ 245.997404] work_pending+0x8/0x200
>
> From the PoV of lockdep, this means someone tries to acquire a mutex
> inside an RCU read-side critical section, which is bad, because one
> cannot sleep (voluntarily) inside RCU.
>
> However, I don't think that's the true case here, because 1) normally
> people are very careful not to put mutexes or other sleepable locks
> inside RCU, and 2) in the above splat, lockdep finds the RCU read lock
> held at dput() while the mutex is acquired at ret_to_user(); clearly
> there is no call path (in the same context) from the RCU read-side
> critical section of dput() to ret_to_user().
>
> One way of hitting this is a bug in context/irq tracing that makes the
> contexts of dput() and ret_to_user() appear as one context, so that
> lockdep gets confused and reports a false positive.

That sounds likely to me (but I haven't looked too deeply at the above
report).
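To make that concrete, the pattern the "Invalid wait context" check
exists to catch looks roughly like the purely hypothetical sketch below,
i.e. a sleeping lock taken inside an RCU read-side critical section.
Nothing like this is actually on the dput() -> ret_to_user() path in the
report; the names here are made up for illustration:

/* Hypothetical example; not taken from the report above. */
static DEFINE_MUTEX(example_mutex);	/* made-up mutex, for illustration only */

static void bad_pattern(void)
{
	rcu_read_lock();			/* enter RCU read-side critical section */
	mutex_lock(&example_mutex);		/* sleeping lock -> "Invalid wait context" */
	mutex_unlock(&example_mutex);
	rcu_read_unlock();
}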
> FWIW, I think this might be related to some known issues for ARM64 with
> lockdep and irq tracing:
>
> https://lore.kernel.org/lkml/20201119225352.GA5251@willie-the-truck/
>
> And Mark already has a series to fix them:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/irq-fixes
>
> But I must defer to Mark for the latest fix ;-)

That went into mainline a few hours ago, and will be part of v5.10-rc7.
So if it's possible to test with mainline, that would be helpful!

Thanks,
Mark.