Hi Naresh, Boqun,

On Thu, Dec 03, 2020 at 09:49:22AM +0800, Boqun Feng wrote:
> On Wed, Dec 02, 2020 at 10:15:44AM +0530, Naresh Kamboju wrote:
> > While running kselftests on the arm64 db410c platform, a "BUG: Invalid
> > wait context" splat was noticed across different runs on this specific
> > platform running stable-rc 5.9.12-rc1.
> >
> > We noticed this BUG while running these two test cases; it is not
> > easily reproducible.
> >
> > # selftests: bpf: test_xdp_redirect.sh
> > # selftests: net: ip6_gre_headroom.sh
> >
> > [ 245.694901] kauditd_printk_skb: 100 callbacks suppressed
> > [ 245.694913] audit: type=1334 audit(251.699:25757): prog-id=12883 op=LOAD
> > [ 245.735658] audit: type=1334 audit(251.743:25758): prog-id=12884 op=LOAD
> > [ 245.801299] audit: type=1334 audit(251.807:25759): prog-id=12885 op=LOAD
> > [ 245.832034] audit: type=1334 audit(251.839:25760): prog-id=12886 op=LOAD
> > [ 245.888601]
> > [ 245.888631] =============================
> > [ 245.889156] [ BUG: Invalid wait context ]
> > [ 245.893071] 5.9.12-rc1 #1 Tainted: G W
> > [ 245.897056] -----------------------------
> > [ 245.902091] pool/1279 is trying to lock:
> > [ 245.906083] ffff000032fc1218 (&child->perf_event_mutex){+.+.}-{3:3}, at: perf_event_exit_task+0x34/0x3a8
> > [ 245.910085] other info that might help us debug this:
> > [ 245.919539] context-{4:4}
> > [ 245.924484] 1 lock held by pool/1279:
> > [ 245.927087] #0: ffff8000127819b8 (rcu_read_lock){....}-{1:2}, at: dput+0x54/0x460
> > [ 245.930739] stack backtrace:
> > [ 245.938203] CPU: 1 PID: 1279 Comm: pool Tainted: G W 5.9.12-rc1 #1
> > [ 245.941243] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> > [ 245.948621] Call trace:
> > [ 245.955390] dump_backtrace+0x0/0x1f8
> > [ 245.957560] show_stack+0x2c/0x38
> > [ 245.961382] dump_stack+0xec/0x158
> > [ 245.964679] __lock_acquire+0x59c/0x15c8
> > [ 245.967978] lock_acquire+0x124/0x4d0
> > [ 245.972058] __mutex_lock+0xa4/0x970
> > [ 245.975615] mutex_lock_nested+0x54/0x70
> > [ 245.979261] perf_event_exit_task+0x34/0x3a8
> > [ 245.983168] do_exit+0x394/0xad8
> > [ 245.987420] do_group_exit+0x4c/0xa8
> > [ 245.990633] get_signal+0x16c/0xb40
> > [ 245.994193] do_notify_resume+0x2ec/0x678
> > [ 245.997404] work_pending+0x8/0x200
>
> From the PoV of lockdep, this means someone tries to acquire a mutex
> inside an RCU read-side critical section, which is bad, because one
> cannot sleep (voluntarily) inside RCU.
>
> However, I don't think that's the true case here, because 1) normally
> people are very careful not to put mutexes or other sleepable locks
> inside RCU, and 2) in the above splat, lockdep finds the RCU read lock
> held at dput() while the mutex is acquired at ret_to_user(); clearly
> there is no call path (in the same context) from the RCU read-side
> critical section of dput() to ret_to_user().
>
> One way of hitting this is a bug in context/irq tracing that makes the
> contexts of dput() and ret_to_user() appear as one context, so that
> lockdep gets confused and reports a false positive.

That sounds likely to me (but I haven't looked too deeply at the above
report).
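To make that concrete, the pattern the "Invalid wait context" check
exists to catch looks roughly like the purely hypothetical sketch below,
i.e. a sleeping lock taken inside an RCU read-side critical section.
Nothing like this is actually on the dput() -> ret_to_user() path in the
report; the names here are made up for illustration:

/* Hypothetical example; not taken from the report above. */
static DEFINE_MUTEX(example_mutex);	/* made-up mutex, for illustration only */

static void bad_pattern(void)
{
	rcu_read_lock();			/* enter RCU read-side critical section */
	mutex_lock(&example_mutex);		/* sleeping lock -> "Invalid wait context" */
	mutex_unlock(&example_mutex);
	rcu_read_unlock();
}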
> FWIW, I think this might be related to some known issues for ARM64 with
> lockdep and irq tracing:
>
> https://lore.kernel.org/lkml/20201119225352.GA5251@willie-the-truck/
>
> And Mark already has a series to fix them:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/irq-fixes
>
> But I must defer to Mark for the latest fix ;-)

That went into mainline a few hours ago, and will be part of v5.10-rc7.
So if it's possible to test with mainline, that would be helpful!

Thanks,
Mark.