On Sat, Feb 04, 2023 at 11:21:27PM +0900, Tetsuo Handa wrote: > On 2023/02/04 22:47, Greg Kroah-Hartman wrote: > > On Sat, Feb 04, 2023 at 10:32:11PM +0900, Tetsuo Handa wrote: > >> Hello. > >> > >> There is a long-standing deadlock problem in driver core code caused by > >> "struct device"->mutex being marked as "do not apply lockdep checks". > > > > The marking of a lock does not cause a deadlock problem, so what do you > > mean exactly by this? Where is the actual deadlock? > > A few of examples: > > https://syzkaller.appspot.com/bug?extid=2d6ac90723742279e101 It's hard to figure out what's wrong from looking at the syzbot report. What makes you think it is connected with dev->mutex? At first glance, it seems that the ath6kl driver is trying to flush a workqueue while holding a lock or mutex that is needed by one of the jobs in the workqueue. That's obviously never going to work, no matter what sort of lockdep validation gets used. > https://syzkaller.appspot.com/bug?extid=2e39bc6569d281acbcfb > https://syzkaller.appspot.com/bug?extid=9ef743bba3a17c756174 > > > > >> We can make this deadlock visible by applying [1], and we can confirm that > >> there is a deadlock problem that I think needs to be addressed in core code [2]. > > > > Any reason why you didn't cc: us on these patches? > > We can't apply this "drivers/core: Remove lockdep_set_novalidate_class() usage" patch > until we fix all lockdep warnings that happen during the boot stage; otherwise syzbot > testing can't work which is more painful than applying this patch now. > > Therefore, I locally tested this patch (in order not to be applied now). And I got > a lockdep warning on the perf_event code. I got next lockdep warning on the driver core > code when I tried a fix for the perf_event code suggested by Peter Zijlstra. Since Peter > confirmed that this is a problem that led to commit 1704f47b50b5 ("lockdep: Add novalidate > class for dev->mutex conversion"), this time I'm reporting this problem to you (so that > you can propose a fix for the driver core code). There is no fix. This problem is inherent to the way that lockdep works. Lockdep classifies locks into classes, and it reports a problem if there is a locking cycle (for example, if someone tries to acquire a lock in class B while holding a lock in class A, someone else tries to acquire a lock in class C while holding a lock in class B, and someone else tries to acquire a lock in class A while holding a lock in class C). But this sort of approach doesn't work when you're dealing with a recursive locking structure such as the device tree. Here all the dev->mutex locks belong to the same class, so lockdep thinks there's a problem whenever someone tries to lock a device while holding another device's lock. However, it is always safe to acquire a child device's lock while holding the parent's lock. Lockdep isn't aware of this because it doesn't understand the hierarchical device tree. That's why lockdep checking has to be disabled for dev->mutex; if it weren't disabled then it would constantly report false positives. Alan Stern