On Sat, Feb 11, 2023 at 04:41:11PM -0500, Alan Stern wrote: > Lockdep is blind to the dev->mutex field of struct device, owing to > the fact that these mutexes are assigned to lockdep's "novalidate" > class. Commit 1704f47b50b5 ("lockdep: Add novalidate class for > dev->mutex conversion") did this because the hierarchical nature of > the device tree makes it impossible in practice to determine whether > acquiring one of these mutexes is safe or might lead to a deadlock. > > Unfortunately, this means that lockdep is unable to help diagnose real > deadlocks involving these mutexes when they occur in testing [1] [2] > or in actual use, or to detect bad locking patterns that might lead to > a deadlock. We would like to obtain as much of lockdep's benefits as > possible without generating a flood of false positives -- which is > what happens if one naively removes these mutexes from the > "novalidate" class. > > Accordingly, as a middle ground the mutex in each non-static struct > device will be placed in its own unique locking class. This approach > gives up some of lockdep's advantages (for example, all devices having > a particular bus_type or device_type might reasonably be put into the > same locking class), but it should at least allow us to gain the > benefit of some of lockdep's capabilities. > > Link: https://syzkaller.appspot.com/bug?extid=2d6ac90723742279e101 [1] > Link: https://syzkaller.appspot.com/bug?extid=2e39bc6569d281acbcfb [2] > Link: https://lore.kernel.org/all/28a82f50-39d5-a45f-7c7a-57a66cec0741@xxxxxxxxxxxxxxxxxxx/ > Suggested-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> > Signed-off-by: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> > CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > CC: Ingo Molnar <mingo@xxxxxxxxxx> > CC: Boqun Feng <boqun.feng@xxxxxxxxx> My main worry when adding a ton of classes like this is the combinatorics (dynamic classes make this more trivial, but it's entirely possible with just static classes too). As an example, we used to have a static class per cpu runqueue, now, given migration takes two runqueue locks (per locking rules in cpu order -- source and dest runqueue etc..) we got O(n^2) combinations of class orderings, which lead to a giant graph, which led to both performance and memory usage issues. >From this was born the subclass, which reduced the whole thing to a static ordering of runqueue-1 goes inside runqueue-0. Similar combinatoric issues have cropped up in other places from time to time, typically solved by using a different annotation. Now, given the whole device thing is more or less a tree, hierarchical locking should limit that, the only worry is that sibling locking while holding parent thing. If siblings are of different classes, things will both complain and create combinatorics again.