Re: [PATCH RFC] drivers/core: Replace lockdep_set_novalidate_class() with unique class keys

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Mon, 13 Feb 2023 10:49:50 +0100

On Sat, Feb 11, 2023 at 04:41:11PM -0500, Alan Stern wrote:
> Lockdep is blind to the dev->mutex field of struct device, owing to
> the fact that these mutexes are assigned to lockdep's "novalidate"
> class.  Commit 1704f47b50b5 ("lockdep: Add novalidate class for
> dev->mutex conversion") did this because the hierarchical nature of
> the device tree makes it impossible in practice to determine whether
> acquiring one of these mutexes is safe or might lead to a deadlock.
> 
> Unfortunately, this means that lockdep is unable to help diagnose real
> deadlocks involving these mutexes when they occur in testing [1] [2]
> or in actual use, or to detect bad locking patterns that might lead to
> a deadlock.  We would like to obtain as much of lockdep's benefits as
> possible without generating a flood of false positives -- which is
> what happens if one naively removes these mutexes from the
> "novalidate" class.
> 
> Accordingly, as a middle ground the mutex in each non-static struct
> device will be placed in its own unique locking class.  This approach
> gives up some of lockdep's advantages (for example, all devices having
> a particular bus_type or device_type might reasonably be put into the
> same locking class), but it should at least allow us to gain the
> benefit of some of lockdep's capabilities.
> 
> Link: https://syzkaller.appspot.com/bug?extid=2d6ac90723742279e101 [1]
> Link: https://syzkaller.appspot.com/bug?extid=2e39bc6569d281acbcfb [2]
> Link: https://lore.kernel.org/all/28a82f50-39d5-a45f-7c7a-57a66cec0741@xxxxxxxxxxxxxxxxxxx/
> Suggested-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Boqun Feng <boqun.feng@xxxxxxxxx>

My main worry when adding a ton of classes like this is the
combinatorics (dynamic classes make this more trivial, but it's entirely
possible with just static classes too).

As an example, we used to have a static class per cpu runqueue, now,
given migration takes two runqueue locks (per locking rules in cpu
order -- source and dest runqueue etc..) we got O(n^2) combinations of
class orderings, which lead to a giant graph, which led to both
performance and memory usage issues.

>From this was born the subclass, which reduced the whole thing to a
static ordering of runqueue-1 goes inside runqueue-0.

Similar combinatoric issues have cropped up in other places from time to
time, typically solved by using a different annotation.

Now, given the whole device thing is more or less a tree, hierarchical
locking should limit that, the only worry is that sibling locking while
holding parent thing. If siblings are of different classes, things will
both complain and create combinatorics again.