On Fri, Dec 18, 2020 at 5:10 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Thu, 17 Dec 2020 11:03:20 +0100
> Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
>
> > I think we're tripping over the might_sleep() all the mutexes have,
> > and that's not as good as yours, but good enough to catch a missing
> > rcu_read_unlock(). That's kinda why I'm baffled, since like almost
> > every 2nd function in the backtrace grabbed a mutex and it was all
> > fine until the very last.
> >
> > I think it would be really nice if the rcu checks could retain (in
> > debugging only) the backtrace of the outermost rcu_read_lock, so we
> > could print that when something goes wrong in cases where it's leaked.
> > For normal locks lockdep does that already (well not full backtrace I
> > think, just the function that acquired the lock, but that's often
> > enough). I guess that doesn't exist yet?
> >
> > Also yes without reproducer this is kinda tough nut to crack.
>
> I'm looking at drm_client_modeset_commit_atomic(), where it triggered after
> the "retry:" label, which to get to, does a bit of goto spaghetti, with
> a -EDEADLK detected and a goto backoff, which calls goto retry, and then
> the next mutex taken is the one that triggers the bug.

This is standard drm locking spaghetti using ww_mutexes. Enable
CONFIG_DEBUG_WW_MUTEX_SLOWPATH and you'll hit this all the time, in
all kinds of situations. We're using this all the time because it's
way too easy to get the error cases wrong.

> As this is hard to reproduce, but reproducible by a fuzzer, I'm guessing
> there's some error return path somewhere in there that doesn't release an
> rcu_read_lock().

We're maybe a bit too happy to use funny locking schemes like
ww_mutex, but at least to my knowledge there's no rcu anywhere near
these. Or preempt disable fwiw (which I think the consensus is the
more likely culprit). So I have no idea how this leaks.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
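
For anyone following along who hasn't stared at this pattern before, here is a rough userspace sketch of the retry/backoff flow described above. It is not the drm or ww_mutex API: every name in it (mock_lock, mock_ctx, mock_lock_acquire, mock_drop_locks, mock_commit, fail_on_call) is made up for illustration, and the fail_on_call knob only loosely imitates the -EDEADLK injection that CONFIG_DEBUG_WW_MUTEX_SLOWPATH does in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the kernel API. */
#define MOCK_MAX_LOCKS 4

struct mock_lock {
	bool held;
};

struct mock_ctx {
	struct mock_lock *held[MOCK_MAX_LOCKS]; /* locks taken so far */
	size_t nheld;
	int calls;        /* number of acquire attempts made */
	int fail_on_call; /* 1-based attempt that returns -EDEADLK, 0 = never */
	int retries;      /* how often the backoff path ran */
};

/* Take one lock, or pretend we lost a deadlock race on the chosen attempt. */
static int mock_lock_acquire(struct mock_lock *lock, struct mock_ctx *ctx)
{
	ctx->calls++;
	if (ctx->calls == ctx->fail_on_call)
		return -EDEADLK;

	assert(ctx->nheld < MOCK_MAX_LOCKS);
	lock->held = true;
	ctx->held[ctx->nheld++] = lock;
	return 0;
}

/* Release everything recorded in the context. */
static void mock_drop_locks(struct mock_ctx *ctx)
{
	while (ctx->nheld > 0)
		ctx->held[--ctx->nheld]->held = false;
}

/* Same label layout as the goto flow discussed in the thread. */
static int mock_commit(struct mock_lock *a, struct mock_lock *b,
		       struct mock_ctx *ctx)
{
	int ret;

retry:
	ret = mock_lock_acquire(a, ctx);
	if (ret)
		goto out_state;

	ret = mock_lock_acquire(b, ctx);
	if (ret)
		goto out_state;

	/* The actual work would happen here with both locks held. */

out_state:
	if (ret == -EDEADLK)
		goto backoff;

	mock_drop_locks(ctx);
	return ret;

backoff:
	/* Drop everything we hold, then start over from the top. */
	mock_drop_locks(ctx);
	ctx->retries++;
	goto retry;
}
```

Setting fail_on_call to 2 makes the second acquire fail while the first lock is still held, which is the interesting case: the backoff path has to drop that first lock before jumping back to retry. Anything else taken between retry and the failure point needs the same treatment on the way out, and that is where error paths tend to go wrong.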