On Wed, Sep 13 2023 at 09:47, Linus Torvalds wrote: > On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa > <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote: >> >> Hello. A deadlock was reported in drivers/gpu/drm/vkms/ . >> It looks like this locking pattern remains as of 6.6-rc1. Please fix. >> >> void drm_crtc_vblank_off(struct drm_crtc *crtc) { >> spin_lock_irq(&dev->event_lock); >> drm_vblank_disable_and_save(dev, pipe) { >> __disable_vblank(dev, pipe) { >> crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank { >> hrtimer_cancel(&out->vblank_hrtimer) { // Retries with dev->event_lock held until lock_hrtimer_base() succeeds. >> ret = hrtimer_try_to_cancel(timer) { >> base = lock_hrtimer_base(timer, &flags); // Fails forever because vkms_vblank_simulate() is in progress. > > Heh. Ok. This is clearly a bug, but it does seem to be limited to just > the vkms driver, and literally only to the "simulate vblank" case. > > The worst part about it is that it's so subtle and not obvious. > > Some light grepping seems to show that amdgpu has almost the exact > same pattern in its own vkms thing, except it uses > > hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer); > > directly, which presumably fixes the livelock, but means that the > cancel will fail if it's currently running. > > So just doing the same thing in the vkms driver probably fixes things. > > Maybe the vkms people need to add a flag to say "it's canceled" so > that it doesn't then get re-enabled? Or maybe it doesn't matter > and/or already happens for some reason I didn't look into. Maybe the VKMS people need to understand locking in the first place. The first thing I saw in this code is: static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer) { ... mutex_unlock(&output->enabled_lock); What? Unlocking a mutex in the context of a hrtimer callback is simply violating all mutex locking rules. How has this code ever survived lock debugging without triggering a big fat warning? Thanks, tglx