[Public] > -----Original Message----- > From: Koenig, Christian <Christian.Koenig@xxxxxxx> > Sent: Thursday, July 6, 2023 7:25 PM > To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; Milinkovic, Dusica > <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui, > Flora <Flora.Cui@xxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/amdgpu/vkms: relax timer deactivation by > hrtimer_try_to_cancel > > Am 06.07.23 um 10:35 schrieb Guchun Chen: > > In below thousands of screen rotation loop tests with virtual display > > enabled, a CPU hard lockup issue may happen, leading system to > > unresponsive and crash. > > > > do { > > xrandr --output Virtual --rotate inverted > > xrandr --output Virtual --rotate right > > xrandr --output Virtual --rotate left > > xrandr --output Virtual --rotate normal } while (1); > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 4 > > > > ? hrtimer_run_softirq+0x140/0x140 > > ? store_vblank+0xe0/0xe0 [drm] > > hrtimer_cancel+0x15/0x30 > > amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu] > > drm_vblank_disable_and_save+0x185/0x1f0 [drm] > > drm_crtc_vblank_off+0x159/0x4c0 [drm] > > ? record_print_text.cold+0x11/0x11 > > ? wait_for_completion_timeout+0x232/0x280 > > ? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ? > > bit_wait_io_timeout+0xe0/0xe0 ? > > wait_for_completion_interruptible+0x1d7/0x320 > > ? mutex_unlock+0x81/0xd0 > > amdgpu_vkms_crtc_atomic_disable > > > > It's caused by a stuck in lock dependency in such scenario on > > different CPUs. > > > > CPU1 CPU2 > > drm_crtc_vblank_off hrtimer_interrupt > > grab event_lock (irq disabled) __hrtimer_run_queues > > grab vbl_lock/vblank_time_block > amdgpu_vkms_vblank_simulate > > amdgpu_vkms_disable_vblank drm_handle_vblank > > hrtimer_cancel grab dev->event_lock > > > > So CPU1 stucks in hrtimer_cancel as timer callback is running endless > > on current clock base, as that timer queue on CPU2 has no chance to > > finish it because of failing to hold the lock. So NMI watchdog will > > throw the errors after its threshold, and all later CPUs are > impacted/blocked. > > > > So use hrtimer_try_to_cancel to fix this, as disable_vblank callback > > does not need to wait the handler to finish. And also it's not > > necessary to check the return value of hrtimer_try_to_cancel, because > > even if it's > > -1 which means current timer callback is running, it will be > > reprogrammed in hrtimer_start with calling enable_vblank to make it works. > > > > Cc: stable@xxxxxxxxxxxxxxx > > Suggested-by: Christian König <christian.koenig@xxxxxxx> > > Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > index 53ff91fc6cf6..70fb0df039e3 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > @@ -81,7 +81,7 @@ static void amdgpu_vkms_disable_vblank(struct > drm_crtc *crtc) > > { > > struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc); > > > > - hrtimer_cancel(&amdgpu_crtc->vblank_timer); > > + hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer); > > That's a first step, but not sufficient. > > You also need to change the "return HRTIMER_RESTART;" in > amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is > enabled. > > Finally I strongly suggest to implement a amdgpu_vkms_destroy() function to > make sure the HRTIMER is properly cleaned up. Good suggestion, will fix it in V2. Regards, Guchun > Regards, > Christian. > > > } > > > > static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,