[Public] > -----Original Message----- > From: Chen, Guchun > Sent: Friday, July 7, 2023 9:06 AM > To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; Milinkovic, Dusica > <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui, > Flora <Flora.Cui@xxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > Subject: RE: [PATCH] drm/amdgpu/vkms: relax timer deactivation by > hrtimer_try_to_cancel > > > > > -----Original Message----- > > From: Koenig, Christian <Christian.Koenig@xxxxxxx> > > Sent: Thursday, July 6, 2023 7:25 PM > > To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd- > > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > > <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; > > Milinkovic, Dusica <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola > > <Nikola.Prica@xxxxxxx>; Cui, Flora <Flora.Cui@xxxxxxx> > > Cc: stable@xxxxxxxxxxxxxxx > > Subject: Re: [PATCH] drm/amdgpu/vkms: relax timer deactivation by > > hrtimer_try_to_cancel > > > > Am 06.07.23 um 10:35 schrieb Guchun Chen: > > > In below thousands of screen rotation loop tests with virtual > > > display enabled, a CPU hard lockup issue may happen, leading system > > > to unresponsive and crash. > > > > > > do { > > > xrandr --output Virtual --rotate inverted > > > xrandr --output Virtual --rotate right > > > xrandr --output Virtual --rotate left > > > xrandr --output Virtual --rotate normal } while (1); > > > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 4 > > > > > > ? hrtimer_run_softirq+0x140/0x140 > > > ? store_vblank+0xe0/0xe0 [drm] > > > hrtimer_cancel+0x15/0x30 > > > amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu] > > > drm_vblank_disable_and_save+0x185/0x1f0 [drm] > > > drm_crtc_vblank_off+0x159/0x4c0 [drm] ? > > > record_print_text.cold+0x11/0x11 ? > > > wait_for_completion_timeout+0x232/0x280 > > > ? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ? > > > bit_wait_io_timeout+0xe0/0xe0 ? > > > wait_for_completion_interruptible+0x1d7/0x320 > > > ? mutex_unlock+0x81/0xd0 > > > amdgpu_vkms_crtc_atomic_disable > > > > > > It's caused by a stuck in lock dependency in such scenario on > > > different CPUs. > > > > > > CPU1 CPU2 > > > drm_crtc_vblank_off hrtimer_interrupt > > > grab event_lock (irq disabled) __hrtimer_run_queues > > > grab vbl_lock/vblank_time_block > > amdgpu_vkms_vblank_simulate > > > amdgpu_vkms_disable_vblank drm_handle_vblank > > > hrtimer_cancel grab dev->event_lock > > > > > > So CPU1 stucks in hrtimer_cancel as timer callback is running > > > endless on current clock base, as that timer queue on CPU2 has no > > > chance to finish it because of failing to hold the lock. So NMI > > > watchdog will throw the errors after its threshold, and all later > > > CPUs are > > impacted/blocked. > > > > > > So use hrtimer_try_to_cancel to fix this, as disable_vblank callback > > > does not need to wait the handler to finish. And also it's not > > > necessary to check the return value of hrtimer_try_to_cancel, > > > because even if it's > > > -1 which means current timer callback is running, it will be > > > reprogrammed in hrtimer_start with calling enable_vblank to make it > works. > > > > > > Cc: stable@xxxxxxxxxxxxxxx > > > Suggested-by: Christian König <christian.koenig@xxxxxxx> > > > Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > > index 53ff91fc6cf6..70fb0df039e3 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > > @@ -81,7 +81,7 @@ static void amdgpu_vkms_disable_vblank(struct > > drm_crtc *crtc) > > > { > > > struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc); > > > > > > - hrtimer_cancel(&amdgpu_crtc->vblank_timer); > > > + hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer); > > > > That's a first step, but not sufficient. > > > > You also need to change the "return HRTIMER_RESTART;" in > > amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is > > enabled. > > > > Finally I strongly suggest to implement a amdgpu_vkms_destroy() > > function to make sure the HRTIMER is properly cleaned up. > > Good suggestion, will fix it in V2. > Hi Christian, I just sent out patch v2 to address the return value problem in amdgpu_vkms_vblank_simulate. Regarding HRTIMER cleanup, it's handled in sw_fini in amdgpu_vkms code, I think so far it's good. Anyway, we can continue the discussion in the new patch set thread. Regards, Guchun > Regards, > Guchun > > Regards, > > Christian. > > > > > } > > > > > > static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc > > > *crtc,