[Public] > -----Original Message----- > From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > Sent: Tuesday, July 11, 2023 5:23 PM > To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; Koenig, Christian > <Christian.Koenig@xxxxxxx>; Milinkovic, Dusica > <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui, > Flora <Flora.Cui@xxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by > hrtimer_try_to_cancel > > Am 11.07.23 um 11:15 schrieb Chen, Guchun: > > [Public] > > > >> -----Original Message----- > >> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > >> Sent: Tuesday, July 11, 2023 5:09 PM > >> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd- > >> gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > >> <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; > >> Koenig, Christian <Christian.Koenig@xxxxxxx>; Milinkovic, Dusica > >> <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; > >> Cui, Flora <Flora.Cui@xxxxxxx> > >> Cc: stable@xxxxxxxxxxxxxxx > >> Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by > >> hrtimer_try_to_cancel > >> > >> > >> > >> Am 11.07.23 um 03:38 schrieb Guchun Chen: > >>> In below thousands of screen rotation loop tests with virtual > >>> display enabled, a CPU hard lockup issue may happen, leading system > >>> to unresponsive and crash. > >>> > >>> do { > >>> xrandr --output Virtual --rotate inverted > >>> xrandr --output Virtual --rotate right > >>> xrandr --output Virtual --rotate left > >>> xrandr --output Virtual --rotate normal } while (1); > >>> > >>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 > >>> > >>> ? hrtimer_run_softirq+0x140/0x140 > >>> ? store_vblank+0xe0/0xe0 [drm] > >>> hrtimer_cancel+0x15/0x30 > >>> amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu] > >>> drm_vblank_disable_and_save+0x185/0x1f0 [drm] > >>> drm_crtc_vblank_off+0x159/0x4c0 [drm] ? > >>> record_print_text.cold+0x11/0x11 ? > >>> wait_for_completion_timeout+0x232/0x280 > >>> ? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ? > >>> bit_wait_io_timeout+0xe0/0xe0 ? > >>> wait_for_completion_interruptible+0x1d7/0x320 > >>> ? mutex_unlock+0x81/0xd0 > >>> amdgpu_vkms_crtc_atomic_disable > >>> > >>> It's caused by a stuck in lock dependency in such scenario on > >>> different CPUs. > >>> > >>> CPU1 CPU2 > >>> drm_crtc_vblank_off hrtimer_interrupt > >>> grab event_lock (irq disabled) __hrtimer_run_queues > >>> grab vbl_lock/vblank_time_block > >> amdgpu_vkms_vblank_simulate > >>> amdgpu_vkms_disable_vblank drm_handle_vblank > >>> hrtimer_cancel grab dev->event_lock > >>> > >>> So CPU1 stucks in hrtimer_cancel as timer callback is running > >>> endless on current clock base, as that timer queue on CPU2 has no > >>> chance to finish it because of failing to hold the lock. So NMI > >>> watchdog will throw the errors after its threshold, and all later > >>> CPUs are > >> impacted/blocked. > >>> So use hrtimer_try_to_cancel to fix this, as disable_vblank callback > >>> does not need to wait the handler to finish. And also it's not > >>> necessary to check the return value of hrtimer_try_to_cancel, > >>> because even if it's > >>> -1 which means current timer callback is running, it will be > >>> reprogrammed in hrtimer_start with calling enable_vblank to make it > works. > >>> > >>> v2: only re-arm timer when vblank is enabled (Christian) and add a > >>> Fixes tag as well > >>> > >>> v3: drop warn printing (Christian) > >>> > >>> Fixes: 84ec374bd580("drm/amdgpu: create amdgpu_vkms (v4)") > >>> Cc: stable@xxxxxxxxxxxxxxx > >>> Suggested-by: Christian König <christian.koenig@xxxxxxx> > >>> Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 13 ++++++++++--- > >>> 1 file changed, 10 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > >>> index 53ff91fc6cf6..b870c827cbaa 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > >>> @@ -46,7 +46,10 @@ static enum hrtimer_restart > >> amdgpu_vkms_vblank_simulate(struct hrtimer *timer) > >>> struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct > >> amdgpu_crtc, vblank_timer); > >>> struct drm_crtc *crtc = &amdgpu_crtc->base; > >>> struct amdgpu_vkms_output *output = > >>> drm_crtc_to_amdgpu_vkms_output(crtc); > >>> + struct drm_vblank_crtc *vblank; > >>> + struct drm_device *dev; > >>> u64 ret_overrun; > >>> + unsigned int pipe; > >>> bool ret; > >>> > >>> ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer, > >>> @@ -54,9 +57,13 @@ static enum hrtimer_restart > >> amdgpu_vkms_vblank_simulate(struct hrtimer *timer) > >>> if (ret_overrun != 1) > >>> DRM_WARN("%s: vblank timer overrun\n", __func__); > >>> > >>> + dev = crtc->dev; > >>> + pipe = drm_crtc_index(crtc); > >>> + vblank = &dev->vblank[pipe]; > >>> ret = drm_crtc_handle_vblank(crtc); > >>> - if (!ret) > >>> - DRM_ERROR("amdgpu_vkms failure on handling vblank"); > >>> + /* Don't queue timer again when vblank is disabled. */ > >>> + if (!ret && !READ_ONCE(vblank->enabled)) > >>> + return HRTIMER_NORESTART; > >> When drm_crtc_handle_vblank() returns false when vblank is disabled I > >> think we can simplify this to just removing the error. > >> > >> Regards, > >> Christian. > > Sorry, I didn't get you. What do you mean by "removing the error"? > > We should just remove the "DRM_ERROR("amdgpu_vkms failure on handling > vblank");" message. > > When the drm_crtc_handle_vblank() returns false it doesn't really indicate a > failure, it just indicates that the vblank is disabled and shouldn't be re-armed. > > Regards, > Christian. drm_crtc_handle_vblank which is a wrapper of drm_handle_vblank, has two early test calls to check if vblank is initialized. Though this will never happen in our case, I still check the value of vblank->enabled when drm_crtc_handle_vblank returns false, and when it's indeed disabled, return HRTIMER_NORESTART to not re-arm timer, otherwise, returning HRTIMER_RESTART when it's going as expected. Anything wrong or am I misunderstanding it? Regards, Guchun > > > > Regards, > > Guchun > >>> return HRTIMER_RESTART; > >>> } > >>> @@ -81,7 +88,7 @@ static void amdgpu_vkms_disable_vblank(struct > >> drm_crtc *crtc) > >>> { > >>> struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc); > >>> > >>> - hrtimer_cancel(&amdgpu_crtc->vblank_timer); > >>> + hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer); > >>> } > >>> > >>> static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc > >>> *crtc,