RE: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

"Chen, Guchun" <Guchun.Chen@xxxxxxx> · Tue, 11 Jul 2023 09:31:22 +0000

[Public]

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> Sent: Tuesday, July 11, 2023 5:23 PM
> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; Milinkovic, Dusica
> <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui,
> Flora <Flora.Cui@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
> hrtimer_try_to_cancel
>
> Am 11.07.23 um 11:15 schrieb Chen, Guchun:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> >> Sent: Tuesday, July 11, 2023 5:09 PM
> >> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-
> >> gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> >> <Alexander.Deucher@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>;
> >> Koenig, Christian <Christian.Koenig@xxxxxxx>; Milinkovic, Dusica
> >> <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>;
> >> Cui, Flora <Flora.Cui@xxxxxxx>
> >> Cc: stable@xxxxxxxxxxxxxxx
> >> Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
> >> hrtimer_try_to_cancel
> >>
> >>
> >>
> >> Am 11.07.23 um 03:38 schrieb Guchun Chen:
> >>> In below thousands of screen rotation loop tests with virtual
> >>> display enabled, a CPU hard lockup issue may happen, leading system
> >>> to unresponsive and crash.
> >>>
> >>> do {
> >>>      xrandr --output Virtual --rotate inverted
> >>>      xrandr --output Virtual --rotate right
> >>>      xrandr --output Virtual --rotate left
> >>>      xrandr --output Virtual --rotate normal } while (1);
> >>>
> >>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> >>>
> >>> ? hrtimer_run_softirq+0x140/0x140
> >>> ? store_vblank+0xe0/0xe0 [drm]
> >>> hrtimer_cancel+0x15/0x30
> >>> amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
> >>> drm_vblank_disable_and_save+0x185/0x1f0 [drm]
> >>> drm_crtc_vblank_off+0x159/0x4c0 [drm] ?
> >>> record_print_text.cold+0x11/0x11 ?
> >>> wait_for_completion_timeout+0x232/0x280
> >>> ? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ?
> >>> bit_wait_io_timeout+0xe0/0xe0 ?
> >>> wait_for_completion_interruptible+0x1d7/0x320
> >>> ? mutex_unlock+0x81/0xd0
> >>> amdgpu_vkms_crtc_atomic_disable
> >>>
> >>> It's caused by a stuck in lock dependency in such scenario on
> >>> different CPUs.
> >>>
> >>> CPU1                                             CPU2
> >>> drm_crtc_vblank_off                              hrtimer_interrupt
> >>>       grab event_lock (irq disabled)                   __hrtimer_run_queues
> >>>           grab vbl_lock/vblank_time_block
> >> amdgpu_vkms_vblank_simulate
> >>>               amdgpu_vkms_disable_vblank                       drm_handle_vblank
> >>>                   hrtimer_cancel                                         grab dev->event_lock
> >>>
> >>> So CPU1 stucks in hrtimer_cancel as timer callback is running
> >>> endless on current clock base, as that timer queue on CPU2 has no
> >>> chance to finish it because of failing to hold the lock. So NMI
> >>> watchdog will throw the errors after its threshold, and all later
> >>> CPUs are
> >> impacted/blocked.
> >>> So use hrtimer_try_to_cancel to fix this, as disable_vblank callback
> >>> does not need to wait the handler to finish. And also it's not
> >>> necessary to check the return value of hrtimer_try_to_cancel,
> >>> because even if it's
> >>> -1 which means current timer callback is running, it will be
> >>> reprogrammed in hrtimer_start with calling enable_vblank to make it
> works.
> >>>
> >>> v2: only re-arm timer when vblank is enabled (Christian) and add a
> >>> Fixes tag as well
> >>>
> >>> v3: drop warn printing (Christian)
> >>>
> >>> Fixes: 84ec374bd580("drm/amdgpu: create amdgpu_vkms (v4)")
> >>> Cc: stable@xxxxxxxxxxxxxxx
> >>> Suggested-by: Christian König <christian.koenig@xxxxxxx>
> >>> Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 13 ++++++++++---
> >>>    1 file changed, 10 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> >>> index 53ff91fc6cf6..b870c827cbaa 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> >>> @@ -46,7 +46,10 @@ static enum hrtimer_restart
> >> amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
> >>>      struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct
> >> amdgpu_crtc, vblank_timer);
> >>>      struct drm_crtc *crtc = &amdgpu_crtc->base;
> >>>      struct amdgpu_vkms_output *output =
> >>> drm_crtc_to_amdgpu_vkms_output(crtc);
> >>> +   struct drm_vblank_crtc *vblank;
> >>> +   struct drm_device *dev;
> >>>      u64 ret_overrun;
> >>> +   unsigned int pipe;
> >>>      bool ret;
> >>>
> >>>      ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
> >>> @@ -54,9 +57,13 @@ static enum hrtimer_restart
> >> amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
> >>>      if (ret_overrun != 1)
> >>>              DRM_WARN("%s: vblank timer overrun\n", __func__);
> >>>
> >>> +   dev = crtc->dev;
> >>> +   pipe = drm_crtc_index(crtc);
> >>> +   vblank = &dev->vblank[pipe];
> >>>      ret = drm_crtc_handle_vblank(crtc);
> >>> -   if (!ret)
> >>> -           DRM_ERROR("amdgpu_vkms failure on handling vblank");
> >>> +   /* Don't queue timer again when vblank is disabled. */
> >>> +   if (!ret && !READ_ONCE(vblank->enabled))
> >>> +           return HRTIMER_NORESTART;
> >> When drm_crtc_handle_vblank() returns false when vblank is disabled I
> >> think we can simplify this to just removing the error.
> >>
> >> Regards,
> >> Christian.
> > Sorry, I didn't get you. What do you mean by "removing the error"?
>
> We should just remove the "DRM_ERROR("amdgpu_vkms failure on handling
> vblank");" message.
>
> When the drm_crtc_handle_vblank() returns false it doesn't really indicate a
> failure, it just indicates that the vblank is disabled and shouldn't be re-armed.
>
> Regards,
> Christian.

drm_crtc_handle_vblank which is a wrapper of drm_handle_vblank, has two early test calls to check if vblank is initialized. Though this will never happen in our case, I still check the value of vblank->enabled when drm_crtc_handle_vblank returns false, and when it's indeed disabled, return HRTIMER_NORESTART to not re-arm timer, otherwise, returning HRTIMER_RESTART when it's going as expected.

Anything wrong or am I misunderstanding it?

Regards,
Guchun
> >
> > Regards,
> > Guchun
> >>>      return HRTIMER_RESTART;
> >>>    }
> >>> @@ -81,7 +88,7 @@ static void amdgpu_vkms_disable_vblank(struct
> >> drm_crtc *crtc)
> >>>    {
> >>>      struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
> >>>
> >>> -   hrtimer_cancel(&amdgpu_crtc->vblank_timer);
> >>> +   hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
> >>>    }
> >>>
> >>>    static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc
> >>> *crtc,