RE: [PATCH v2] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

"Chen, Guchun" <Guchun.Chen@xxxxxxx> · Tue, 11 Jul 2023 01:27:37 +0000

[Public]

> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> Sent: Monday, July 10, 2023 6:16 PM
> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>; Milinkovic, Dusica
> <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui,
> Flora <Flora.Cui@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v2] drm/amdgpu/vkms: relax timer deactivation by
> hrtimer_try_to_cancel
>
>
>
> Am 10.07.23 um 08:38 schrieb Guchun Chen:
> > In below thousands of screen rotation loop tests with virtual display
> > enabled, a CPU hard lockup issue may happen, leading system to
> > unresponsive and crash.
> >
> > do {
> >     xrandr --output Virtual --rotate inverted
> >     xrandr --output Virtual --rotate right
> >     xrandr --output Virtual --rotate left
> >     xrandr --output Virtual --rotate normal } while (1);
> >
> > NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> >
> > ? hrtimer_run_softirq+0x140/0x140
> > ? store_vblank+0xe0/0xe0 [drm]
> > hrtimer_cancel+0x15/0x30
> > amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
> > drm_vblank_disable_and_save+0x185/0x1f0 [drm]
> > drm_crtc_vblank_off+0x159/0x4c0 [drm]
> > ? record_print_text.cold+0x11/0x11
> > ? wait_for_completion_timeout+0x232/0x280
> > ? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ?
> > bit_wait_io_timeout+0xe0/0xe0 ?
> > wait_for_completion_interruptible+0x1d7/0x320
> > ? mutex_unlock+0x81/0xd0
> > amdgpu_vkms_crtc_atomic_disable
> >
> > It's caused by a stuck in lock dependency in such scenario on
> > different CPUs.
> >
> > CPU1                                             CPU2
> > drm_crtc_vblank_off                              hrtimer_interrupt
> >      grab event_lock (irq disabled)                   __hrtimer_run_queues
> >          grab vbl_lock/vblank_time_block
> amdgpu_vkms_vblank_simulate
> >              amdgpu_vkms_disable_vblank                       drm_handle_vblank
> >                  hrtimer_cancel                                       grab dev->event_lock
> >
> > So CPU1 stucks in hrtimer_cancel as timer callback is running endless
> > on current clock base, as that timer queue on CPU2 has no chance to
> > finish it because of failing to hold the lock. So NMI watchdog will
> > throw the errors after its threshold, and all later CPUs are
> impacted/blocked.
> >
> > So use hrtimer_try_to_cancel to fix this, as disable_vblank callback
> > does not need to wait the handler to finish. And also it's not
> > necessary to check the return value of hrtimer_try_to_cancel, because
> > even if it's
> > -1 which means current timer callback is running, it will be
> > reprogrammed in hrtimer_start with calling enable_vblank to make it works.
> >
> > v2: only re-arm timer when vblank is enabled (Christian) and add a
> > Fixes tag as well
> >
> > Fixes: 84ec374bd580("drm/amdgpu: create amdgpu_vkms (v4)")
> > Cc: stable@xxxxxxxxxxxxxxx
> > Suggested-by: Christian König <christian.koenig@xxxxxxx>
> > Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 15 ++++++++++++---
> >   1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > index 53ff91fc6cf6..44d704306f44 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > @@ -46,7 +46,10 @@ static enum hrtimer_restart
> amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
> >     struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct
> amdgpu_crtc, vblank_timer);
> >     struct drm_crtc *crtc = &amdgpu_crtc->base;
> >     struct amdgpu_vkms_output *output =
> > drm_crtc_to_amdgpu_vkms_output(crtc);
> > +   struct drm_vblank_crtc *vblank;
> > +   struct drm_device *dev;
> >     u64 ret_overrun;
> > +   unsigned int pipe;
> >     bool ret;
> >
> >     ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
> > @@ -54,9 +57,15 @@ static enum hrtimer_restart
> amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
> >     if (ret_overrun != 1)
> >             DRM_WARN("%s: vblank timer overrun\n", __func__);
> >
> > +   dev = crtc->dev;
> > +   pipe = drm_crtc_index(crtc);
> > +   vblank = &dev->vblank[pipe];
> >     ret = drm_crtc_handle_vblank(crtc);
> > -   if (!ret)
> > -           DRM_ERROR("amdgpu_vkms failure on handling vblank");
> > +   if (!ret && !READ_ONCE(vblank->enabled)) {
> > +           /* Don't queue timer again when vblank is disabled. */
> > +           DRM_WARN("amdgpu_vkms failure on handling vblank\n");
>
> You should probably only print the warning when really an error happened.
>
> Disabling the vblank and not firing the timer again is a perfectly normal
> operation.
>
> Apart from that looks good to me,
> Christian.

Thanks for your comment, Christian. Then I think I can drop the printing, as drm_crtc_handle_vblank have two return cases, one is false when vblank is disabled, and the other is always true.
I will fix this in patch v3.

Regards,
Guchun

> > +           return HRTIMER_NORESTART;
> > +   }
> >
> >     return HRTIMER_RESTART;
> >   }
> > @@ -81,7 +90,7 @@ static void amdgpu_vkms_disable_vblank(struct
> drm_crtc *crtc)
> >   {
> >     struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
> >
> > -   hrtimer_cancel(&amdgpu_crtc->vblank_timer);
> > +   hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
> >   }
> >
> >   static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,