RE: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

"Chen, Guchun" <Guchun.Chen@xxxxxxx> · Fri, 7 Jul 2023 01:06:00 +0000

[Public]

> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> Sent: Thursday, July 6, 2023 7:25 PM
> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>; Milinkovic, Dusica
> <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui,
> Flora <Flora.Cui@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/amdgpu/vkms: relax timer deactivation by
> hrtimer_try_to_cancel
>
> Am 06.07.23 um 10:35 schrieb Guchun Chen:
> > In below thousands of screen rotation loop tests with virtual display
> > enabled, a CPU hard lockup issue may happen, leading system to
> > unresponsive and crash.
> >
> > do {
> >     xrandr --output Virtual --rotate inverted
> >     xrandr --output Virtual --rotate right
> >     xrandr --output Virtual --rotate left
> >     xrandr --output Virtual --rotate normal } while (1);
> >
> > NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
> >
> > ? hrtimer_run_softirq+0x140/0x140
> > ? store_vblank+0xe0/0xe0 [drm]
> > hrtimer_cancel+0x15/0x30
> > amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
> > drm_vblank_disable_and_save+0x185/0x1f0 [drm]
> > drm_crtc_vblank_off+0x159/0x4c0 [drm]
> > ? record_print_text.cold+0x11/0x11
> > ? wait_for_completion_timeout+0x232/0x280
> > ? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ?
> > bit_wait_io_timeout+0xe0/0xe0 ?
> > wait_for_completion_interruptible+0x1d7/0x320
> > ? mutex_unlock+0x81/0xd0
> > amdgpu_vkms_crtc_atomic_disable
> >
> > It's caused by a stuck in lock dependency in such scenario on
> > different CPUs.
> >
> > CPU1                                             CPU2
> > drm_crtc_vblank_off                              hrtimer_interrupt
> >      grab event_lock (irq disabled)                   __hrtimer_run_queues
> >          grab vbl_lock/vblank_time_block
> amdgpu_vkms_vblank_simulate
> >              amdgpu_vkms_disable_vblank                       drm_handle_vblank
> >                  hrtimer_cancel                                   grab dev->event_lock
> >
> > So CPU1 stucks in hrtimer_cancel as timer callback is running endless
> > on current clock base, as that timer queue on CPU2 has no chance to
> > finish it because of failing to hold the lock. So NMI watchdog will
> > throw the errors after its threshold, and all later CPUs are
> impacted/blocked.
> >
> > So use hrtimer_try_to_cancel to fix this, as disable_vblank callback
> > does not need to wait the handler to finish. And also it's not
> > necessary to check the return value of hrtimer_try_to_cancel, because
> > even if it's
> > -1 which means current timer callback is running, it will be
> > reprogrammed in hrtimer_start with calling enable_vblank to make it works.
> >
> > Cc: stable@xxxxxxxxxxxxxxx
> > Suggested-by: Christian König <christian.koenig@xxxxxxx>
> > Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > index 53ff91fc6cf6..70fb0df039e3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > @@ -81,7 +81,7 @@ static void amdgpu_vkms_disable_vblank(struct
> drm_crtc *crtc)
> >   {
> >     struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
> >
> > -   hrtimer_cancel(&amdgpu_crtc->vblank_timer);
> > +   hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
>
> That's a first step, but not sufficient.
>
> You also need to change the "return HRTIMER_RESTART;" in
> amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is
> enabled.
>
> Finally I strongly suggest to implement a amdgpu_vkms_destroy() function to
> make sure the HRTIMER is properly cleaned up.

Good suggestion, will fix it in V2.

Regards,
Guchun
> Regards,
> Christian.
>
> >   }
> >
> >   static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,