[Public] > -----Original Message----- > From: Koenig, Christian <Christian.Koenig@xxxxxxx> > Sent: Monday, July 10, 2023 6:16 PM > To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; Milinkovic, Dusica > <Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui, > Flora <Flora.Cui@xxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > Subject: Re: [PATCH v2] drm/amdgpu/vkms: relax timer deactivation by > hrtimer_try_to_cancel > > > > Am 10.07.23 um 08:38 schrieb Guchun Chen: > > In below thousands of screen rotation loop tests with virtual display > > enabled, a CPU hard lockup issue may happen, leading system to > > unresponsive and crash. > > > > do { > > xrandr --output Virtual --rotate inverted > > xrandr --output Virtual --rotate right > > xrandr --output Virtual --rotate left > > xrandr --output Virtual --rotate normal } while (1); > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 > > > > ? hrtimer_run_softirq+0x140/0x140 > > ? store_vblank+0xe0/0xe0 [drm] > > hrtimer_cancel+0x15/0x30 > > amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu] > > drm_vblank_disable_and_save+0x185/0x1f0 [drm] > > drm_crtc_vblank_off+0x159/0x4c0 [drm] > > ? record_print_text.cold+0x11/0x11 > > ? wait_for_completion_timeout+0x232/0x280 > > ? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ? > > bit_wait_io_timeout+0xe0/0xe0 ? > > wait_for_completion_interruptible+0x1d7/0x320 > > ? mutex_unlock+0x81/0xd0 > > amdgpu_vkms_crtc_atomic_disable > > > > It's caused by a stuck in lock dependency in such scenario on > > different CPUs. > > > > CPU1 CPU2 > > drm_crtc_vblank_off hrtimer_interrupt > > grab event_lock (irq disabled) __hrtimer_run_queues > > grab vbl_lock/vblank_time_block > amdgpu_vkms_vblank_simulate > > amdgpu_vkms_disable_vblank drm_handle_vblank > > hrtimer_cancel grab dev->event_lock > > > > So CPU1 stucks in hrtimer_cancel as timer callback is running endless > > on current clock base, as that timer queue on CPU2 has no chance to > > finish it because of failing to hold the lock. So NMI watchdog will > > throw the errors after its threshold, and all later CPUs are > impacted/blocked. > > > > So use hrtimer_try_to_cancel to fix this, as disable_vblank callback > > does not need to wait the handler to finish. And also it's not > > necessary to check the return value of hrtimer_try_to_cancel, because > > even if it's > > -1 which means current timer callback is running, it will be > > reprogrammed in hrtimer_start with calling enable_vblank to make it works. > > > > v2: only re-arm timer when vblank is enabled (Christian) and add a > > Fixes tag as well > > > > Fixes: 84ec374bd580("drm/amdgpu: create amdgpu_vkms (v4)") > > Cc: stable@xxxxxxxxxxxxxxx > > Suggested-by: Christian König <christian.koenig@xxxxxxx> > > Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 15 ++++++++++++--- > > 1 file changed, 12 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > index 53ff91fc6cf6..44d704306f44 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c > > @@ -46,7 +46,10 @@ static enum hrtimer_restart > amdgpu_vkms_vblank_simulate(struct hrtimer *timer) > > struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct > amdgpu_crtc, vblank_timer); > > struct drm_crtc *crtc = &amdgpu_crtc->base; > > struct amdgpu_vkms_output *output = > > drm_crtc_to_amdgpu_vkms_output(crtc); > > + struct drm_vblank_crtc *vblank; > > + struct drm_device *dev; > > u64 ret_overrun; > > + unsigned int pipe; > > bool ret; > > > > ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer, > > @@ -54,9 +57,15 @@ static enum hrtimer_restart > amdgpu_vkms_vblank_simulate(struct hrtimer *timer) > > if (ret_overrun != 1) > > DRM_WARN("%s: vblank timer overrun\n", __func__); > > > > + dev = crtc->dev; > > + pipe = drm_crtc_index(crtc); > > + vblank = &dev->vblank[pipe]; > > ret = drm_crtc_handle_vblank(crtc); > > - if (!ret) > > - DRM_ERROR("amdgpu_vkms failure on handling vblank"); > > + if (!ret && !READ_ONCE(vblank->enabled)) { > > + /* Don't queue timer again when vblank is disabled. */ > > + DRM_WARN("amdgpu_vkms failure on handling vblank\n"); > > You should probably only print the warning when really an error happened. > > Disabling the vblank and not firing the timer again is a perfectly normal > operation. > > Apart from that looks good to me, > Christian. Thanks for your comment, Christian. Then I think I can drop the printing, as drm_crtc_handle_vblank have two return cases, one is false when vblank is disabled, and the other is always true. I will fix this in patch v3. Regards, Guchun > > + return HRTIMER_NORESTART; > > + } > > > > return HRTIMER_RESTART; > > } > > @@ -81,7 +90,7 @@ static void amdgpu_vkms_disable_vblank(struct > drm_crtc *crtc) > > { > > struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc); > > > > - hrtimer_cancel(&amdgpu_crtc->vblank_timer); > > + hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer); > > } > > > > static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,