-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Tuesday, July 11, 2023 5:09 PM
To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-
gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
<Alexander.Deucher@xxxxxxx>; Zhang, Hawking
<Hawking.Zhang@xxxxxxx>; Koenig, Christian
<Christian.Koenig@xxxxxxx>; Milinkovic, Dusica
<Dusica.Milinkovic@xxxxxxx>; Prica, Nikola <Nikola.Prica@xxxxxxx>; Cui,
Flora <Flora.Cui@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
hrtimer_try_to_cancel
Am 11.07.23 um 03:38 schrieb Guchun Chen:
In below thousands of screen rotation loop tests with virtual display
enabled, a CPU hard lockup issue may happen, leading system to
unresponsive and crash.
do {
xrandr --output Virtual --rotate inverted
xrandr --output Virtual --rotate right
xrandr --output Virtual --rotate left
xrandr --output Virtual --rotate normal } while (1);
NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
? hrtimer_run_softirq+0x140/0x140
? store_vblank+0xe0/0xe0 [drm]
hrtimer_cancel+0x15/0x30
amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
drm_vblank_disable_and_save+0x185/0x1f0 [drm]
drm_crtc_vblank_off+0x159/0x4c0 [drm]
? record_print_text.cold+0x11/0x11
? wait_for_completion_timeout+0x232/0x280
? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ?
bit_wait_io_timeout+0xe0/0xe0 ?
wait_for_completion_interruptible+0x1d7/0x320
? mutex_unlock+0x81/0xd0
amdgpu_vkms_crtc_atomic_disable
It's caused by a stuck in lock dependency in such scenario on
different CPUs.
CPU1 CPU2
drm_crtc_vblank_off hrtimer_interrupt
grab event_lock (irq disabled) __hrtimer_run_queues
grab vbl_lock/vblank_time_block
amdgpu_vkms_vblank_simulate
amdgpu_vkms_disable_vblank drm_handle_vblank
hrtimer_cancel grab dev->event_lock
So CPU1 stucks in hrtimer_cancel as timer callback is running endless
on current clock base, as that timer queue on CPU2 has no chance to
finish it because of failing to hold the lock. So NMI watchdog will
throw the errors after its threshold, and all later CPUs are
impacted/blocked.
So use hrtimer_try_to_cancel to fix this, as disable_vblank callback
does not need to wait the handler to finish. And also it's not
necessary to check the return value of hrtimer_try_to_cancel, because
even if it's
-1 which means current timer callback is running, it will be
reprogrammed in hrtimer_start with calling enable_vblank to make it works.
v2: only re-arm timer when vblank is enabled (Christian) and add a
Fixes tag as well
v3: drop warn printing (Christian)
Fixes: 84ec374bd580("drm/amdgpu: create amdgpu_vkms (v4)")
Cc: stable@xxxxxxxxxxxxxxx
Suggested-by: Christian König <christian.koenig@xxxxxxx>
Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 53ff91fc6cf6..b870c827cbaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -46,7 +46,10 @@ static enum hrtimer_restart
amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct
amdgpu_crtc, vblank_timer);
struct drm_crtc *crtc = &amdgpu_crtc->base;
struct amdgpu_vkms_output *output =
drm_crtc_to_amdgpu_vkms_output(crtc);
+ struct drm_vblank_crtc *vblank;
+ struct drm_device *dev;
u64 ret_overrun;
+ unsigned int pipe;
bool ret;
ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
@@ -54,9 +57,13 @@ static enum hrtimer_restart
amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
if (ret_overrun != 1)
DRM_WARN("%s: vblank timer overrun\n", __func__);
+ dev = crtc->dev;
+ pipe = drm_crtc_index(crtc);
+ vblank = &dev->vblank[pipe];
ret = drm_crtc_handle_vblank(crtc);
- if (!ret)
- DRM_ERROR("amdgpu_vkms failure on handling vblank");
+ /* Don't queue timer again when vblank is disabled. */
+ if (!ret && !READ_ONCE(vblank->enabled))
+ return HRTIMER_NORESTART;
When drm_crtc_handle_vblank() returns false when vblank is disabled I think
we can simplify this to just removing the error.
Regards,
Christian.