[RFC PATCH 0/1] Protecting BO list corruption

Luben Tuikov <luben.tuikov@xxxxxxx> · Tue, 12 Jul 2022 01:39:23 -0400

After commit e68efb27647f21 ("drm/amdgpu: remove ctx->lock") removed the
context lock, we see BO list corruption, as documented in the bug linked
below. While reverting the removal of the context lock does fix the
issue, a more comprehensive approach was suggested, which this patch
implements. I'm currently running this kernel and it works fine.
However, when running IGT's amd_cs_nop test, I see a hang in the 4th
sub-test, "sync-gfx0". Previously I've seen it get stuck in the 6th
sub-test, "fork-gfx0".

The stack of the hung task generally looks as follows:

[<0>] ttm_eu_reserve_buffers+0xe7/0x2c0 [ttm]
[<0>] amdgpu_gem_va_ioctl+0x31c/0x540 [amdgpu]
[<0>] drm_ioctl_kernel+0x8c/0x120 [drm]
[<0>] drm_ioctl+0x220/0x3e0 [drm]
[<0>] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[<0>] __x64_sys_ioctl+0x82/0xb0
[<0>] do_syscall_64+0x3b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

That is, the task appears to be blocked somewhere along a chain like
ttm_eu_reserve_buffers() --> ttm_bo_reserve() --> ... -->
dma_resv_lock() --> ww_mutex_lock().

However, I don't observe such hangs during normal use of the system, only
when running the IGT amd_cs_nop test.
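
As a rough userspace illustration of the general idea named in the patch
subject below (protecting a shared list with a mutex), here is a minimal
sketch using pthreads. It is not the patch itself and not kernel code:
struct demo_bo_list and the demo_* helpers are hypothetical names made up
for this example, and the in-place increment merely stands in for whatever
in-place processing concurrent users of a shared list might do.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_ENTRIES 64
#define DEMO_ROUNDS  10000

/* Hypothetical userspace stand-in for a shared buffer-object list. */
struct demo_bo_list {
	pthread_mutex_t lock;		/* serializes all users of entries[] */
	unsigned int num_entries;
	unsigned long entries[DEMO_ENTRIES];
};

/* Modify every entry in place while holding the list mutex. */
static void demo_list_process(struct demo_bo_list *list)
{
	unsigned int i;

	pthread_mutex_lock(&list->lock);
	for (i = 0; i < list->num_entries; i++)
		list->entries[i]++;	/* read-modify-write, racy without the lock */
	pthread_mutex_unlock(&list->lock);
}

/* Thread body: process the shared list DEMO_ROUNDS times. */
static void *demo_worker(void *arg)
{
	struct demo_bo_list *list = arg;
	int round;

	for (round = 0; round < DEMO_ROUNDS; round++)
		demo_list_process(list);
	return NULL;
}

/* Run two concurrent workers; return 1 if every entry saw every update. */
static int demo_run(void)
{
	struct demo_bo_list list = { .num_entries = DEMO_ENTRIES };
	pthread_t a, b;
	unsigned int i;
	int rc;

	rc = pthread_mutex_init(&list.lock, NULL);
	assert(rc == 0);
	rc = pthread_create(&a, NULL, demo_worker, &list);
	assert(rc == 0);
	rc = pthread_create(&b, NULL, demo_worker, &list);
	assert(rc == 0);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_mutex_destroy(&list.lock);

	for (i = 0; i < list.num_entries; i++)
		if (list.entries[i] != 2UL * DEMO_ROUNDS)
			return 0;
	return 1;
}
```

Calling demo_run() from a small main() (built with -pthread) exercises the
sketch. With the lock/unlock pair removed from demo_list_process(), the two
workers may lose updates, which is the userspace analogue of two users
corrupting a shared list.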

Luben Tuikov (1):
  drm/amdgpu: Protect the amdgpu_bo_list list with a mutex

 drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h |  4 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      | 31 +++++++++++++++++++--
 3 files changed, 35 insertions(+), 3 deletions(-)

Suggested-by: Christian König <christian.koenig@xxxxxxx>
Cc: Alex Deucher <Alexander.Deucher@xxxxxxx>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx>
Cc: Vitaly Prosyak <Vitaly.Prosyak@xxxxxxx>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2048
Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx>

base-commit: ab7e60938be74e21c723223e7eb96cac7b441e5e
-- 
2.36.1.74.g277cf0bc36