Re: [PATCH] drm/amdgpu: Fix vm free pts race when process exiting

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2021-08-04 5:01 a.m., Christian König wrote:
Am 03.08.21 um 00:33 schrieb Philip Yang:
Take vm->invalidated_lock spinlock to remove vm pd and pt bos, to avoid
link list corruption with crash backtrace:

[ 2290.505111] list_del corruption. next->prev should be
  ffff9b2514ec0018, but was 4e03280211010f04
[ 2290.505154] kernel BUG at lib/list_debug.c:56!
[ 2290.505176] invalid opcode: 0000 [#1] SMP NOPTI
[ 2290.505254] Workqueue: events delayed_fput
[ 2290.505271] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c
[ 2290.505513] Call Trace:
[ 2290.505623]  amdgpu_vm_free_table+0x26/0x80 [amdgpu]
[ 2290.505705]  amdgpu_vm_free_pts+0x7a/0xf0 [amdgpu]
[ 2290.505786]  amdgpu_vm_fini+0x1f0/0x440 [amdgpu]
[ 2290.505864]  amdgpu_driver_postclose_kms+0x172/0x290 [amdgpu]
[ 2290.505893]  drm_file_free.part.10+0x1d4/0x270 [drm]
[ 2290.505916]  drm_release+0xa9/0xe0 [drm]
[ 2290.505930]  __fput+0xb7/0x230
[ 2290.505942]  delayed_fput+0x1c/0x30
[ 2290.505957]  process_one_work+0x1a7/0x360
[ 2290.505971]  worker_thread+0x30/0x390
[ 2290.505985]  ? create_worker+0x1a0/0x1a0
[ 2290.505999]  kthread+0x112/0x130
[ 2290.506011]  ? kthread_flush_work_fn+0x10/0x10
[ 2290.506027]  ret_from_fork+0x22/0x40

Wow, well this is a big NAK.

The page tables should never ever be on the invalidation list or otherwise we would try to point PTEs to them which is a huge security issue.

Taking the lock just workaround that. Can you investigate how it happens that a page table ends up on that list?

I will be off 3 days, I can investigate this further next Monday. From debug, amdgpu_vm_bo_invalidate is called many times. The background: app has 8 processes, running on 8 GPUs, 64 VMs, VRAM usage is around 85% from rocm-info, 1/5 chance this happens (kernel BUG, server die) after application segmentation fault crash (another app issue). It is new issue on rocm release 4.3, never happened before on rocm 4.1 (no app crash on 4.1 either). kernel version is 4.18 on CentOS.

Regards,

Philip


Thanks in advance,
Christian.


Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 2a88ed5d983b..5c4c355e7d6b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1045,7 +1045,9 @@ static void amdgpu_vm_free_table(struct amdgpu_vm_bo_base *entry)
          return;
      shadow = amdgpu_bo_shadowed(entry->bo);
      entry->bo->vm_bo = NULL;
+    spin_lock(&entry->vm->invalidated_lock);
      list_del(&entry->vm_status);
+    spin_unlock(&entry->vm->invalidated_lock);
      amdgpu_bo_unref(&shadow);
      amdgpu_bo_unref(&entry->bo);
  }


[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux