Am 04.08.21 um 17:11 schrieb philip yang:
On 2021-08-04 5:01 a.m., Christian König wrote:
Am 03.08.21 um 00:33 schrieb Philip Yang:
Take vm->invalidated_lock spinlock to remove vm pd and pt bos, to avoid
link list corruption with crash backtrace:
[ 2290.505111] list_del corruption. next->prev should be
ffff9b2514ec0018, but was 4e03280211010f04
[ 2290.505154] kernel BUG at lib/list_debug.c:56!
[ 2290.505176] invalid opcode: 0000 [#1] SMP NOPTI
[ 2290.505254] Workqueue: events delayed_fput
[ 2290.505271] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c
[ 2290.505513] Call Trace:
[ 2290.505623] amdgpu_vm_free_table+0x26/0x80 [amdgpu]
[ 2290.505705] amdgpu_vm_free_pts+0x7a/0xf0 [amdgpu]
[ 2290.505786] amdgpu_vm_fini+0x1f0/0x440 [amdgpu]
[ 2290.505864] amdgpu_driver_postclose_kms+0x172/0x290 [amdgpu]
[ 2290.505893] drm_file_free.part.10+0x1d4/0x270 [drm]
[ 2290.505916] drm_release+0xa9/0xe0 [drm]
[ 2290.505930] __fput+0xb7/0x230
[ 2290.505942] delayed_fput+0x1c/0x30
[ 2290.505957] process_one_work+0x1a7/0x360
[ 2290.505971] worker_thread+0x30/0x390
[ 2290.505985] ? create_worker+0x1a0/0x1a0
[ 2290.505999] kthread+0x112/0x130
[ 2290.506011] ? kthread_flush_work_fn+0x10/0x10
[ 2290.506027] ret_from_fork+0x22/0x40
Wow, well this is a big NAK.
The page tables should never ever be on the invalidation list or
otherwise we would try to point PTEs to them which is a huge security
issue.
Taking the lock just workaround that. Can you investigate how it
happens that a page table ends up on that list?
I will be off 3 days, I can investigate this further next Monday. From
debug, amdgpu_vm_bo_invalidate is called many times. The background:
app has 8 processes, running on 8 GPUs, 64 VMs, VRAM usage is around
85% from rocm-info, 1/5 chance this happens (kernel BUG, server die)
after application segmentation fault crash (another app issue). It is
new issue on rocm release 4.3, never happened before on rocm 4.1 (no
app crash on 4.1 either). kernel version is 4.18 on CentOS.
It could be a bug in the VM code. I'm on vacation till 16th of August,
but I will try to take a look when I have time.
Thanks for the notice,
Christian.
Regards,
Philip
Thanks in advance,
Christian.
Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 2a88ed5d983b..5c4c355e7d6b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1045,7 +1045,9 @@ static void amdgpu_vm_free_table(struct
amdgpu_vm_bo_base *entry)
return;
shadow = amdgpu_bo_shadowed(entry->bo);
entry->bo->vm_bo = NULL;
+ spin_lock(&entry->vm->invalidated_lock);
list_del(&entry->vm_status);
+ spin_unlock(&entry->vm->invalidated_lock);
amdgpu_bo_unref(&shadow);
amdgpu_bo_unref(&entry->bo);
}