Am 2021-08-04 um 5:01 a.m. schrieb Christian König: > Am 03.08.21 um 00:33 schrieb Philip Yang: >> Take vm->invalidated_lock spinlock to remove vm pd and pt bos, to avoid >> link list corruption with crash backtrace: >> >> [ 2290.505111] list_del corruption. next->prev should be >> ffff9b2514ec0018, but was 4e03280211010f04 >> [ 2290.505154] kernel BUG at lib/list_debug.c:56! >> [ 2290.505176] invalid opcode: 0000 [#1] SMP NOPTI >> [ 2290.505254] Workqueue: events delayed_fput >> [ 2290.505271] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c >> [ 2290.505513] Call Trace: >> [ 2290.505623] amdgpu_vm_free_table+0x26/0x80 [amdgpu] >> [ 2290.505705] amdgpu_vm_free_pts+0x7a/0xf0 [amdgpu] >> [ 2290.505786] amdgpu_vm_fini+0x1f0/0x440 [amdgpu] >> [ 2290.505864] amdgpu_driver_postclose_kms+0x172/0x290 [amdgpu] >> [ 2290.505893] drm_file_free.part.10+0x1d4/0x270 [drm] >> [ 2290.505916] drm_release+0xa9/0xe0 [drm] >> [ 2290.505930] __fput+0xb7/0x230 >> [ 2290.505942] delayed_fput+0x1c/0x30 >> [ 2290.505957] process_one_work+0x1a7/0x360 >> [ 2290.505971] worker_thread+0x30/0x390 >> [ 2290.505985] ? create_worker+0x1a0/0x1a0 >> [ 2290.505999] kthread+0x112/0x130 >> [ 2290.506011] ? kthread_flush_work_fn+0x10/0x10 >> [ 2290.506027] ret_from_fork+0x22/0x40 > > Wow, well this is a big NAK. > > The page tables should never ever be on the invalidation list or > otherwise we would try to point PTEs to them which is a huge security > issue. entry->vm_status is used on other lists as well, and I think page tables can be added to those when they are evicted: static void amdgpu_vm_bo_evicted(struct amdgpu_vm_bo_base *vm_bo) { struct amdgpu_vm *vm = vm_bo->vm; struct amdgpu_bo *bo = vm_bo->bo; vm_bo->moved = true; if (bo->tbo.type == ttm_bo_type_kernel) list_move(&vm_bo->vm_status, &vm->evicted); else list_move_tail(&vm_bo->vm_status, &vm->evicted); } But that never seems to involve the invalidated_lock. Maybe there is some other lock we should be holding somewhere (not necessarily in amdgpu_vm_free_table) to avoid this list corruption. Regards, Felix > > Taking the lock just workaround that. Can you investigate how it > happens that a page table ends up on that list? > > Thanks in advance, > Christian. > >> >> Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> index 2a88ed5d983b..5c4c355e7d6b 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> @@ -1045,7 +1045,9 @@ static void amdgpu_vm_free_table(struct >> amdgpu_vm_bo_base *entry) >> return; >> shadow = amdgpu_bo_shadowed(entry->bo); >> entry->bo->vm_bo = NULL; >> + spin_lock(&entry->vm->invalidated_lock); >> list_del(&entry->vm_status); >> + spin_unlock(&entry->vm->invalidated_lock); >> amdgpu_bo_unref(&shadow); >> amdgpu_bo_unref(&entry->bo); >> } >