Because the page-table BO free is delayed, the BO queued for freeing can be
reused before the free work actually runs; freeing it afterwards leaves the
GPU accessing freed memory, which causes an unexpected page fault and another
round of svm_range_restore_pages. In detail:

1. The driver wants to free the pt BO in the code path below, but the free is
   deferred via schedule_work(&vm->pt_free_work):

[   92.276838] Call Trace:
[   92.276841]  dump_stack+0x63/0xa0
[   92.276887]  amdgpu_vm_pt_free_list+0xfb/0x120 [amdgpu]
[   92.276932]  amdgpu_vm_update_range+0x69c/0x8e0 [amdgpu]
[   92.276990]  svm_range_unmap_from_gpus+0x112/0x310 [amdgpu]
[   92.277046]  svm_range_cpu_invalidate_pagetables+0x725/0x780 [amdgpu]
[   92.277050]  ? __alloc_pages_nodemask+0x19f/0x3e0
[   92.277051]  mn_itree_invalidate+0x72/0xc0
[   92.277052]  __mmu_notifier_invalidate_range_start+0x48/0x60
[   92.277054]  migrate_vma_collect+0xf6/0x100
[   92.277055]  migrate_vma_setup+0xcf/0x120
[   92.277109]  svm_migrate_ram_to_vram+0x256/0x6b0 [amdgpu]

2. svm_range_map_to_gpu->amdgpu_vm_update_range updates the page table and
   reuses the same entry BO that step 1 queued for freeing.

3. pt_free_work then runs and frees the BO. GPU accesses through the freed
   pt BO trigger a page fault, which calls svm_range_restore_pages again.

Fix this by queuing the free work on a dedicated workqueue and flushing that
workqueue before every page-table update.
Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h       | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    | 8 ++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c | 2 +-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 93c352b08969..cbf68ad1c8d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1188,6 +1188,7 @@ struct amdgpu_device {
 	struct mutex	enforce_isolation_mutex;

 	struct amdgpu_init_level *init_lvl;
+	struct workqueue_struct *wq;
 };

 static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9d6ffe38b48a..4718074613fe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -982,6 +982,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	 */
 	flush_tlb |= amdgpu_ip_version(adev, GC_HWIP, 0) < IP_VERSION(9, 0, 0);

+	flush_workqueue(adev->wq);
 	memset(&params, 0, sizeof(params));
 	params.adev = adev;
 	params.vm = vm;
@@ -2607,7 +2608,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);

 	flush_work(&vm->pt_free_work);
-
+	cancel_work_sync(&vm->pt_free_work);
 	root = amdgpu_bo_ref(vm->root.bo);
 	amdgpu_bo_reserve(root, true);
 	amdgpu_vm_put_task_info(vm->task_info);
@@ -2708,6 +2709,8 @@ void amdgpu_vm_manager_init(struct amdgpu_device *adev)
 #endif

 	xa_init_flags(&adev->vm_manager.pasids, XA_FLAGS_LOCK_IRQ);
+	adev->wq = alloc_workqueue("amdgpu_recycle",
+				   WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 16);
 }

 /**
@@ -2721,7 +2724,8 @@ void amdgpu_vm_manager_fini(struct amdgpu_device *adev)
 {
 	WARN_ON(!xa_empty(&adev->vm_manager.pasids));
 	xa_destroy(&adev->vm_manager.pasids);
-
+	flush_workqueue(adev->wq);
+	destroy_workqueue(adev->wq);
 	amdgpu_vmid_mgr_fini(adev);
 }

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
index f78a0434a48f..7543c428873b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
@@ -589,7 +589,7 @@ void amdgpu_vm_pt_free_list(struct amdgpu_device *adev,
 		spin_lock(&vm->status_lock);
 		list_splice_init(&params->tlb_flush_waitlist, &vm->pt_freed);
 		spin_unlock(&vm->status_lock);
-		schedule_work(&vm->pt_free_work);
+		queue_work(adev->wq, &vm->pt_free_work);
 		return;
 	}
-- 
2.34.1