On 2019-09-09 8:03 a.m., Christian König wrote: > Am 04.09.19 um 22:12 schrieb Yang, Philip: >> This series looks nice and clear for me, two questions embedded below. >> >> Are we going to use dedicated sdma page queue for direct VM update path >> during a fault? >> >> Thanks, >> Philip >> >> On 2019-09-04 11:02 a.m., Christian König wrote: >>> Next step towards HMM support. For now just silence the retry fault and >>> optionally redirect the request to the dummy page. >>> >>> v2: make sure the VM is not destroyed while we handle the fault. >>> >>> Signed-off-by: Christian König <christian.koenig@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 74 >>> ++++++++++++++++++++++++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 + >>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++ >>> 3 files changed, 80 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> index 951608fc1925..410d89966a66 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> @@ -3142,3 +3142,77 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm >>> *vm) >>> } >>> } >>> } >>> + >>> +/** >>> + * amdgpu_vm_handle_fault - graceful handling of VM faults. >>> + * @adev: amdgpu device pointer >>> + * @pasid: PASID of the VM >>> + * @addr: Address of the fault >>> + * >>> + * Try to gracefully handle a VM fault. Return true if the fault was >>> handled and >>> + * shouldn't be reported any more. >>> + */ >>> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int >>> pasid, >>> + uint64_t addr) >>> +{ >>> + struct amdgpu_ring *ring = &adev->sdma.instance[0].page; >>> + struct amdgpu_bo *root; >>> + uint64_t value, flags; >>> + struct amdgpu_vm *vm; >>> + long r; >>> + >>> + if (!ring->sched.ready) >>> + return false; >>> + >>> + spin_lock(&adev->vm_manager.pasid_lock); >>> + vm = idr_find(&adev->vm_manager.pasid_idr, pasid); >>> + if (vm) >>> + root = amdgpu_bo_ref(vm->root.base.bo); >>> + else >>> + root = NULL; >>> + spin_unlock(&adev->vm_manager.pasid_lock); >>> + >>> + if (!root) >>> + return false; >>> + >>> + r = amdgpu_bo_reserve(root, true); >>> + if (r) >>> + goto error_unref; >>> + >>> + spin_lock(&adev->vm_manager.pasid_lock); >>> + vm = idr_find(&adev->vm_manager.pasid_idr, pasid); >>> + spin_unlock(&adev->vm_manager.pasid_lock); >>> + >> Here get vm from pasid second time, and check if PD bo is changed, is >> this to handle vm fault race with vm destory? > > Yes, exactly. > >> >>> + if (!vm || vm->root.base.bo != root) >>> + goto error_unlock; >>> + >>> + addr /= AMDGPU_GPU_PAGE_SIZE; >>> + flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED | >>> + AMDGPU_PTE_SYSTEM; >>> + >>> + if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) { >>> + /* Redirect the access to the dummy page */ >>> + value = adev->dummy_page_addr; >>> + flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE | >>> + AMDGPU_PTE_WRITEABLE; >>> + } else { >>> + value = 0; >>> + } >>> + >>> + r = amdgpu_vm_bo_update_mapping(adev, vm, true, NULL, addr, addr >>> + 1, >>> + flags, value, NULL, NULL); >>> + if (r) >>> + goto error_unlock; >>> + >> After fault address redirect to dummy page, will the fault recover and >> retry continue to execute? > > Yes, the read/write operation will just retry and use the value from the > dummy page instead. > >> Is this dangerous to update PTE to use system >> memory address 0? > > What are you talking about? The dummy page is a page allocate by TTM > where we redirect faulty accesses to. > For amdgpu_vm_fault_stop equals to AMDGPU_VM_FAULT_STOP_FIRST/ALWAYS case, value is 0, this will redirect to system memory 0. Maybe redirect is only needed for AMDGPU_VM_FAULT_STOP_NEVER? Regards, Philip > Regards, > Christian. > >> >>> + r = amdgpu_vm_update_pdes(adev, vm, true); >>> + >>> +error_unlock: >>> + amdgpu_bo_unreserve(root); >>> + if (r < 0) >>> + DRM_ERROR("Can't handle page fault (%ld)\n", r); >>> + >>> +error_unref: >>> + amdgpu_bo_unref(&root); >>> + >>> + return false; >>> +} >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>> index 0a97dc839f3b..4dbbe1b6b413 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>> @@ -413,6 +413,8 @@ void amdgpu_vm_check_compute_bug(struct >>> amdgpu_device *adev); >>> void amdgpu_vm_get_task_info(struct amdgpu_device *adev, unsigned >>> int pasid, >>> struct amdgpu_task_info *task_info); >>> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int >>> pasid, >>> + uint64_t addr); >>> void amdgpu_vm_set_task_info(struct amdgpu_vm *vm); >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>> index 9d15679df6e0..15a1ce51befa 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>> @@ -353,6 +353,10 @@ static int gmc_v9_0_process_interrupt(struct >>> amdgpu_device *adev, >>> } >>> /* If it's the first fault for this address, process it >>> normally */ >>> + if (retry_fault && !in_interrupt() && >>> + amdgpu_vm_handle_fault(adev, entry->pasid, addr)) >>> + return 1; /* This also prevents sending it to KFD */ >>> + >>> if (!amdgpu_sriov_vf(adev)) { >>> /* >>> * Issue a dummy read to wait for the status register to >>> > _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx