Am 09.09.19 um 15:58 schrieb Yang, Philip: > > On 2019-09-09 8:03 a.m., Christian König wrote: >> Am 04.09.19 um 22:12 schrieb Yang, Philip: >>> This series looks nice and clear for me, two questions embedded below. >>> >>> Are we going to use dedicated sdma page queue for direct VM update path >>> during a fault? >>> >>> Thanks, >>> Philip >>> >>> On 2019-09-04 11:02 a.m., Christian König wrote: >>>> Next step towards HMM support. For now just silence the retry fault and >>>> optionally redirect the request to the dummy page. >>>> >>>> v2: make sure the VM is not destroyed while we handle the fault. >>>> >>>> Signed-off-by: Christian König <christian.koenig@xxxxxxx> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 74 >>>> ++++++++++++++++++++++++++ >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 + >>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++ >>>> 3 files changed, 80 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>> index 951608fc1925..410d89966a66 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>> @@ -3142,3 +3142,77 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm >>>> *vm) >>>> } >>>> } >>>> } >>>> + >>>> +/** >>>> + * amdgpu_vm_handle_fault - graceful handling of VM faults. >>>> + * @adev: amdgpu device pointer >>>> + * @pasid: PASID of the VM >>>> + * @addr: Address of the fault >>>> + * >>>> + * Try to gracefully handle a VM fault. Return true if the fault was >>>> handled and >>>> + * shouldn't be reported any more. >>>> + */ >>>> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int >>>> pasid, >>>> + uint64_t addr) >>>> +{ >>>> + struct amdgpu_ring *ring = &adev->sdma.instance[0].page; >>>> + struct amdgpu_bo *root; >>>> + uint64_t value, flags; >>>> + struct amdgpu_vm *vm; >>>> + long r; >>>> + >>>> + if (!ring->sched.ready) >>>> + return false; >>>> + >>>> + spin_lock(&adev->vm_manager.pasid_lock); >>>> + vm = idr_find(&adev->vm_manager.pasid_idr, pasid); >>>> + if (vm) >>>> + root = amdgpu_bo_ref(vm->root.base.bo); >>>> + else >>>> + root = NULL; >>>> + spin_unlock(&adev->vm_manager.pasid_lock); >>>> + >>>> + if (!root) >>>> + return false; >>>> + >>>> + r = amdgpu_bo_reserve(root, true); >>>> + if (r) >>>> + goto error_unref; >>>> + >>>> + spin_lock(&adev->vm_manager.pasid_lock); >>>> + vm = idr_find(&adev->vm_manager.pasid_idr, pasid); >>>> + spin_unlock(&adev->vm_manager.pasid_lock); >>>> + >>> Here get vm from pasid second time, and check if PD bo is changed, is >>> this to handle vm fault race with vm destory? >> Yes, exactly. >> >>>> + if (!vm || vm->root.base.bo != root) >>>> + goto error_unlock; >>>> + >>>> + addr /= AMDGPU_GPU_PAGE_SIZE; >>>> + flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED | >>>> + AMDGPU_PTE_SYSTEM; >>>> + >>>> + if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) { >>>> + /* Redirect the access to the dummy page */ >>>> + value = adev->dummy_page_addr; >>>> + flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE | >>>> + AMDGPU_PTE_WRITEABLE; >>>> + } else { >>>> + value = 0; >>>> + } >>>> + >>>> + r = amdgpu_vm_bo_update_mapping(adev, vm, true, NULL, addr, addr >>>> + 1, >>>> + flags, value, NULL, NULL); >>>> + if (r) >>>> + goto error_unlock; >>>> + >>> After fault address redirect to dummy page, will the fault recover and >>> retry continue to execute? >> Yes, the read/write operation will just retry and use the value from the >> dummy page instead. >> >>> Is this dangerous to update PTE to use system >>> memory address 0? >> What are you talking about? The dummy page is a page allocate by TTM >> where we redirect faulty accesses to. >> > For amdgpu_vm_fault_stop equals to AMDGPU_VM_FAULT_STOP_FIRST/ALWAYS > case, value is 0, this will redirect to system memory 0. Maybe redirect > is only needed for AMDGPU_VM_FAULT_STOP_NEVER? The value 0 doesn't redirect to system memory, it results in a silence retry when neither the R nor the W bit is set in a PTE. Regards, Christian. > > Regards, > Philip > >> Regards, >> Christian. >> >>>> + r = amdgpu_vm_update_pdes(adev, vm, true); >>>> + >>>> +error_unlock: >>>> + amdgpu_bo_unreserve(root); >>>> + if (r < 0) >>>> + DRM_ERROR("Can't handle page fault (%ld)\n", r); >>>> + >>>> +error_unref: >>>> + amdgpu_bo_unref(&root); >>>> + >>>> + return false; >>>> +} >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>> index 0a97dc839f3b..4dbbe1b6b413 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>> @@ -413,6 +413,8 @@ void amdgpu_vm_check_compute_bug(struct >>>> amdgpu_device *adev); >>>> void amdgpu_vm_get_task_info(struct amdgpu_device *adev, unsigned >>>> int pasid, >>>> struct amdgpu_task_info *task_info); >>>> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int >>>> pasid, >>>> + uint64_t addr); >>>> void amdgpu_vm_set_task_info(struct amdgpu_vm *vm); >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> index 9d15679df6e0..15a1ce51befa 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> @@ -353,6 +353,10 @@ static int gmc_v9_0_process_interrupt(struct >>>> amdgpu_device *adev, >>>> } >>>> /* If it's the first fault for this address, process it >>>> normally */ >>>> + if (retry_fault && !in_interrupt() && >>>> + amdgpu_vm_handle_fault(adev, entry->pasid, addr)) >>>> + return 1; /* This also prevents sending it to KFD */ >>>> + >>>> if (!amdgpu_sriov_vf(adev)) { >>>> /* >>>> * Issue a dummy read to wait for the status register to >>>> _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx