On 2019-09-04 11:02 a.m., Christian König wrote: > Next step towards HMM support. For now just silence the retry fault and > optionally redirect the request to the dummy page. > > v2: make sure the VM is not destroyed while we handle the fault. > > Signed-off-by: Christian König <christian.koenig@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 74 ++++++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 + > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++ > 3 files changed, 80 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index 951608fc1925..410d89966a66 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -3142,3 +3142,77 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm) > } > } > } > + > +/** > + * amdgpu_vm_handle_fault - graceful handling of VM faults. > + * @adev: amdgpu device pointer > + * @pasid: PASID of the VM > + * @addr: Address of the fault > + * > + * Try to gracefully handle a VM fault. Return true if the fault was handled and > + * shouldn't be reported any more. > + */ > +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, > + uint64_t addr) > +{ > + struct amdgpu_ring *ring = &adev->sdma.instance[0].page; > + struct amdgpu_bo *root; > + uint64_t value, flags; > + struct amdgpu_vm *vm; > + long r; > + > + if (!ring->sched.ready) > + return false; > + > + spin_lock(&adev->vm_manager.pasid_lock); > + vm = idr_find(&adev->vm_manager.pasid_idr, pasid); > + if (vm) > + root = amdgpu_bo_ref(vm->root.base.bo); > + else > + root = NULL; > + spin_unlock(&adev->vm_manager.pasid_lock); > + > + if (!root) > + return false; > + > + r = amdgpu_bo_reserve(root, true); > + if (r) > + goto error_unref; > + > + spin_lock(&adev->vm_manager.pasid_lock); > + vm = idr_find(&adev->vm_manager.pasid_idr, pasid); > + spin_unlock(&adev->vm_manager.pasid_lock); I think this deserves a comment. If I understand it correctly, you're looking up the vm twice so that you have the VM root reservation to protect against user-after-free. Otherwise the vm pointer is only valid as long as you're holding the spin-lock. > + > + if (!vm || vm->root.base.bo != root) The check of vm->root.base.bo should probably still be under the spin_lock. Because you're not sure yet it's the right VM, you can't rely on the reservation here to prevent use-after-free. > + goto error_unlock; > + > + addr /= AMDGPU_GPU_PAGE_SIZE; > + flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED | > + AMDGPU_PTE_SYSTEM; > + > + if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) { > + /* Redirect the access to the dummy page */ > + value = adev->dummy_page_addr; > + flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE | > + AMDGPU_PTE_WRITEABLE; > + } else { > + value = 0; > + } > + > + r = amdgpu_vm_bo_update_mapping(adev, vm, true, NULL, addr, addr + 1, > + flags, value, NULL, NULL); > + if (r) > + goto error_unlock; > + > + r = amdgpu_vm_update_pdes(adev, vm, true); > + > +error_unlock: > + amdgpu_bo_unreserve(root); > + if (r < 0) > + DRM_ERROR("Can't handle page fault (%ld)\n", r); > + > +error_unref: > + amdgpu_bo_unref(&root); > + > + return false; > +} > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > index 0a97dc839f3b..4dbbe1b6b413 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > @@ -413,6 +413,8 @@ void amdgpu_vm_check_compute_bug(struct amdgpu_device *adev); > > void amdgpu_vm_get_task_info(struct amdgpu_device *adev, unsigned int pasid, > struct amdgpu_task_info *task_info); > +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, > + uint64_t addr); > > void amdgpu_vm_set_task_info(struct amdgpu_vm *vm); > > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > index 9d15679df6e0..15a1ce51befa 100644 > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > @@ -353,6 +353,10 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device *adev, > } > > /* If it's the first fault for this address, process it normally */ > + if (retry_fault && !in_interrupt() && > + amdgpu_vm_handle_fault(adev, entry->pasid, addr)) > + return 1; /* This also prevents sending it to KFD */ The !in_interrupt() is meant to only do this on the rerouted interrupt ring that's handled by a worker function? Looks like amdgpu_vm_handle_fault never returns true for now. So we'll never get to the "return 1" here. Regards, Felix > + > if (!amdgpu_sriov_vf(adev)) { > /* > * Issue a dummy read to wait for the status register to _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx