If it's not too late, please add Fixes: 2383f56bbe4a ("drm/amdkfd: page table restore through svm API") Thanks, Felix Am 2021-09-01 um 1:54 p.m. schrieb Felix Kuehling: > Am 2021-09-01 um 12:59 p.m. schrieb Kim, Jonathan: >> [Public] >> >> >> [Public] >> >> >> I wouldn’t know if it was another bug elsewhere. >> >> From what I was seeing, the leak was coming from !p->xnack_enable on >> the svm_range_restore_pages call. >> >> If it helps, I saw this on Aldebaran where a shader does some bad >> memory access on purpose on a debugged ptraced child process. >> > On Aldebaran the XNACK mode can be changed per process. But the page > fault interrupts are retry faults (until they get turned into no-retry > faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults > go into svm_range_restore_pages before they realize that the process in > question doesn't use XNACK. > > The patch is > > Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > > >> The vm fault prompt pops up in dmesgs and a stale KFD process appends >> per run without this fix. >> >> I’m just assuming at this point that the IV retry bit is set but I >> never confirmed that. >> >> >> >> Thanks, >> >> >> >> Jon >> >> *From:* Yang, Philip <Philip.Yang@xxxxxxx> >> *Sent:* Wednesday, September 1, 2021 12:30 PM >> *To:* Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Yang, Philip >> <Philip.Yang@xxxxxxx>; Sierra Guiza, Alejandro (Alex) >> <Alex.Sierra@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when xnack >> disable >> >> >> >> >> >> On 2021-09-01 9:45 a.m., Kim, Jonathan wrote: >> >> [AMD Official Use Only] >> >> >> >> We were seeing process leaks on a couple of machines running >> certain tests that triggered vm faults on purpose. >> >> I think svm_range_restore_pages gets called unconditionally on vm >> fault handling (unless the retry interrupt payload bit is supposed >> to be clear with xnack off)? >> >> >> >> yes, with xnack off, sh_mem_config retry should be off, retry bit is >> supposed to be clear in fault interrupt vector, we should not try to >> recover vm fault, just report the vm fault back to application and >> evict user queues. Maybe it is another bug cause p->xnack_enabled and >> sh_mem_config retry mismatch under specific condition? >> >> Regards, >> >> Philip >> >> Either way, this patch prevents the process leaks we seeing and is >> also: >> >> Reviewed-by: Jonathan Kim <jonathan.kim@xxxxxxx> >> <mailto:jonathan.kim@xxxxxxx> >> >> >> >> Thanks, >> >> >> >> Jon >> >> >> >> >> >> *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> >> <mailto:amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> *On Behalf Of >> *philip yang >> *Sent:* Wednesday, September 1, 2021 7:30 AM >> *To:* Sierra Guiza, Alejandro (Alex) <Alex.Sierra@xxxxxxx> >> <mailto:Alex.Sierra@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx> >> *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when >> xnack disable >> >> >> >> [CAUTION: External Email] >> >> >> >> On 2021-08-31 10:41 p.m., Alex Sierra wrote: >> >> During svm restore pages interrupt handler, kfd_process ref count was >> >> never dropped when xnack was disabled. Therefore, the object was never >> >> released. >> >> Good catch, but if xnack is off, we should not get here to recover >> fault. >> >> The fix looks good to me. >> >> Reviewed-by: Philip Yang <philip.yang@xxxxxxx> >> <mailto:philip.yang@xxxxxxx> >> >> >> >> Signed-off-by: Alex Sierra <alex.sierra@xxxxxxx> <mailto:alex.sierra@xxxxxxx> >> >> --- >> >> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++- >> >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> >> index 8f9b5b53dab5..110c46cd7fac 100644 >> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> >> @@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, >> >> } >> >> if (!p->xnack_enabled) { >> >> pr_debug("XNACK not enabled for pasid 0x%x\n", pasid); >> >> - return -EFAULT; >> >> + r = -EFAULT; >> >> + goto out; >> >> } >> >> svms = &p->svms; >> >> >>