Am 2021-09-01 um 12:59 p.m. schrieb Kim, Jonathan: > > [Public] > > > [Public] > > > I wouldn’t know if it was another bug elsewhere. > > From what I was seeing, the leak was coming from !p->xnack_enable on > the svm_range_restore_pages call. > > If it helps, I saw this on Aldebaran where a shader does some bad > memory access on purpose on a debugged ptraced child process. > On Aldebaran the XNACK mode can be changed per process. But the page fault interrupts are retry faults (until they get turned into no-retry faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults go into svm_range_restore_pages before they realize that the process in question doesn't use XNACK. The patch is Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > The vm fault prompt pops up in dmesgs and a stale KFD process appends > per run without this fix. > > I’m just assuming at this point that the IV retry bit is set but I > never confirmed that. > > > > Thanks, > > > > Jon > > *From:* Yang, Philip <Philip.Yang@xxxxxxx> > *Sent:* Wednesday, September 1, 2021 12:30 PM > *To:* Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Yang, Philip > <Philip.Yang@xxxxxxx>; Sierra Guiza, Alejandro (Alex) > <Alex.Sierra@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when xnack > disable > > > > > > On 2021-09-01 9:45 a.m., Kim, Jonathan wrote: > > [AMD Official Use Only] > > > > We were seeing process leaks on a couple of machines running > certain tests that triggered vm faults on purpose. > > I think svm_range_restore_pages gets called unconditionally on vm > fault handling (unless the retry interrupt payload bit is supposed > to be clear with xnack off)? > > > > yes, with xnack off, sh_mem_config retry should be off, retry bit is > supposed to be clear in fault interrupt vector, we should not try to > recover vm fault, just report the vm fault back to application and > evict user queues. Maybe it is another bug cause p->xnack_enabled and > sh_mem_config retry mismatch under specific condition? > > Regards, > > Philip > > Either way, this patch prevents the process leaks we seeing and is > also: > > Reviewed-by: Jonathan Kim <jonathan.kim@xxxxxxx> > <mailto:jonathan.kim@xxxxxxx> > > > > Thanks, > > > > Jon > > > > > > *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> > <mailto:amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> *On Behalf Of > *philip yang > *Sent:* Wednesday, September 1, 2021 7:30 AM > *To:* Sierra Guiza, Alejandro (Alex) <Alex.Sierra@xxxxxxx> > <mailto:Alex.Sierra@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx> > *Subject:* Re: [PATCH] drm/amdkfd: drop process ref count when > xnack disable > > > > [CAUTION: External Email] > > > > On 2021-08-31 10:41 p.m., Alex Sierra wrote: > > During svm restore pages interrupt handler, kfd_process ref count was > > never dropped when xnack was disabled. Therefore, the object was never > > released. > > Good catch, but if xnack is off, we should not get here to recover > fault. > > The fix looks good to me. > > Reviewed-by: Philip Yang <philip.yang@xxxxxxx> > <mailto:philip.yang@xxxxxxx> > > > > Signed-off-by: Alex Sierra <alex.sierra@xxxxxxx> <mailto:alex.sierra@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > > index 8f9b5b53dab5..110c46cd7fac 100644 > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > > @@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, > > } > > if (!p->xnack_enabled) { > > pr_debug("XNACK not enabled for pasid 0x%x\n", pasid); > > - return -EFAULT; > > + r = -EFAULT; > > + goto out; > > } > > svms = &p->svms; > > >