[Public] > -----Original Message----- > From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > Sent: August 11, 2022 11:19 AM > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kim, Jonathan <Jonathan.Kim@xxxxxxx> > Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference > leak > > Am 2022-08-11 um 09:42 schrieb Jonathan Kim: > > When an xgmi node is added to the hive, it takes another hive > > reference for its reset domain. > > > > This extra reference was not dropped on device removal from the > > hive so drop it. > > > > Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > > index 1b108d03e785..560bf1c98f08 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > > @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct > amdgpu_device *adev) > > mutex_unlock(&hive->hive_lock); > > > > amdgpu_put_xgmi_hive(hive); > > + /* device is removed from the hive so remove its reset domain > reference */ > > + if (adev->reset_domain && adev->reset_domain == hive- > >reset_domain) > > + amdgpu_put_xgmi_hive(hive); > > This is some messed up reference counting. If you need an extra > reference from the reset_domain to the hive, that should be owned by the > reset_domain and dropped when the reset_domain is destroyed. And it's > only one reference for the reset_domain, not one reference per adev in > the reset_domain. Cc'ing Andrey. What you're saying seems to make more sense to me, but what I got from an offline conversation with Andrey was that the reset domain reference per device was intentional. Maybe Andrey can comment here. > > What you're doing here looks like every adev that's in a reset_domain of > its hive has two references to the hive. And if you're dropping the > extra reference here, it still leaves the reset_domain with a dangling > pointer to a hive that may no longer exist. So this extra reference is > kind of pointless. Yes. Currently one reference is fetched from the device's lifetime on the hive and the other is from the per-device reset domain. Snippet from amdgpu_device_ip_init: /** * In case of XGMI grab extra reference for reset domain for this device */ if (adev->gmc.xgmi.num_physical_nodes > 1) { if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again if (!hive->reset_domain || !amdgpu_reset_get_reset_domain(hive->reset_domain)) { r = -ENOENT; goto init_failed; } /* Drop the early temporary reset domain we created for device */ amdgpu_reset_put_reset_domain(adev->reset_domain); adev->reset_domain = hive->reset_domain; } } One of these never gets dropped so a leak happens. So either the extra reference has to be dropped on device removal from the hive or from what you've mentioned, the reset_domain reference fetch should be fixed to grab at the hive/reset_domain level. Thanks, Jon > > Regards, > Felix > > > > adev->hive = NULL; > > > > if (atomic_dec_return(&hive->number_devices) == 0) {