Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
When an xgmi node is added to the hive, it takes another hive
reference for its reset domain.
This extra reference was not dropped on device removal from the
hive so drop it.
Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 1b108d03e785..560bf1c98f08 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
mutex_unlock(&hive->hive_lock);
amdgpu_put_xgmi_hive(hive);
+ /* device is removed from the hive so remove its reset domain reference */
+ if (adev->reset_domain && adev->reset_domain == hive->reset_domain)
+ amdgpu_put_xgmi_hive(hive);
This is some messed up reference counting. If you need an extra
reference from the reset_domain to the hive, that should be owned by the
reset_domain and dropped when the reset_domain is destroyed. And it's
only one reference for the reset_domain, not one reference per adev in
the reset_domain.
What you're doing here looks like every adev that's in a reset_domain of
its hive has two references to the hive. And if you're dropping the
extra reference here, it still leaves the reset_domain with a dangling
pointer to a hive that may no longer exist. So this extra reference is
kind of pointless.
Regards,
Felix
adev->hive = NULL;
if (atomic_dec_return(&hive->number_devices) == 0) {