Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2022-08-11 11:34, Kim, Jonathan wrote:
[Public]

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: August 11, 2022 11:19 AM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kim, Jonathan <Jonathan.Kim@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
leak

Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
When an xgmi node is added to the hive, it takes another hive
reference for its reset domain.

This extra reference was not dropped on device removal from the
hive so drop it.

Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 1b108d03e785..560bf1c98f08 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct
amdgpu_device *adev)
     mutex_unlock(&hive->hive_lock);

     amdgpu_put_xgmi_hive(hive);
+   /* device is removed from the hive so remove its reset domain
reference */
+   if (adev->reset_domain && adev->reset_domain == hive-
reset_domain)
+           amdgpu_put_xgmi_hive(hive);
This is some messed up reference counting. If you need an extra
reference from the reset_domain to the hive, that should be owned by the
reset_domain and dropped when the reset_domain is destroyed. And it's
only one reference for the reset_domain, not one reference per adev in
the reset_domain.
Cc'ing Andrey.

What you're saying seems to make more sense to me, but what I got from an offline conversation with Andrey
was that the reset domain reference per device was intentional.
Maybe Andrey can comment here.

What you're doing here looks like every adev that's in a reset_domain of
its hive has two references to the hive. And if you're dropping the
extra reference here, it still leaves the reset_domain with a dangling
pointer to a hive that may no longer exist. So this extra reference is
kind of pointless.


reset_domain doesn't have any references to the hive, the hive has a reference to reset_domain


Yes.  Currently one reference is fetched from the device's lifetime on the hive and the other is from the
per-device reset domain.

Snippet from amdgpu_device_ip_init:
         /**
          * In case of XGMI grab extra reference for reset domain for this device
          */
         if (adev->gmc.xgmi.num_physical_nodes > 1) {
                 if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here


amdgpu_xgmi_add_device calls  amdgpu_get_xgmi_hive and only on the first time amdgpu_get_xgmi_hive is called and hive is actually allocated and initialized  will we proceed to creating the reset domain either from scratch (first creation of the hive) or by taking reference from adev (see [1])



[1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c#L394

                         struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again


So here I don't see how an extra reference to reset_domain is taken if amdgpu_get_xgmi_hive returns early since the hive already created and exists in the global hive container ?

Johantan - can u please show the exact flow how recount leak on reset_domain is happening ?

Andrey



                         if (!hive->reset_domain ||
                             !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
                                 r = -ENOENT;
                                 goto init_failed;
                         }

                         /* Drop the early temporary reset domain we created for device */
                         amdgpu_reset_put_reset_domain(adev->reset_domain);
                         adev->reset_domain = hive->reset_domain;
                 }
         }

One of these never gets dropped so a leak happens.
So either the extra reference has to be dropped on device removal from the hive or from what you've mentioned,
the reset_domain reference fetch should be fixed to grab at the hive/reset_domain level.

Thanks,

Jon

Regards,
    Felix


     adev->hive = NULL;

     if (atomic_dec_return(&hive->number_devices) == 0) {



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux