Re: [PATCH] drm/amdgpu: Update irq disable flow during unload

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 08.01.24 um 09:32 schrieb Lazar, Lijo:
On 1/8/2024 1:51 PM, Christian König wrote:
Am 08.01.24 um 09:13 schrieb Kamal, Asad:
[AMD Official Use Only - General]

Hi Christian,

Thank you for the comment.

This is not normal reset, it is reset done during unload for smu v_13_0_2.

Yeah, but this doesn't explain the rational for this.

IRQ enable/disable should be balanced in hw_init()/hw_fini(), independent of what else you do.

I'm not sure what you are trying to solve but this here is a complete no-go.


This is a special reset done during module unload by this commit - f5c7e7797060 ("drm/amdgpu: Adjust removal control flow for smu
v13_0_2"). Without this commit, it seems driver reload doesnt' work.

In this particular case, a the reset is done during unload and only resume sequence of only select IPs are done (part of the workaround in the patch). For those IPs, irqs are enabled during late_init/ras_late_init, and not during hw_init(), that part gets skipped.

Please revert that immediately, this whole approach is completely broken as far as I can see.


The module unload sequence causes a WARN trace during irq_put of those IPs during hw_fini(). Those are mix of generic irqs and ras irqs, so there is no clean way to untangle it.

This is WARN is just the tip of the iceberg here, the problem is that you are not supposed to call amdgpu_device_ip_resume_phase1() as you do in f5c7e7797060.

Please sync up with Alex how to do this cleanly.

Regards,
Christian.


One thing that could be done is to add an extra check for 13.0.2 version to make it clear that this workaround is done for only for those ASICs.

Thanks,
Lijo

Regards,
Christian.


Thanks & Regards
Asad

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Monday, January 8, 2024 1:33 PM
To: Kamal, Asad <Asad.Kamal@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Update irq disable flow during unload

Am 05.01.24 um 16:21 schrieb Asad Kamal:
In certain special cases, e.g device reset before module unload, irq
gets disabled as part of reset sequence and won't get enabled back.
Add special check to cover such scenarios
Well complete NAK to that. Resets shouldn't affect the IRQ state at all!

If this is an issue then something else is broken.

Regards,
Christian.

Signed-off-by: Asad Kamal <asad.kamal@xxxxxxx>
Suggested-by: Lijo Lazar <lijo.lazar@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 12 ++++++++++--
   drivers/gpu/drm/amd/amdgpu/soc15.c    | 13 +++++++++++--
   2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 372de9f1ce59..a4e1b9a58679 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -2361,6 +2361,7 @@ static void gmc_v9_0_gart_disable(struct amdgpu_device *adev)
   static int gmc_v9_0_hw_fini(void *handle)
   {
       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+     bool irq_release = true;

       gmc_v9_0_gart_disable(adev);

@@ -2378,9 +2379,16 @@ static int gmc_v9_0_hw_fini(void *handle)
       if (adev->mmhub.funcs->update_power_gating)
adev->mmhub.funcs->update_power_gating(adev, false);

-     amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
+     if (adev->shutdown)
+             irq_release = amdgpu_irq_enabled(adev, &adev->gmc.vm_fault, 0);

-     if (adev->gmc.ecc_irq.funcs &&
+     if (irq_release)
+             amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
+
+     if (adev->shutdown)
+             irq_release = amdgpu_irq_enabled(adev, &adev->gmc.ecc_irq, 0);
+
+     if (adev->gmc.ecc_irq.funcs && irq_release &&
               amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC))
               amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 15033efec2ba..7ee835049d57 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -1266,6 +1266,7 @@ static int soc15_common_hw_init(void *handle)
   static int soc15_common_hw_fini(void *handle)
   {
       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+     bool irq_release = true;

       /* Disable the doorbell aperture and selfring doorbell aperture
        * separately in hw_fini because soc15_enable_doorbell_aperture @@
-1280,10 +1281,18 @@ static int soc15_common_hw_fini(void *handle)

       if (adev->nbio.ras_if &&
           amdgpu_ras_is_supported(adev, adev->nbio.ras_if->block)) {
-             if (adev->nbio.ras &&
+             if (adev->shutdown)
+                     irq_release = amdgpu_irq_enabled(adev,
+&adev->nbio.ras_controller_irq, 0);
+
+             if (adev->nbio.ras && irq_release &&
adev->nbio.ras->init_ras_controller_interrupt)
                       amdgpu_irq_put(adev, &adev->nbio.ras_controller_irq, 0);
-             if (adev->nbio.ras &&
+
+             if (adev->shutdown)
+                     irq_release = amdgpu_irq_enabled(adev,
+ &adev->nbio.ras_err_event_athub_irq, 0);
+
+             if (adev->nbio.ras && irq_release &&
adev->nbio.ras->init_ras_err_event_athub_interrupt)
                       amdgpu_irq_put(adev, &adev->nbio.ras_err_event_athub_irq, 0);
       }






[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux