[AMD Official Use Only - General]

Add @Zhang, Horatio

Gfx11 should be addressed by Horatio's patch; I'm not sure whether he has committed it yet. The solution is to retire the cp_ecc_irq funcs, since gfx11 doesn't rely on the irq for any software ras feature. Gfx9 could still add the RAS block check, since we have a legacy ras feature that needs the interrupt.

Hi Horatio,

Did you commit your fix yet?

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
Sent: Monday, May 8, 2023 10:16
To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu/gfx: disable cp_ecc_error_irq only when gfx ras is enabled in suspend

[AMD Official Use Only - General]

Reviewed-by: Tao Zhou <tao.zhou1@xxxxxxx>

> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen@xxxxxxx>
> Sent: Saturday, May 6, 2023 8:16 PM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>;
> Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>;
> Koenig, Christian <Christian.Koenig@xxxxxxx>
> Cc: Chen, Guchun <Guchun.Chen@xxxxxxx>
> Subject: [PATCH] drm/amdgpu/gfx: disable cp_ecc_error_irq only when
> gfx ras is enabled in suspend
>
> cp_ecc_error_irq is only enabled when gfx ras is asserted.
> So in gfx_v9_0_hw_fini, the interrupt disablement for cp_ecc_error_irq
> should be executed only under that condition; otherwise, an
> amdgpu_irq_put calltrace will occur.
>
> [ 7283.170322] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
> [ 7283.170964] RSP: 0018:ffff9a5fc3967d00 EFLAGS: 00010246
> [ 7283.170967] RAX: ffff98d88afd3040 RBX: ffff98d89da20000 RCX: 0000000000000000
> [ 7283.170969] RDX: 0000000000000000 RSI: ffff98d89da2bef8 RDI: ffff98d89da20000
> [ 7283.170971] RBP: ffff98d89da20000 R08: ffff98d89da2ca18 R09: 0000000000000006
> [ 7283.170973] R10: ffffd5764243c008 R11: 0000000000000000 R12: 0000000000001050
> [ 7283.170975] R13: ffff98d89da38978 R14: ffffffff999ae15a R15: ffff98d880130105
> [ 7283.170978] FS: 0000000000000000(0000) GS:ffff98d996f00000(0000) knlGS:0000000000000000
> [ 7283.170981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7283.170983] CR2: 00000000f7a9d178 CR3: 00000001c42ea000 CR4: 00000000003506e0
> [ 7283.170986] Call Trace:
> [ 7283.170988] <TASK>
> [ 7283.170989] gfx_v9_0_hw_fini+0x1c/0x6d0 [amdgpu]
> [ 7283.171655] amdgpu_device_ip_suspend_phase2+0x101/0x1a0 [amdgpu]
> [ 7283.172245] amdgpu_device_suspend+0x103/0x180 [amdgpu]
> [ 7283.172823] amdgpu_pmops_freeze+0x21/0x60 [amdgpu]
> [ 7283.173412] pci_pm_freeze+0x54/0xc0
> [ 7283.173419] ? __pfx_pci_pm_freeze+0x10/0x10
> [ 7283.173425] dpm_run_callback+0x98/0x200
> [ 7283.173430] __device_suspend+0x164/0x5f0
>
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2522
>
> Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 ++-
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index ecf8ceb53311..f6bc62a94099 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -4442,7 +4442,8 @@ static int gfx_v11_0_hw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  	int r;
>
> -	amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
> +	if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX))
> +		amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
>  	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>  	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index ae09fc1cfe6b..c54d05bdc2d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3751,7 +3751,8 @@ static int gfx_v9_0_hw_fini(void *handle)
>  {
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>
> -	amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
> +	if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX))
> +		amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
>  	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>  	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
>
> --
> 2.25.1
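
For readers following the thread, the imbalance the patch describes can be illustrated with a small standalone userspace model. This is only a sketch under stated assumptions: every name in it (toy_irq_src, toy_irq_get, toy_irq_put, toy_dev, toy_late_init, toy_hw_fini_unguarded, toy_hw_fini_guarded) is hypothetical, and it does not reproduce the actual amdgpu code. It assumes the interrupt source behaves like a reference count that is only taken when gfx ras is supported, so a teardown that always drops the reference is unbalanced on devices without gfx ras.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model only: all names are hypothetical, not the amdgpu API. */
struct toy_irq_src {
	int refcount;	/* outstanding "get" calls on this source */
};

struct toy_dev {
	bool gfx_ras_supported;
	struct toy_irq_src cp_ecc_error_irq;
};

/* Take a reference on the interrupt source. */
void toy_irq_get(struct toy_irq_src *src)
{
	assert(src != NULL);
	src->refcount++;
}

/*
 * Drop a reference. Returns -1 and prints a message when no reference
 * is held; this stands in for the warning seen in the call trace.
 */
int toy_irq_put(struct toy_irq_src *src)
{
	assert(src != NULL);
	if (src->refcount == 0) {
		fprintf(stderr, "toy_irq_put: unbalanced put\n");
		return -1;
	}
	src->refcount--;
	return 0;
}

/* Assumed init path: the ECC interrupt is taken only with gfx ras. */
void toy_late_init(struct toy_dev *dev)
{
	if (dev->gfx_ras_supported)
		toy_irq_get(&dev->cp_ecc_error_irq);
}

/* Teardown before the patch: the put is unconditional. */
int toy_hw_fini_unguarded(struct toy_dev *dev)
{
	return toy_irq_put(&dev->cp_ecc_error_irq);
}

/* Teardown after the patch: the put mirrors the condition of the get. */
int toy_hw_fini_guarded(struct toy_dev *dev)
{
	if (dev->gfx_ras_supported)
		return toy_irq_put(&dev->cp_ecc_error_irq);
	return 0;
}
```

With gfx_ras_supported set to false, calling toy_late_init() followed by toy_hw_fini_unguarded() hits the unbalanced put, while toy_hw_fini_guarded() returns 0 without touching the count. With it set to true, the guarded teardown drops the single reference taken at init.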