On 2022-08-17 10:01, Andrey Grodzovsky wrote:
On 2022-08-17 09:44, Alex Deucher wrote:
On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas <YiPeng.Chai@xxxxxxx>
wrote:
[AMD Official Use Only - General]
Hi Alex:
When removing an amdgpu device, it may be difficult to change the
order of psp_hw_fini calls.
1. The drm_dev_unplug call is at the beginning in the
amdgpu_pci_remove function, which makes the gpu device inaccessible
for userspace operations. If the call to psp_hw_fini was moved
before drm_dev_unplug, userspace could access the gpu device but
the psp might be removing. It has unknown issues.
+Andrey Grodzovsky
We should fix the ordering in amdgpu_pci_remove() then I guess? There
are lots of places where drm_dev_enter() is used to protect access to
the hardware which could be similarly affected.
Alex
We probably can try to move drm_dev_unplug after
amdgpu_driver_unload_kms. I don't remember now why drm_dev_unplug must
be the first thing we do in amdgpu_pci_remove and what impact it will
have but maybe give it a try.
Also see if you can run libdrm hotplug test suite before and after the
change.
Andrey
Thinking a bit more about this - i guess the main problem with this will
be that in case of real hot unplug (which is hard to test unless you
have a real GPU cage with extenal GPU) this move will cause trying to
accesses HW registers
or MMIO ranges from VRAM BOs when HW is missing when you try to shut
down the HW in HW fini IP block specific callbacks , this in turn will
return garbage for reads (or all 1s maybe) which is what we probably
were trying to avoid by putting drm_dev_unplug as the first thing. So
it's probably a bit problematic.
Andrey
2. psp_hw_fini is called by the .hw_fini iterator in
amdgpu_device_ip_fini_early, referring to the code starting from
amdgpu_pci_remove to .hw_fini is called,
there are many preparatory operations before calling .hw_fini,
which makes it very difficult to change the order of psp_hw_fini or
all block .hw_fini.
So can we do a workaround in psp_cmd_submit_buf when removing
amdgpu device?
-----Original Message-----
From: Alex Deucher <alexdeucher@xxxxxxxxx>
Sent: Monday, August 15, 2022 10:22 PM
To: Chai, Thomas <YiPeng.Chai@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking
<Hawking.Zhang@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>; Chai,
Thomas <YiPeng.Chai@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually
sent to psp when amdgpu is uninstalled
On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai <YiPeng.Chai@xxxxxxx>
wrote:
The psp_cmd_submit_buf function is called by psp_hw_fini to send TA
unload messages to psp to terminate ras, asd and tmr.
But when amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
drm_dev_unplug
......
amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
->.hw_fini->psp_hw_fini->...
->psp_ta_unload->psp_cmd_submit_buf
......
}
The program will return when calling drm_dev_enter in
psp_cmd_submit_buf.
So the call to drm_dev_enter in psp_cmd_submit_buf should be removed,
so that the TA unload messages can be sent to the psp when amdgpu is
uninstalled.
Signed-off-by: YiPeng Chai <YiPeng.Chai@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index b067ce45d226..0578d8d094a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
if (psp->adev->no_hw_access)
return 0;
- if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
- return 0;
-
This check is to prevent the hardware from being accessed if the
card is removed. I think we need to fix the ordering elsewhere.
Alex
memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
memcpy(psp->cmd_buf_mem, cmd, sizeof(struct
psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct
psp_context *psp,
}
exit:
- drm_dev_exit(idx);
return ret;
}
--
2.25.1