Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2022-08-17 10:01, Andrey Grodzovsky wrote:

On 2022-08-17 09:44, Alex Deucher wrote:
On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas <YiPeng.Chai@xxxxxxx> wrote:
[AMD Official Use Only - General]

Hi Alex:
   When removing an amdgpu device, it may be difficult to change the order of psp_hw_fini calls.

1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove function,  which makes the gpu device inaccessible for userspace operations.  If the call to psp_hw_fini was moved before drm_dev_unplug,  userspace could access the gpu device but the psp might be removing. It has unknown issues.

+Andrey Grodzovsky

We should fix the ordering in amdgpu_pci_remove() then I guess? There
are lots of places where drm_dev_enter() is used to protect access to
the hardware which could be similarly affected.

Alex


We probably can try to move drm_dev_unplug after amdgpu_driver_unload_kms. I don't remember now why drm_dev_unplug must be the first thing we do in amdgpu_pci_remove and what impact it will have but maybe give it a try. Also see if you can run libdrm hotplug test suite before and after the change.

Andrey


Thinking a bit more about this - i guess the main problem with this will be that in case of real hot unplug (which is hard to test unless you have a real GPU cage with extenal GPU) this move will cause trying to accesses HW registers or MMIO ranges from VRAM BOs when HW is missing when you try to shut down the HW in HW fini IP block specific callbacks , this in turn will return garbage for reads (or all 1s maybe) which is what we probably were trying to avoid by putting drm_dev_unplug as the first thing. So it's probably a bit problematic.

Andrey





2. psp_hw_fini is called by the .hw_fini iterator in amdgpu_device_ip_fini_early, referring to the code starting from amdgpu_pci_remove to .hw_fini is called,     there are many preparatory operations before calling .hw_fini,  which makes it very difficult to change the order of psp_hw_fini or all block .hw_fini.

    So can we do a workaround in psp_cmd_submit_buf when removing amdgpu device?

-----Original Message-----
From: Alex Deucher <alexdeucher@xxxxxxxxx>
Sent: Monday, August 15, 2022 10:22 PM
To: Chai, Thomas <YiPeng.Chai@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx> Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai <YiPeng.Chai@xxxxxxx> wrote:
The psp_cmd_submit_buf function is called by psp_hw_fini to send TA
unload messages to psp to terminate ras, asd and tmr.
But when amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
         drm_dev_unplug
         ......
amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
                 ->.hw_fini->psp_hw_fini->...
                 ->psp_ta_unload->psp_cmd_submit_buf
         ......
}
The program will return when calling drm_dev_enter in
psp_cmd_submit_buf.

So the call to drm_dev_enter in psp_cmd_submit_buf should be removed,
so that the TA unload messages can be sent to the psp when amdgpu is
uninstalled.

Signed-off-by: YiPeng Chai <YiPeng.Chai@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 ----
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index b067ce45d226..0578d8d094a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
         if (psp->adev->no_hw_access)
                 return 0;

-       if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
-               return 0;
-
This check is to prevent the hardware from being accessed if the card is removed.  I think we need to fix the ordering elsewhere.

Alex

         memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);

         memcpy(psp->cmd_buf_mem, cmd, sizeof(struct
psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
         }

  exit:
-       drm_dev_exit(idx);
         return ret;
  }

--
2.25.1




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux