On 02.09.24 at 09:34, Lijo Lazar wrote:
There are cases where a device needs to be reset before it is fully initialized. One example is a driver reinstallation with a different version of PSP TOS. In such a case, if the device supports a reset in which PSP TOS is unloaded, the driver needs to reset the device first and then load the new firmware components. For devices in an XGMI hive, the reset needs to be sent to all devices in the hive. Thus the driver should first discover devices that belong to a hive with PSP support.

There is an existing delayed reset handler, but it has the following limitations:

1) It doesn't discover devices in the hive; instead it tries to do an XGMI reset for all devices registered to the mgpu struct. The mgpu struct may contain devices other than the ones that belong to a hive. Also, it doesn't work if there is more than one hive.
2) It doesn't take a reset lock, and since this is a delayed reset, that could result in unwanted hardware accesses during a reset.
3) It doesn't initialize RAS properly (left as a TODO).

This series overcomes the above limitations. Instead of marking a pending reset, init levels are introduced to specify how far initialization should proceed. In case of a pending reset, only specific hardware blocks are initialized. Further work (not done in this series) may add finer-grained controls for init levels, for example skipping features like DPM enablement, or skipping the loading of specific firmware that is not required in a minimal init scenario where the device is going to be reset anyway.

The series also adds an API interface to check whether a PSP TOS reload is required.
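The init-level idea above can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions, not the code from the series: the names `hw_block`, `init_level`, `choose_init_level` and `should_init_block` are hypothetical, and the set of blocks brought up at the minimal level is a guess made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware block identifiers; not the amdgpu IP block types. */
enum hw_block {
	HW_BLOCK_COMMON = 0,
	HW_BLOCK_GMC,
	HW_BLOCK_IH,
	HW_BLOCK_PSP,
	HW_BLOCK_SMC,
	HW_BLOCK_GFX,
	HW_BLOCK_SDMA,
	HW_BLOCK_COUNT
};

/* Hypothetical init levels: full bring-up, or a minimal one before a reset. */
enum init_level {
	INIT_LEVEL_DEFAULT = 0,
	INIT_LEVEL_MINIMAL,
	INIT_LEVEL_COUNT
};

#define BLOCK_BIT(b) (UINT32_C(1) << (b))

/* Per-level bitmask of blocks to initialize. The minimal set is an assumption. */
static const uint32_t init_level_mask[INIT_LEVEL_COUNT] = {
	[INIT_LEVEL_DEFAULT] = BLOCK_BIT(HW_BLOCK_COUNT) - 1, /* every block */
	[INIT_LEVEL_MINIMAL] = BLOCK_BIT(HW_BLOCK_COMMON) | BLOCK_BIT(HW_BLOCK_GMC) |
			       BLOCK_BIT(HW_BLOCK_IH) | BLOCK_BIT(HW_BLOCK_PSP) |
			       BLOCK_BIT(HW_BLOCK_SMC),
};

/* A pending reset selects the minimal level instead of setting a separate flag. */
enum init_level choose_init_level(bool reset_pending)
{
	return reset_pending ? INIT_LEVEL_MINIMAL : INIT_LEVEL_DEFAULT;
}

/* Returns true if the given block should be initialized at this level. */
bool should_init_block(enum init_level level, enum hw_block block)
{
	assert(level < INIT_LEVEL_COUNT && block < HW_BLOCK_COUNT);
	return (init_level_mask[level] & BLOCK_BIT(block)) != 0;
}
```

With a table like this, the init path asks one question per block rather than checking a pending-reset flag in several places, and a finer-grained level could later be added as one more table entry.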
At least from a high level that sounds totally sane, but I have no idea where to find the time to review the details.
I need to discuss that with Alex and/or Tim. Maybe I can delegate some more work.
Christian.
Lijo Lazar (10):
  drm/amdgpu: Add init levels
  drm/amdgpu: Use init level for pending_reset flag
  drm/amdgpu: Separate reinitialization after reset
  drm/amdgpu: Add reset on init handler for XGMI
  drm/amdgpu: Add helper to initialize badpage info
  drm/amdgpu: Refactor XGMI reset on init handling
  drm/amdgpu: Drop delayed reset work handler
  drm/amdgpu: Support reset-on-init on select SOCs
  drm/amdgpu: Add interface for TOS reload cases
  drm/amdgpu: Add PSP reload case to reset-on-init

 drivers/gpu/drm/amd/amdgpu/aldebaran.c        |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  21 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 245 +++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  81 ------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h       |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  13 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h       |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |  62 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     | 148 +++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h     |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c      |  72 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h      |   2 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         |  14 +-
 drivers/gpu/drm/amd/amdgpu/psp_v13_0.c        |  25 ++
 drivers/gpu/drm/amd/amdgpu/soc15.c            |   7 +
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c    |   3 +-
 17 files changed, 492 insertions(+), 214 deletions(-)