[AMD Official Use Only - General]
Hey,
Due to mode-1 reset (pending_reset), the amdgpu_amdkfd_device_init will not be called and hence adev->kfd.init_complete will not be set. The function amdgpu_amdkfd_drm_client_create has condition:if (!adev->kfd.init_complete)return 0;So, in probe function, when we return from device_init the KFD is not initialized and amdgpu_amdkfd_drm_client_create returns without doing anything.
I think your change could result in calling amdgpu_amdkfd_drm_client_create multiple times. IIRC, one purpose of moving the call to amdgpu_pci_probe was to ensure that it is only called once, because it only gets unregistered once when the driver is unloaded. Maybe it would be better to remove the if (!adev->kfd.init_complete) condition from amdgpu_amdkfd_drm_client_create. That way we would always create the client at probe and it would be ready when it's needed after the GPU reset.
There is a chance that the client would get created unnecessarily if KFD init never succeeds. But that should be rare, and it's not a big resource waste.
There were some comments on a previous code review, that creating the DRM client too early could cause problems. But I don't understand what that problem could be. As I understand it, the adev->kfd.client is just a place to put GEM handles for KFD BOs that we don't want to expose to user mode. I see no harm in creating this client too early or when it's not needed.
Regards,
Felix
Thanks,Ahmad
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Monday, March 4, 2024 6:39 PM
To: Rehman, Ahmad <Ahmad.Rehman@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Wan, Gavin <Gavin.Wan@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload
On 2024-03-04 17:05, Ahmad Rehman wrote:
> In passthrough environment, when amdgpu is reloaded after unload, mode-1
> is triggered after initializing the necessary IPs, That init does not
> include KFD, and KFD init waits until the reset is completed. KFD init
> is called in the reset handler, but in this case, the zone device and
> drm client is not initialized, causing app to create kernel panic.
>
> Signed-off-by: Ahmad Rehman <Ahmad.Rehman@xxxxxxx>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 15b188aaf681..80b9642f2bc4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2479,8 +2479,11 @@ static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
> }
> for (i = 0; i < mgpu_info.num_dgpu; i++) {
> adev = mgpu_info.gpu_ins[i].adev;
> - if (!adev->kfd.init_complete)
> + if (!adev->kfd.init_complete) {
> + kgd2kfd_init_zone_device(adev);
> amdgpu_amdkfd_device_init(adev);
> + amdgpu_amdkfd_drm_client_create(adev);
I don't see what's preventing the DRM client initialization in the
reset-on-driver-load case. It only needs to be created once and that
happens in amdgpu_pci_probe. Am I missing anything?
Regards,
Felix
> + }
> amdgpu_ttm_set_buffer_funcs_status(adev, true);
> }
> }