[AMD Official Use Only - AMD Internal Distribution Only]
Right. That involves more changes from kfd to amdkfd interface to amdgpu ras interface. And need to consider reenabling unmap queue at some point. Let me think about more how to put these together
and make it be part of the upcoming ras series.
Regards,
Hawking
Hawking
-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Friday, September 6, 2024 17:47
To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhou1, Tao <Tao.Zhou1@xxxxxxx>
Subject: Re: [PATCH] drm/amdkfd: Select reset method for poison handling
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Friday, September 6, 2024 17:47
To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhou1, Tao <Tao.Zhou1@xxxxxxx>
Subject: Re: [PATCH] drm/amdkfd: Select reset method for poison handling
On 9/6/2024 1:42 PM, Hawking Zhang wrote:
> Driver mode-2 is only supported by relative new smc firmware.
>
> Signed-off-by: Hawking Zhang <Hawking.Zhang@xxxxxxx>
> ---
> .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 40 +++++++++++++++----
> 1 file changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index fecdbbab9894..d46a13156ee9 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -167,11 +167,23 @@ static void event_interrupt_poison_consumption_v9(struct kfd_node *dev,
> case SOC15_IH_CLIENTID_SE3SH:
> case SOC15_IH_CLIENTID_UTCL2:
> block = AMDGPU_RAS_BLOCK__GFX;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
> - amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3)) {
> + /* driver mode-2 for gfx poison is only supported by
> + * pmfw 0x00557300 and onwards */
> + if (dev->adev->pm.fw_version < 0x00557300)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4)) {
> + /* driver mode-2 for gfx poison is only supported by
> + * pmfw 0x05550C00 and onwards */
> + if (dev->adev->pm.fw_version < 0x05550C00)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else {
> reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
I think it's better to handle this inside amdgpu_ras_do_recovery rather than here.
Something like -
int amdgpu_ras_reset_method_quirk(adev) which returns the right reset method when (ras->gpu_reset_flags & AMDGPU_RAS_GPU_RESET_MODE2_RESET) is set. Or add a few more flags like RAS_SDMA_POISON/RAS_GFX_POISON
and decide the method in amdgpu_ras handling.
Thanks,
Lijo
> break;
> case SOC15_IH_CLIENTID_VMC:
> case SOC15_IH_CLIENTID_VMC1:
> @@ -184,11 +196,23 @@ static void event_interrupt_poison_consumption_v9(struct kfd_node *dev,
> case SOC15_IH_CLIENTID_SDMA3:
> case SOC15_IH_CLIENTID_SDMA4:
> block = AMDGPU_RAS_BLOCK__SDMA;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
> - amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, SDMA0_HWIP, 0) == IP_VERSION(4, 4, 2)) {
> + /* driver mode-2 for gfx poison is only supported by
> + * pmfw 0x00557300 and onwards */
> + if (dev->adev->pm.fw_version < 0x00557300)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else if (amdgpu_ip_version(dev->adev, SDMA0_HWIP, 0) == IP_VERSION(4, 4, 5)) {
> + /* driver mode-2 for gfx poison is only supported by
> + * pmfw 0x05550C00 and onwards */
> + if (dev->adev->pm.fw_version < 0x05550C00)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else {
> reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
> break;
> default:
> dev_warn(dev->adev->dev,