RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES shader debugger

[Public]

> -----Original Message-----
> From: Liang, Prike <Prike.Liang@xxxxxxx>
> Sent: Tuesday, January 14, 2025 12:15 AM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Kim,
> Jonathan <Jonathan.Kim@xxxxxxx>; Liang, Prike <Prike.Liang@xxxxxxx>
> Subject: [PATCH] drm/amdgpu: validate process_context_addr for the MES shader
> debugger
>
> The following page fault was observed during the exit moment of the
> HIP test process. In this particular error case, the HIP test
> (./MemcpyPerformance -h) does not require the AQL queue. As a result,

I don't think this is specific to AQL compute. It looks like a KFD quirk: the KFD only creates the process device MES context when the first queue is added.
Maybe we should move context creation out of the KFD add_queue_mes call and into KFD process device creation when MES is supported.
Strangely enough, KGD has an amdgpu_mes_create_process call that does exactly this but doesn't seem to be used.
Otherwise, since the driver is always supposed to allocate and pass a valid context for any MES call, it may be better to add a context NULL check, with a comment explaining this quirk, in the kfd_process_dequeue_from_all_devices call.
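
To illustrate the NULL-check option, here is a rough, untested userspace sketch. It does not use the real KFD structures or the real amdgpu_mes_flush_shader_debugger signature: struct pdd_sketch, its ctx_addr field, mes_flush_sketch and dequeue_one_sketch are all hypothetical stand-ins, and the only point is where the check and the comment would sit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device process data (not the real KFD struct). */
struct pdd_sketch {
	uint64_t ctx_addr;   /* stays 0 until the first queue is added */
	int flush_calls;     /* counts how often the flush was issued */
};

/* Hypothetical stand-in for the MES flush: a valid context is required. */
static int mes_flush_sketch(struct pdd_sketch *pdd)
{
	assert(pdd != NULL);
	if (!pdd->ctx_addr)
		return -EINVAL;
	pdd->flush_calls++;
	return 0;
}

/* Hypothetical stand-in for the per-device step of the dequeue-from-all-devices path. */
static int dequeue_one_sketch(struct pdd_sketch *pdd)
{
	/*
	 * Quirk: the process context is only created when the first queue
	 * is added, so a process that never added a queue has no context.
	 * There is nothing to flush in that case; skip the MES call.
	 */
	if (!pdd->ctx_addr)
		return 0;

	return mes_flush_sketch(pdd);
}
```

With the check in the caller, a process that never added a queue skips the flush quietly instead of handing MES a zero address, and the -EINVAL path in the callee stays as a backstop.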

Jon

> the process_context_addr was not assigned when the KFD process was
> released, ultimately leading to this page fault during the execution of
> kfd_process_dequeue_from_all_devices().
>
> [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0
> ring:153 vmid:0 pasid:0)
> [345962.295333] amdgpu 0000:03:00.0: amdgpu:   in page starting at address
> 0x0000000000000000 from client 10
> [345962.295775] amdgpu 0000:03:00.0: amdgpu:
> GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> [345962.296097] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: CPC
> (0x5)
> [345962.296394] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
> [345962.296633] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x1
> [345962.296876] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
> [345962.297135] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x1
> [345962.297377] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
> [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0
> ring:169 vmid:0 pasid:0)
>
> Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index cee38bb6cfaf..4d313144cc4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -1062,6 +1062,11 @@ int amdgpu_mes_flush_shader_debugger(struct
> amdgpu_device *adev,
>               return -EINVAL;
>       }
>
> +     if (!process_context_addr) {
> +             dev_warn(adev->dev, "invalidated process context addr\n");
> +             return -EINVAL;
> +     }
> +
>       op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
>       op_input.set_shader_debugger.process_context_addr =
> process_context_addr;
>       op_input.set_shader_debugger.flags.process_ctx_flush = true;
> --
> 2.34.1




