[Public] > -----Original Message----- > From: Liang, Prike <Prike.Liang@xxxxxxx> > Sent: Tuesday, January 14, 2025 12:15 AM > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix > <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Kim, > Jonathan <Jonathan.Kim@xxxxxxx>; Liang, Prike <Prike.Liang@xxxxxxx> > Subject: [PATCH] drm/amdgpu: validate process_context_addr for the MES shader > debugger > > The following page fault was observed during the exit moment of the > HIP test process. In this particular error case, the HIP test > (./MemcpyPerformance -h) does not require the AQL queue. As a result, I don't think this has anything to do with AQL compute specifically but a quirk in the KFD where the KFD only creates the process device mes context when adding the first queue. Maybe we should move context creation from the KFD add_queue_mes call to KFD process device creation when MES is supported. Strangely enough KGD has an amdgpu_mes_create_process call that doesn't seem to be used but does exactly this. Otherwise, since the driver is always supposed to allocate and pass a valid context for any mes call, maybe it's better to do a context null check with some comments in the kfd_process_dequeue_from_all_devices call to reflect this quirk. Jon > the process_context_addr was not assigned when the KFD process was > released, ultimately leading to this page fault during the execution of > kfd_process_dequeue_from_all_devices(). > > [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 > ring:153 vmid:0 pasid:0) > [345962.295333] amdgpu 0000:03:00.0: amdgpu: in page starting at address > 0x0000000000000000 from client 10 > [345962.295775] amdgpu 0000:03:00.0: amdgpu: > GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33 > [345962.296097] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC > (0x5) > [345962.296394] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1 > [345962.296633] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1 > [345962.296876] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3 > [345962.297135] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1 > [345962.297377] amdgpu 0000:03:00.0: amdgpu: RW: 0x0 > [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 > ring:169 vmid:0 pasid:0) > > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > index cee38bb6cfaf..4d313144cc4b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > @@ -1062,6 +1062,11 @@ int amdgpu_mes_flush_shader_debugger(struct > amdgpu_device *adev, > return -EINVAL; > } > > + if (!process_context_addr) { > + dev_warn(adev->dev, "invalidated process context addr\n"); > + return -EINVAL; > + } > + > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > op_input.set_shader_debugger.process_context_addr = > process_context_addr; > op_input.set_shader_debugger.flags.process_ctx_flush = true; > -- > 2.34.1