[Public] > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx> > Sent: Wednesday, January 15, 2025 1:14 AM > To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix > <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> > Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES > shader debugger > > [Public] > > > -----Original Message----- > > From: Liang, Prike <Prike.Liang@xxxxxxx> > > Sent: Tuesday, January 14, 2025 12:15 AM > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix > > <Felix.Kuehling@xxxxxxx>; Koenig, Christian > > <Christian.Koenig@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; > > Liang, Prike <Prike.Liang@xxxxxxx> > > Subject: [PATCH] drm/amdgpu: validate process_context_addr for the MES > > shader debugger > > > > The following page fault was observed during the exit moment of the > > HIP test process. In this particular error case, the HIP test > > (./MemcpyPerformance -h) does not require the AQL queue. As a result, > > I don't think this has anything to do with AQL compute specifically but a quirk in the > KFD where the KFD only creates the process device mes context when adding the > first queue. > Maybe we should move context creation from the KFD add_queue_mes call to KFD > process device creation when MES is supported. I didn't look much into what queue type is created in the HIP test: MemcpyPerformance, but the issue is that the KFD queue isn't created when executing MemcpyPerformance -h. This means that the MES context process isn't created yet in the KFD process that is released. In this case, the MES queue will not be set up. Do we need to create the MES process at each KFD process creation? > Strangely enough KGD has an amdgpu_mes_create_process call that doesn't seem > to be used but does exactly this. Yes, it seems like dead code and may be implemented for user queue cases in future usage. Maybe we can exclude amdgpu_mes_create_process() and amdgpu_mes_destroy_process() from compiling to reduce the kernel binary size. Hi, @Xiao, Jack could you please clarify why we need the unused function amdgpu_mes_create_process() here? > Otherwise, since the driver is always supposed to allocate and pass a valid context > for any mes call, maybe it's better to do a context null check with some comments in > the kfd_process_dequeue_from_all_devices call to reflect this quirk. > Yes, I prefer to add the MES context check in the kfd_process_dequeue_from_all_devices() function rather than creating the MES context regardless of whether it is required to create the queue at each KFD process creation. Thanks, Prike > Jon > > > the process_context_addr was not assigned when the KFD process was > > released, ultimately leading to this page fault during the execution > > of kfd_process_dequeue_from_all_devices(). > > > > [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault > > (src_id:0 > > ring:153 vmid:0 pasid:0) > > [345962.295333] amdgpu 0000:03:00.0: amdgpu: in page starting at address > > 0x0000000000000000 from client 10 > > [345962.295775] amdgpu 0000:03:00.0: amdgpu: > > GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33 > > [345962.296097] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC > > (0x5) > > [345962.296394] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1 > > [345962.296633] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1 > > [345962.296876] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3 > > [345962.297135] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1 > > [345962.297377] amdgpu 0000:03:00.0: amdgpu: RW: 0x0 > > [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault > > (src_id:0 > > ring:169 vmid:0 pasid:0) > > > > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > index cee38bb6cfaf..4d313144cc4b 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > @@ -1062,6 +1062,11 @@ int amdgpu_mes_flush_shader_debugger(struct > > amdgpu_device *adev, > > return -EINVAL; > > } > > > > + if (!process_context_addr) { > > + dev_warn(adev->dev, "invalidated process context addr\n"); > > + return -EINVAL; > > + } > > + > > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > > op_input.set_shader_debugger.process_context_addr = > > process_context_addr; > > op_input.set_shader_debugger.flags.process_ctx_flush = true; > > -- > > 2.34.1 >