[Public] > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx> > Sent: Friday, January 17, 2025 12:21 AM > To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Xiao, > Jack <Jack.Xiao@xxxxxxx> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix > <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; > Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx> > Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES > shader debugger > > [Public] > > > -----Original Message----- > > From: Liang, Prike <Prike.Liang@xxxxxxx> > > Sent: Thursday, January 16, 2025 4:16 AM > > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; > > amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Xiao, Jack <Jack.Xiao@xxxxxxx> > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix > > <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> > > Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for the > > MES shader debugger > > > > [Public] > > > > > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx> > > > Sent: Wednesday, January 15, 2025 1:14 AM > > > To: Liang, Prike <Prike.Liang@xxxxxxx>; > > > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix > > > <Felix.Kuehling@xxxxxxx>; Koenig, Christian > > > <Christian.Koenig@xxxxxxx> > > > Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for > > > the MES shader debugger > > > > > > [Public] > > > > > > > -----Original Message----- > > > > From: Liang, Prike <Prike.Liang@xxxxxxx> > > > > Sent: Tuesday, January 14, 2025 12:15 AM > > > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, > > > > Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian > > > > <Christian.Koenig@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; > > > > Liang, Prike <Prike.Liang@xxxxxxx> > > > > Subject: [PATCH] drm/amdgpu: validate process_context_addr for the > > > > MES shader debugger > > > > > > > > The following page fault was observed during the exit moment of > > > > the HIP test process. In this particular error case, the HIP test > > > > (./MemcpyPerformance -h) does not require the AQL queue. As a > > > > result, > > > > > > I don't think this has anything to do with AQL compute specifically > > > but a quirk in the KFD where the KFD only creates the process device > > > mes context when adding the first queue. > > > Maybe we should move context creation from the KFD add_queue_mes > > > call to > > KFD > > > process device creation when MES is supported. > > Looking at the git logs, the allocation got moved to queue creation recently which > is probably why we're seeing this bug now (added Jesse). > > Commit description: > drm/amdkfd: pause autosuspend when creating pdd > > When using MES creating a pdd will require talking to the GPU to > setup the relevant context. The code here forgot to wake up the GPU > in case it was in suspend, this causes KVM to EFAULT for passthrough > GPU for example. This issue can be masked if the GPU was woken up by > other things (e.g. opening the KMS node) first and have not yet gone to sleep. > > v4: do the allocation of proc_ctx_bo in a lazy fashion > when the first queue is created in a process (Felix) > > Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx> > Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx> > > Maybe the solution should have been to transiently disable runtime power > management and leave allocation where it originally was. > set_shader_debugger (wrapper for mes flush) also gets called on debug attach, > which can be a queue-less call that requires MES process context. > Having to remember that MES process context creation happens at the queue > level is confusing. > > Jon Thank @Kim, Jonathan for the input, if the original issue can be resolved by exiting the runtime suspension at the kfd_create_process_device_data() then we can revert the regression commit. Thanks, Prike > > > > I didn't look much into what queue type is created in the HIP test: > > MemcpyPerformance, but the issue is that the KFD queue isn't created > > when executing MemcpyPerformance -h. This means that the MES context > > process isn't created yet in the KFD process that is released. In this > > case, the MES queue will not be set up. Do we need to create the MES process > at each KFD process creation? > > > > > Strangely enough KGD has an amdgpu_mes_create_process call that > > > doesn't > > seem > > > to be used but does exactly this. > > > > Yes, it seems like dead code and may be implemented for user queue > > cases in future usage. Maybe we can exclude > > amdgpu_mes_create_process() and > > amdgpu_mes_destroy_process() from compiling to reduce the kernel > > binary size. Hi, @Xiao, Jack could you please clarify why we need the > > unused function > > amdgpu_mes_create_process() here? > > > > > Otherwise, since the driver is always supposed to allocate and pass > > > a valid context for any mes call, maybe it's better to do a context > > > null check with some comments > > in > > > the kfd_process_dequeue_from_all_devices call to reflect this quirk. > > > > > Yes, I prefer to add the MES context check in the > > kfd_process_dequeue_from_all_devices() function rather than creating > > the MES context regardless of whether it is required to create the > > queue at each KFD process creation. > > > > Thanks, > > Prike > > > > > > > > > Jon > > > > > > > > > > > the process_context_addr was not assigned when the KFD process was > > > > released, ultimately leading to this page fault during the > > > > execution of kfd_process_dequeue_from_all_devices(). > > > > > > > > [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault > > > > (src_id:0 > > > > ring:153 vmid:0 pasid:0) > > > > [345962.295333] amdgpu 0000:03:00.0: amdgpu: in page starting at > address > > > > 0x0000000000000000 from client 10 > > > > [345962.295775] amdgpu 0000:03:00.0: amdgpu: > > > > GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33 > > > > [345962.296097] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: > > CPC > > > > (0x5) > > > > [345962.296394] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1 > > > > [345962.296633] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1 > > > > [345962.296876] amdgpu 0000:03:00.0: amdgpu: > PERMISSION_FAULTS: > > 0x3 > > > > [345962.297135] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: > 0x1 > > > > [345962.297377] amdgpu 0000:03:00.0: amdgpu: RW: 0x0 > > > > [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault > > > > (src_id:0 > > > > ring:169 vmid:0 pasid:0) > > > > > > > > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx> > > > > --- > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 +++++ > > > > 1 file changed, 5 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > index cee38bb6cfaf..4d313144cc4b 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > @@ -1062,6 +1062,11 @@ int amdgpu_mes_flush_shader_debugger(struct > > > > amdgpu_device *adev, > > > > return -EINVAL; > > > > } > > > > > > > > + if (!process_context_addr) { > > > > + dev_warn(adev->dev, "invalidated process context addr\n"); > > > > + return -EINVAL; > > > > + } > > > > + > > > > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > > > > op_input.set_shader_debugger.process_context_addr = > > > > process_context_addr; > > > > op_input.set_shader_debugger.flags.process_ctx_flush = true; > > > > -- > > > > 2.34.1 > > > > > >
Attachment:
0002-drm-amdkfd-exit-runtime-suspend-for-the-PDD.patch
Description: 0002-drm-amdkfd-exit-runtime-suspend-for-the-PDD.patch
Attachment:
0001-Revert-drm-amdkfd-pause-autosuspend-when-creating-pd.patch
Description: 0001-Revert-drm-amdkfd-pause-autosuspend-when-creating-pd.patch