[Public]

Thank you for the detailed review. I will combine this check into the line above before submitting the patch.

Regards,
Prike

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Kim,
> Jonathan
> Sent: Thursday, January 23, 2025 11:23 PM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>
> Subject: RE: [PATCH] drm/amdkfd: only flush the validate MES contex
>
> [Public]
>
> > -----Original Message-----
> > From: Liang, Prike <Prike.Liang@xxxxxxx>
> > Sent: Wednesday, January 22, 2025 4:26 AM
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>;
> > Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>; Liang, Prike
> > <Prike.Liang@xxxxxxx>
> > Subject: [PATCH] drm/amdkfd: only flush the validate MES contex
> >
> > The following page fault was observed during the KFD process release.
> > In this particular error case, the HIP test (./MemcpyPerformance -h)
> > does not require a queue. As a result, the process_context_addr was
> > not assigned when the KFD process was released, ultimately leading to
> > this page fault during the execution of kfd_process_dequeue_from_all_devices().
> >
> > [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> > [345962.295333] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> > [345962.295775] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> > [345962.296097] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
> > [345962.296394] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
> > [345962.296633] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
> > [345962.296876] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
> > [345962.297135] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
> > [345962.297377] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
> > [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> >
> > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > index 9c2d8393cd4c..c39cdff58418 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > @@ -86,9 +86,13 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
> >
> >       if (pdd->already_dequeued)
> >               return;
> > -
> > +     /* The MES context flush needs to filter out the case which the
> > +      * KFD process is created without setting up the MES context and
> > +      * queue for creating a compute queue.
> > +      */
> >       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> >       if (dev->kfd->shared_resources.enable_mes &&
> > +             !!pdd->proc_ctx_gpu_addr &&
>
> You can probably combine this check in the line above since doing that would not
> exceed the recommended line limit of 100 characters.
> Otherwise, align the indentation of the new check under the previous line's check
> for legibility.
>
> With that fixed:
> Reviewed-by: Jonathan Kim <jonathan.kim@xxxxxxx>
>
> >           down_read_trylock(&dev->adev->reset_domain->sem)) {
> >               amdgpu_mes_flush_shader_debugger(dev->adev,
> >                                                pdd->proc_ctx_gpu_addr);
> > --
> > 2.34.1
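
For reference, the guard discussed in this thread, with the address check combined onto the enable_mes line as suggested in the review, can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel code: struct pdd_model, trylock_model() and dequeue_model() are hypothetical stand-ins for struct kfd_process_device, down_read_trylock() and the tail of kfd_process_dequeue_from_device(), and the flush is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the fields the check reads; not the real struct. */
struct pdd_model {
	bool enable_mes;            /* models shared_resources.enable_mes */
	uint64_t proc_ctx_gpu_addr; /* 0 when no MES context was set up */
	int trylock_calls;          /* how often the reset lock was tried */
	int flush_calls;            /* how often the flush would have run */
};

/* Stand-in for down_read_trylock(); always succeeds in this model. */
static bool trylock_model(struct pdd_model *pdd)
{
	pdd->trylock_calls++;
	return true;
}

/* Flush only when MES is enabled and a process context address exists. */
static void dequeue_model(struct pdd_model *pdd)
{
	assert(pdd != NULL);

	/* Short-circuit evaluation skips the trylock when the address is 0. */
	if (pdd->enable_mes && pdd->proc_ctx_gpu_addr &&
	    trylock_model(pdd)) {
		pdd->flush_calls++; /* stands in for the MES context flush */
	}
}
```

In this model, a process released without ever creating a queue (proc_ctx_gpu_addr == 0) neither takes the lock nor flushes, which corresponds to the case described in the commit message.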