[Public]

> -----Original Message-----
> From: Liang, Prike <Prike.Liang@xxxxxxx>
> Sent: Wednesday, January 22, 2025 4:26 AM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>;
> Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>; Liang, Prike
> <Prike.Liang@xxxxxxx>
> Subject: [PATCH] drm/amdkfd: only flush the validate MES contex
>
> The following page fault was observed during the KFD process release.
> In this particular error case, the HIP test (./MemcpyPerformance -h)
> does not require the queue. As a result, the process_context_addr was
> not assigned when the KFD process was released, ultimately leading to
> this page fault during the execution of kfd_process_dequeue_from_all_devices().
>
> [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0
> ring:153 vmid:0 pasid:0)
> [345962.295333] amdgpu 0000:03:00.0: amdgpu:  in page starting at address
> 0x0000000000000000 from client 10
> [345962.295775] amdgpu 0000:03:00.0: amdgpu:
> GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> [345962.296097] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: CPC
> (0x5)
> [345962.296394] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
> [345962.296633] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x1
> [345962.296876] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
> [345962.297135] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x1
> [345962.297377] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
> [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0
> ring:169 vmid:0 pasid:0)
>
> Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 9c2d8393cd4c..c39cdff58418 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -86,9 +86,13 @@ void kfd_process_dequeue_from_device(struct
> kfd_process_device *pdd)
>
>       if (pdd->already_dequeued)
>               return;
> -
> +     /* The MES context flush needs to filter out the case which the
> +      * KFD process is created without setting up the MES context and
> +      * queue for creating a compute queue.
> +      */
>       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
>       if (dev->kfd->shared_resources.enable_mes &&
> +             !!pdd->proc_ctx_gpu_addr &&

You can probably combine this check in the line above since doing that would
not exceed the recommended line limit of 100 characters.
Otherwise, align the indentation of the new check under the previous line's
check for legibility.

With that fixed:
Reviewed-by: Jonathan Kim <jonathan.kim@xxxxxxx>

>           down_read_trylock(&dev->adev->reset_domain->sem)) {
>               amdgpu_mes_flush_shader_debugger(dev->adev,
>                                                pdd->proc_ctx_gpu_addr);
> --
> 2.34.1
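
For illustration, the layout suggested in the review comment could look roughly like the standalone sketch below. This is a userspace mock, not the kernel code: `struct mock_pdd`, its fields, and `should_flush_mes_context()` are hypothetical names that only stand in for the real structures, and `reset_lock_available` stands in for `down_read_trylock()` succeeding. The address check is combined onto the first line of the condition, and the continuation line is aligned under the first operand.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the per-device process data. Only the values
 * read by the guarded condition are modelled here.
 */
struct mock_pdd {
	bool enable_mes;            /* mirrors shared_resources.enable_mes */
	uint64_t proc_ctx_gpu_addr; /* 0 when no MES context was set up */
	bool reset_lock_available;  /* mirrors down_read_trylock() succeeding */
};

/*
 * Returns true only when MES is enabled, a process context address was
 * assigned, and the reset lock could be taken. A process that never
 * created a queue has proc_ctx_gpu_addr == 0 and is filtered out, so the
 * flush is never issued against address 0.
 */
static bool should_flush_mes_context(const struct mock_pdd *pdd)
{
	/* The address is used directly as a truth value; "!!" is not needed. */
	if (pdd->enable_mes && pdd->proc_ctx_gpu_addr &&
	    pdd->reset_lock_available)
		return true;

	return false;
}
```

In this sketch a process with `proc_ctx_gpu_addr == 0` never reaches the flush, which is the case the patch description ties to the page fault at address 0x0.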