RE: [PATCH] drm/amdkfd: only flush the validate MES contex

"Liang, Prike" <Prike.Liang@xxxxxxx> · Sun, 26 Jan 2025 03:00:56 +0000

[Public]

Thank you for the detailed review. I will combine the new check with the line above before submitting the patch.
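
For reference, a minimal sketch of how the combined condition could read. It uses hypothetical stand-in types rather than the real amdkfd structures: mock_pdd, its fields, and should_flush_mes_context() are illustrative names only and are not part of the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device process state; NOT the real
 * struct kfd_process_device. */
struct mock_pdd {
	bool enable_mes;            /* models dev->kfd->shared_resources.enable_mes */
	uint64_t proc_ctx_gpu_addr; /* 0 when no MES process context was set up */
	bool reset_sem_available;   /* models down_read_trylock() succeeding */
};

/* Returns true only when an MES context flush should be attempted. */
static bool should_flush_mes_context(const struct mock_pdd *pdd)
{
	/* The address check sits on the same line as the MES check and
	 * before the lock check, so a process without an MES context
	 * never reaches the lock step. */
	return pdd->enable_mes && pdd->proc_ctx_gpu_addr &&
	       pdd->reset_sem_available;
}
```

In the real hunk the last operand is down_read_trylock(), which takes the semaphore when it succeeds. Because && short-circuits, keeping the address check ahead of it means the semaphore is not taken for a process that has no context to flush.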

Regards,
      Prike

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Kim, Jonathan
> Sent: Thursday, January 23, 2025 11:23 PM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>
> Subject: RE: [PATCH] drm/amdkfd: only flush the validate MES contex
>
> [Public]
>
> > -----Original Message-----
> > From: Liang, Prike <Prike.Liang@xxxxxxx>
> > Sent: Wednesday, January 22, 2025 4:26 AM
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>;
> > Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>; Liang, Prike
> > <Prike.Liang@xxxxxxx>
> > Subject: [PATCH] drm/amdkfd: only flush the validate MES contex
> >
> > The following page fault was observed during the KFD process release.
> > In this particular error case, the HIP test (./MemcpyPerformance -h)
> > does not require the queue. As a result, the process_context_addr was
> > not assigned when the KFD process was released, ultimately leading to
> > this page fault during the execution of kfd_process_dequeue_from_all_devices().
> >
> > [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> > [345962.295333] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
> > [345962.295775] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> > [345962.296097] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: CPC (0x5)
> > [345962.296394] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
> > [345962.296633] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x1
> > [345962.296876] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
> > [345962.297135] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x1
> > [345962.297377] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
> > [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> >
> > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > index 9c2d8393cd4c..c39cdff58418 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > @@ -86,9 +86,13 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
> >
> >       if (pdd->already_dequeued)
> >               return;
> > -
> > +     /* The MES context flush needs to filter out the case in which the
> > +      * KFD process is created without setting up the MES context and
> > +      * queue for creating a compute queue.
> > +      */
> >       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> >       if (dev->kfd->shared_resources.enable_mes &&
> > +                     !!pdd->proc_ctx_gpu_addr &&
>
>
> You can probably combine this check in the line above since doing that would not
> exceed the recommended line limit of 100 characters.
> Otherwise, align the indentation of the new check under the previous line's check
> for legibility.
>
> With that fixed:
> Reviewed-by: Jonathan Kim <jonathan.kim@xxxxxxx>
>
> >           down_read_trylock(&dev->adev->reset_domain->sem)) {
> >               amdgpu_mes_flush_shader_debugger(dev->adev,
> >                                                pdd->proc_ctx_gpu_addr);
> > --
> > 2.34.1