RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]

I see,  so the  problem is after process context , set_shader been  called with disable parameter,  do you know the  reason why  MES re-added the  process context into the  list ?

Shaoyun.liu

-----Original Message-----
From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
Sent: Tuesday, December 12, 2023 6:07 PM
To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -----Original Message-----
> From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
> Sent: Tuesday, December 12, 2023 5:44 PM
> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Huang, JinHuiEric
> <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?  I
> think  even in that  situation MES should still be able to handle it
> as long as MES already  remove the process context from its list , MES
> will treat the process context as a new item. I still don't understand why MES haven't
> purged the  process context from the list after process termination .   Will
> debug queue itself  also use the add/remove queue interface  and  is
> it possible the debug queue itself from the  old process  still not be
> removed ?

SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
The process list is updated on either on SET_SHADER_DEBUGGER or ADD_QUEUE.
e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list purged) -> runtime_disable (set_shader process re-added) -> process termination (stale list) or debug attach (set_shader) -> add_queue -> remove_queue (list purged) -> debug detach (set_shader process re-added) ->process termination (stale list)

MES has no idea what process termination means.  The new flag is a proxy for this.
There are reasons for process settings to take place prior to queue add (debugger, gfx11 cwsr workaround, core dump etc need this).

I'm not sure what kernel/debug queues have to do with this.
By that argument, the list should be purged.

Jon

>
> Shaoyun.liu
>
> -----Original Message-----
> From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> Sent: Tuesday, December 12, 2023 4:48 PM
> To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric
> <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -----Original Message-----
> > From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
> > Sent: Tuesday, December 12, 2023 4:45 PM
> > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Huang, JinHuiEric
> > <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > <Harish.Kasiviswanathan@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Shouldn't the driver side  remove all the remaining  queues for the
> > process during  process termination ?  If all the  queues been
> > removed for the process ,  MES should purge the  process context
> > automatically , otherwise it's bug inside MES .
>
> That's only if there were queues added to begin with.
>
> Jon
>
> >
> > Regard
> > Sshaoyun.liu
> >
> > -----Original Message-----
> > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> > Sent: Tuesday, December 12, 2023 4:33 PM
> > To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric
> > <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > <Harish.Kasiviswanathan@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
> > > Sent: Tuesday, December 12, 2023 4:00 PM
> > > To: Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; Kim, Jonathan
> > > <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > > <Harish.Kasiviswanathan@xxxxxxx>
> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger
> > > process management
> > >
> > > [AMD Official Use Only - General]
> > >
> > > Does this requires the  new MES FW for this process_ctx_flush
> > > requirement ?  Can driver side add logic to guaranty when  call
> > > SET_SHADER_DEBUGGER, the process address  is always valid ?
> >
> > Call to flush on old fw is a NOP so it's harmless in that case.
> > Full solution will still require a new MES version as this is a
> > workaround on corner cases and not a new feature i.e. we can't stop
> > ROCm from running on old fw.
> > The process address is always valid from the driver side.  It's the
> > MES side of things that gets stale as mentioned in the description
> > (passed value to MES is reused with new BO but MES doesn't refresh).
> > i.e. MES auto refreshes it's process list assuming process queues
> > were all drained but driver can't guarantee that SET_SHADER_DEBUGGER
> (which
> > adds to MES's process list) will get called after queues get added
> > (in fact it's a requirements that it can be called at any time).
> > We can attempt to defer calls these calls in the KFD, considering all cases.
> > But that would be a large shift in debugger/runtime_enable/KFD code,
> > which is already complicated and could get buggy plus it would not
> > be intuitive at all as to why we're doing this.
> > I think a single flag set to flush MES on process termination is a
> > simpler compromise that shows the limitation in a more obvious way.
> >
> > Thanks,
> >
> > Jon
> >
> >
> > >
> > > Regards
> > > Shaoyun.liu
> > >
> > >
> > > -----Original Message-----
> > > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> > > Eric Huang
> > > Sent: Tuesday, December 12, 2023 12:49 PM
> > > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-
> > > gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > > <Harish.Kasiviswanathan@xxxxxxx>
> > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger
> > > process management
> > >
> > >
> > > On 2023-12-11 16:16, Jonathan Kim wrote:
> > > > MES provides the driver a call to explicitly flush stale process
> > > > memory within the MES to avoid a race condition that results in
> > > > a fatal memory violation.
> > > >
> > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> > > address
> > > > that represents a process context address MES uses to keep track
> > > > of future per-process calls.
> > > >
> > > > Normally, MES will purge its process context list when the last
> > > > queue has been removed.  The driver, however, can call
> > > > SET_SHADER_DEBUGGER regardless of whether a queue has been
> added
> > or not.
> > > >
> > > > If SET_SHADER_DEBUGGER has been called with no queues as the
> > > > last call prior to process termination, the passed process
> > > > context address will still reside within MES.
> > > >
> > > > On a new process call to SET_SHADER_DEBUGGER, the driver may end
> > up
> > > > passing an identical process context address value (based on
> > > > per-process gpu memory address) to MES but is now pointing to a
> > > > new allocated buffer object during KFD process creation.  Since
> > > > the MES is unaware of this, access of the passed address points
> > > > to the stale object within MES and triggers a fatal memory violation.
> > > >
> > > > The solution is for KFD to explicitly flush the process context
> > > > address from MES on process termination.
> > > >
> > > > Note that the flush call and the MES debugger calls use the same
> > > > MES interface but are separated as KFD calls to avoid
> > > > conflicting with each other.
> > > >
> > > > Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
> > > > Tested-by: Alice Wong <shiwei.wong@xxxxxxx>
> > > Reviewed-by: Eric Huang <jinhuieric.huang@xxxxxxx>
> > > > ---
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c       | 31
> > > +++++++++++++++++++
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h       | 10 +++---
> > > >   .../amd/amdkfd/kfd_process_queue_manager.c    |  1 +
> > > >   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
> > > >   4 files changed, 40 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > index e544b823abf6..e98de23250dc 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > @@ -916,6 +916,11 @@ int
> amdgpu_mes_set_shader_debugger(struct
> > > amdgpu_device *adev,
> > > >       op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> > > >       op_input.set_shader_debugger.process_context_addr =
> > > process_context_addr;
> > > >       op_input.set_shader_debugger.flags.u32all = flags;
> > > > +
> > > > +     /* use amdgpu mes_flush_shader_debugger instead */
> > > > +     if (op_input.set_shader_debugger.flags.process_ctx_flush)
> > > > +             return -EINVAL;
> > > > +
> > > >       op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl =
> > > spi_gdbg_per_vmid_cntl;
> > > >       memcpy(op_input.set_shader_debugger.tcp_watch_cntl,
> > > tcp_watch_cntl,
> > > >
> > > > sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
> > > > @@ -935,6 +940,32 @@ int
> amdgpu_mes_set_shader_debugger(struct
> > > amdgpu_device *adev,
> > > >       return r;
> > > >   }
> > > >
> > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device
> *adev,
> > > > +                                  uint64_t process_context_addr) {
> > > > +     struct mes_misc_op_input op_input = {0};
> > > > +     int r;
> > > > +
> > > > +     if (!adev->mes.funcs->misc_op) {
> > > > +             DRM_ERROR("mes flush shader debugger is not
> supported!\n");
> > > > +             return -EINVAL;
> > > > +     }
> > > > +
> > > > +     op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> > > > +     op_input.set_shader_debugger.process_context_addr =
> > > process_context_addr;
> > > > +     op_input.set_shader_debugger.flags.process_ctx_flush =
> > > > + true;
> > > > +
> > > > +     amdgpu_mes_lock(&adev->mes);
> > > > +
> > > > +     r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
> > > > +     if (r)
> > > > +             DRM_ERROR("failed to set_shader_debugger\n");
> > > > +
> > > > +     amdgpu_mes_unlock(&adev->mes);
> > > > +
> > > > +     return r;
> > > > +}
> > > > +
> > > >   static void
> > > >   amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
> > > >                              struct amdgpu_ring *ring, diff
> > > > --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > index 894b9b133000..7d4f93fea937 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > @@ -296,9 +296,10 @@ struct mes_misc_op_input {
> > > >                       uint64_t process_context_addr;
> > > >                       union {
> > > >                               struct {
> > > > -                                     uint64_t single_memop : 1;
> > > > -                                     uint64_t single_alu_op : 1;
> > > > -                                     uint64_t reserved: 30;
> > > > +                                     uint32_t single_memop : 1;
> > > > +                                     uint32_t single_alu_op : 1;
> > > > +                                     uint32_t reserved: 29;
> > > > +                                     uint32_t process_ctx_flush:
> > > > + 1;
> > > >                               };
> > > >                               uint32_t u32all;
> > > >                       } flags;
> > > > @@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct
> > > amdgpu_device *adev,
> > > >                               const uint32_t *tcp_watch_cntl,
> > > >                               uint32_t flags,
> > > >                               bool trap_en);
> > > > -
> > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device
> *adev,
> > > > +                             uint64_t process_context_addr);
> > > >   int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id,
> > > >                       int queue_type, int idx,
> > > >                       struct amdgpu_mes_ctx_data *ctx_data, diff
> > > > --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > index 77f493262e05..8e55e78fce4e 100644
> > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > @@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct
> > > kfd_process_device *pdd)
> > > >               return;
> > > >
> > > >       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> > > > +     amdgpu_mes_flush_shader_debugger(dev->adev, pdd-
> > > >proc_ctx_gpu_addr);
> > > >       pdd->already_dequeued = true;
> > > >   }
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > index 1fbfd1aa987e..ec5b9ab67c5e 100644
> > > > --- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > +++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > @@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER {
> > > >               struct {
> > > >                       uint32_t single_memop : 1;  /*
> > > > SQ_DEBUG.single_memop
> > */
> > > >                       uint32_t single_alu_op : 1; /* SQ_DEBUG.single_alu_op */
> > > > -                     uint32_t reserved : 30;
> > > > +                     uint32_t reserved : 29;
> > > > +                     uint32_t process_ctx_flush : 1;
> > > >               };
> > > >               uint32_t u32all;
> > > >       } flags;
> > >
> >
> >
>
>






[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux