RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]


Check the MES API,  It’s  my fault , originally I think it will use  trap_en as parameter to tell MES to enable/disable shader debugger , but actually it’s not .  So we either need to add a parameter for this (ex . add flag for enable/disable)  or as your solution add flag for flush.  Consider it’s possible user process can  be killed after call set_shader but before any add_queue , then notify MES to do a process context flush after process termination seems  more reasonable .

 

Regards

Shaoyun.liu

 

 

 

From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
Sent: Tuesday, December 12, 2023 11:43 PM
To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

 

[Public]

 

Again, MES only knows to flush if there was something enqueued in the first place. 

SET_SHADER dictates what’s on the process list. 

SET_SHADER can be the last call prior to process termination with nothing enqueued, hence no MES auto flush occurs.

 

MES doesn’t block anything on the flush flag request. 

The driver guarantees that flush is only done on process termination after device dequeue, whether there were queues or not.

MES has no idea what an invalid context is. 

It just has a value stored in its linked list that’s associated with a driver allocated BO that no longer exists after process termination.

 

If you’re still not sure about this solution, then this should be discussed offline with the MES team.

We’re not going to gain ground discussing this here.  The solution has already been merged.

Feel free to propose a better solution if you’re not satisfied with this one.

 

Jon

 

From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
Sent: Tuesday, December 12, 2023 11:08 PM
To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management

 

[Public]

 

You try to add one new interface to inform mes about the context flush after driver side finish process termination , from my understanding, mes already know the process context need to be purged after all the related queues been removed even without this notification. What do you expect mes to do about this context flush flag ? Mes should block this process context for next set_sched command? Mes can achieve  this by ignore the set_sched command with trap disable parameter on an invalid process context .

 

Shaoyun.liu


From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
Sent: Tuesday, December 12, 2023 8:19:09 PM
To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

 

[Public]

> -----Original Message-----
> From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
> Sent: Tuesday, December 12, 2023 7:08 PM
> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Huang, JinHuiEric
> <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> I see,  so the  problem is after process context , set_shader been  called with
> disable parameter,  do you know the  reason why  MES re-added the
> process context into the  list ?

Because MES has no idea what disable means.

All it knows is that without the flush flag, set_shader should update the necessary per-VMID (process) registers as requested by the driver, which requires persistent per-process HW settings so that potential future waves can inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why ADD_QUEUE auto clears the process context otherwise is another long story, basically an unsolvable MES cache bug problem).

Common use case example:
add_queue -> set_shader call either transiently stalls the SPI per-VMID or transiently dequeues the HWS per-VMID depending on the request settings -> fulfils the per-VMID register write updates -> resumes process queues so that potential waves on those queues inherit new debug settings.

You can't do this kind of operation at the queue level alone.

The problem that this patch solves (along with the MES FW upgrade) is an unfortunate quirk of having to operate between process (debug requests) and queue space (non-debug requests).
Old HWS used to operate at the per-process level via MAP_PROCESS so it was a lot easier to balance debug versus non-debug requests back then (but it was also lot less efficient performance wise).

Jon

>
> Shaoyun.liu
>
> -----Original Message-----
> From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> Sent: Tuesday, December 12, 2023 6:07 PM
> To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric
> <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -----Original Message-----
> > From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
> > Sent: Tuesday, December 12, 2023 5:44 PM
> > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Huang, JinHuiEric
> > <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > <Harish.Kasiviswanathan@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?
> I
> > think  even in that  situation MES should still be able to handle it
> > as long as MES already  remove the process context from its list , MES
> > will treat the process context as a new item. I still don't understand why
> MES haven't
> > purged the  process context from the list after process termination .   Will
> > debug queue itself  also use the add/remove queue interface  and  is
> > it possible the debug queue itself from the  old process  still not be
> > removed ?
>
> SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
> The process list is updated on either on SET_SHADER_DEBUGGER or
> ADD_QUEUE.
> e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list
> purged) -> runtime_disable (set_shader process re-added) -> process
> termination (stale list) or debug attach (set_shader) -> add_queue ->
> remove_queue (list purged) -> debug detach (set_shader process re-added) -
> >process termination (stale list)
>
> MES has no idea what process termination means.  The new flag is a proxy
> for this.
> There are reasons for process settings to take place prior to queue add
> (debugger, gfx11 cwsr workaround, core dump etc need this).
>
> I'm not sure what kernel/debug queues have to do with this.
> By that argument, the list should be purged.
>
> Jon
>
> >
> > Shaoyun.liu
> >
> > -----Original Message-----
> > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> > Sent: Tuesday, December 12, 2023 4:48 PM
> > To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric
> > <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > <Harish.Kasiviswanathan@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
> > > Sent: Tuesday, December 12, 2023 4:45 PM
> > > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Huang, JinHuiEric
> > > <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > > <Harish.Kasiviswanathan@xxxxxxx>
> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > > management
> > >
> > > [Public]
> > >
> > > Shouldn't the driver side  remove all the remaining  queues for the
> > > process during  process termination ?  If all the  queues been
> > > removed for the process ,  MES should purge the  process context
> > > automatically , otherwise it's bug inside MES .
> >
> > That's only if there were queues added to begin with.
> >
> > Jon
> >
> > >
> > > Regard
> > > Sshaoyun.liu
> > >
> > > -----Original Message-----
> > > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> > > Sent: Tuesday, December 12, 2023 4:33 PM
> > > To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; Huang, JinHuiEric
> > > <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > > <Harish.Kasiviswanathan@xxxxxxx>
> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > > management
> > >
> > > [Public]
> > >
> > > > -----Original Message-----
> > > > From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>
> > > > Sent: Tuesday, December 12, 2023 4:00 PM
> > > > To: Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; Kim, Jonathan
> > > > <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > > > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > > > <Harish.Kasiviswanathan@xxxxxxx>
> > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger
> > > > process management
> > > >
> > > > [AMD Official Use Only - General]
> > > >
> > > > Does this requires the  new MES FW for this process_ctx_flush
> > > > requirement ?  Can driver side add logic to guaranty when  call
> > > > SET_SHADER_DEBUGGER, the process address  is always valid ?
> > >
> > > Call to flush on old fw is a NOP so it's harmless in that case.
> > > Full solution will still require a new MES version as this is a
> > > workaround on corner cases and not a new feature i.e. we can't stop
> > > ROCm from running on old fw.
> > > The process address is always valid from the driver side.  It's the
> > > MES side of things that gets stale as mentioned in the description
> > > (passed value to MES is reused with new BO but MES doesn't refresh).
> > > i.e. MES auto refreshes it's process list assuming process queues
> > > were all drained but driver can't guarantee that
> SET_SHADER_DEBUGGER
> > (which
> > > adds to MES's process list) will get called after queues get added
> > > (in fact it's a requirements that it can be called at any time).
> > > We can attempt to defer calls these calls in the KFD, considering all
> cases.
> > > But that would be a large shift in debugger/runtime_enable/KFD code,
> > > which is already complicated and could get buggy plus it would not
> > > be intuitive at all as to why we're doing this.
> > > I think a single flag set to flush MES on process termination is a
> > > simpler compromise that shows the limitation in a more obvious way.
> > >
> > > Thanks,
> > >
> > > Jon
> > >
> > >
> > > >
> > > > Regards
> > > > Shaoyun.liu
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> > > > Eric Huang
> > > > Sent: Tuesday, December 12, 2023 12:49 PM
> > > > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-
> > > > gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix
> > > > <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish
> > > > <Harish.Kasiviswanathan@xxxxxxx>
> > > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger
> > > > process management
> > > >
> > > >
> > > > On 2023-12-11 16:16, Jonathan Kim wrote:
> > > > > MES provides the driver a call to explicitly flush stale process
> > > > > memory within the MES to avoid a race condition that results in
> > > > > a fatal memory violation.
> > > > >
> > > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> > > > address
> > > > > that represents a process context address MES uses to keep track
> > > > > of future per-process calls.
> > > > >
> > > > > Normally, MES will purge its process context list when the last
> > > > > queue has been removed.  The driver, however, can call
> > > > > SET_SHADER_DEBUGGER regardless of whether a queue has been
> > added
> > > or not.
> > > > >
> > > > > If SET_SHADER_DEBUGGER has been called with no queues as the
> > > > > last call prior to process termination, the passed process
> > > > > context address will still reside within MES.
> > > > >
> > > > > On a new process call to SET_SHADER_DEBUGGER, the driver may
> end
> > > up
> > > > > passing an identical process context address value (based on
> > > > > per-process gpu memory address) to MES but is now pointing to a
> > > > > new allocated buffer object during KFD process creation.  Since
> > > > > the MES is unaware of this, access of the passed address points
> > > > > to the stale object within MES and triggers a fatal memory violation.
> > > > >
> > > > > The solution is for KFD to explicitly flush the process context
> > > > > address from MES on process termination.
> > > > >
> > > > > Note that the flush call and the MES debugger calls use the same
> > > > > MES interface but are separated as KFD calls to avoid
> > > > > conflicting with each other.
> > > > >
> > > > > Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
> > > > > Tested-by: Alice Wong <shiwei.wong@xxxxxxx>
> > > > Reviewed-by: Eric Huang <jinhuieric.huang@xxxxxxx>
> > > > > ---
> > > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c       | 31
> > > > +++++++++++++++++++
> > > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h       | 10 +++---
> > > > >   .../amd/amdkfd/kfd_process_queue_manager.c    |  1 +
> > > > >   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
> > > > >   4 files changed, 40 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > > index e544b823abf6..e98de23250dc 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > > @@ -916,6 +916,11 @@ int
> > amdgpu_mes_set_shader_debugger(struct
> > > > amdgpu_device *adev,
> > > > >       op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> > > > >       op_input.set_shader_debugger.process_context_addr =
> > > > process_context_addr;
> > > > >       op_input.set_shader_debugger.flags.u32all = flags;
> > > > > +
> > > > > +     /* use amdgpu mes_flush_shader_debugger instead */
> > > > > +     if (op_input.set_shader_debugger.flags.process_ctx_flush)
> > > > > +             return -EINVAL;
> > > > > +
> > > > >       op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl =
> > > > spi_gdbg_per_vmid_cntl;
> > > > >       memcpy(op_input.set_shader_debugger.tcp_watch_cntl,
> > > > tcp_watch_cntl,
> > > > >
> > > > > sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
> > > > > @@ -935,6 +940,32 @@ int
> > amdgpu_mes_set_shader_debugger(struct
> > > > amdgpu_device *adev,
> > > > >       return r;
> > > > >   }
> > > > >
> > > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device
> > *adev,
> > > > > +                                  uint64_t process_context_addr) {
> > > > > +     struct mes_misc_op_input op_input = {0};
> > > > > +     int r;
> > > > > +
> > > > > +     if (!adev->mes.funcs->misc_op) {
> > > > > +             DRM_ERROR("mes flush shader debugger is not
> > supported!\n");
> > > > > +             return -EINVAL;
> > > > > +     }
> > > > > +
> > > > > +     op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> > > > > +     op_input.set_shader_debugger.process_context_addr =
> > > > process_context_addr;
> > > > > +     op_input.set_shader_debugger.flags.process_ctx_flush =
> > > > > + true;
> > > > > +
> > > > > +     amdgpu_mes_lock(&adev->mes);
> > > > > +
> > > > > +     r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
> > > > > +     if (r)
> > > > > +             DRM_ERROR("failed to set_shader_debugger\n");
> > > > > +
> > > > > +     amdgpu_mes_unlock(&adev->mes);
> > > > > +
> > > > > +     return r;
> > > > > +}
> > > > > +
> > > > >   static void
> > > > >   amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
> > > > >                              struct amdgpu_ring *ring, diff
> > > > > --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > > index 894b9b133000..7d4f93fea937 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > > > > @@ -296,9 +296,10 @@ struct mes_misc_op_input {
> > > > >                       uint64_t process_context_addr;
> > > > >                       union {
> > > > >                               struct {
> > > > > -                                     uint64_t single_memop : 1;
> > > > > -                                     uint64_t single_alu_op : 1;
> > > > > -                                     uint64_t reserved: 30;
> > > > > +                                     uint32_t single_memop : 1;
> > > > > +                                     uint32_t single_alu_op : 1;
> > > > > +                                     uint32_t reserved: 29;
> > > > > +                                     uint32_t process_ctx_flush:
> > > > > + 1;
> > > > >                               };
> > > > >                               uint32_t u32all;
> > > > >                       } flags;
> > > > > @@ -374,7 +375,8 @@ int
> amdgpu_mes_set_shader_debugger(struct
> > > > amdgpu_device *adev,
> > > > >                               const uint32_t *tcp_watch_cntl,
> > > > >                               uint32_t flags,
> > > > >                               bool trap_en);
> > > > > -
> > > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device
> > *adev,
> > > > > +                             uint64_t process_context_addr);
> > > > >   int amdgpu_mes_add_ring(struct amdgpu_device *adev, int
> gang_id,
> > > > >                       int queue_type, int idx,
> > > > >                       struct amdgpu_mes_ctx_data *ctx_data, diff
> > > > > --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > > b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > > index 77f493262e05..8e55e78fce4e 100644
> > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > > > @@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct
> > > > kfd_process_device *pdd)
> > > > >               return;
> > > > >
> > > > >       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> > > > > +     amdgpu_mes_flush_shader_debugger(dev->adev, pdd-
> > > > >proc_ctx_gpu_addr);
> > > > >       pdd->already_dequeued = true;
> > > > >   }
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > > b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > > index 1fbfd1aa987e..ec5b9ab67c5e 100644
> > > > > --- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > > +++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > > > > @@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER {
> > > > >               struct {
> > > > >                       uint32_t single_memop : 1;  /*
> > > > > SQ_DEBUG.single_memop
> > > */
> > > > >                       uint32_t single_alu_op : 1; /* SQ_DEBUG.single_alu_op
> */
> > > > > -                     uint32_t reserved : 30;
> > > > > +                     uint32_t reserved : 29;
> > > > > +                     uint32_t process_ctx_flush : 1;
> > > > >               };
> > > > >               uint32_t u32all;
> > > > >       } flags;
> > > >
> > >
> > >
> >
> >
>
>


[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux