[AMD Official Use Only - General] Does this requires the new MES FW for this process_ctx_flush requirement ? Can driver side add logic to guaranty when call SET_SHADER_DEBUGGER, the process address is always valid ? Regards Shaoyun.liu -----Original Message----- From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Eric Huang Sent: Tuesday, December 12, 2023 12:49 PM To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Wong, Alice <Shiwei.Wong@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx> Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management On 2023-12-11 16:16, Jonathan Kim wrote: > MES provides the driver a call to explicitly flush stale process > memory within the MES to avoid a race condition that results in a > fatal memory violation. > > When SET_SHADER_DEBUGGER is called, the driver passes a memory address > that represents a process context address MES uses to keep track of > future per-process calls. > > Normally, MES will purge its process context list when the last queue > has been removed. The driver, however, can call SET_SHADER_DEBUGGER > regardless of whether a queue has been added or not. > > If SET_SHADER_DEBUGGER has been called with no queues as the last call > prior to process termination, the passed process context address will > still reside within MES. > > On a new process call to SET_SHADER_DEBUGGER, the driver may end up > passing an identical process context address value (based on > per-process gpu memory address) to MES but is now pointing to a new > allocated buffer object during KFD process creation. Since the MES is > unaware of this, access of the passed address points to the stale > object within MES and triggers a fatal memory violation. > > The solution is for KFD to explicitly flush the process context > address from MES on process termination. > > Note that the flush call and the MES debugger calls use the same MES > interface but are separated as KFD calls to avoid conflicting with > each other. > > Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx> > Tested-by: Alice Wong <shiwei.wong@xxxxxxx> Reviewed-by: Eric Huang <jinhuieric.huang@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 +++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++--- > .../amd/amdkfd/kfd_process_queue_manager.c | 1 + > drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +- > 4 files changed, 40 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > index e544b823abf6..e98de23250dc 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev, > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > op_input.set_shader_debugger.process_context_addr = process_context_addr; > op_input.set_shader_debugger.flags.u32all = flags; > + > + /* use amdgpu mes_flush_shader_debugger instead */ > + if (op_input.set_shader_debugger.flags.process_ctx_flush) > + return -EINVAL; > + > op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = spi_gdbg_per_vmid_cntl; > memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl, > sizeof(op_input.set_shader_debugger.tcp_watch_cntl)); > @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev, > return r; > } > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev, > + uint64_t process_context_addr) { > + struct mes_misc_op_input op_input = {0}; > + int r; > + > + if (!adev->mes.funcs->misc_op) { > + DRM_ERROR("mes flush shader debugger is not supported!\n"); > + return -EINVAL; > + } > + > + op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > + op_input.set_shader_debugger.process_context_addr = process_context_addr; > + op_input.set_shader_debugger.flags.process_ctx_flush = true; > + > + amdgpu_mes_lock(&adev->mes); > + > + r = adev->mes.funcs->misc_op(&adev->mes, &op_input); > + if (r) > + DRM_ERROR("failed to set_shader_debugger\n"); > + > + amdgpu_mes_unlock(&adev->mes); > + > + return r; > +} > + > static void > amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev, > struct amdgpu_ring *ring, > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > index 894b9b133000..7d4f93fea937 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > @@ -296,9 +296,10 @@ struct mes_misc_op_input { > uint64_t process_context_addr; > union { > struct { > - uint64_t single_memop : 1; > - uint64_t single_alu_op : 1; > - uint64_t reserved: 30; > + uint32_t single_memop : 1; > + uint32_t single_alu_op : 1; > + uint32_t reserved: 29; > + uint32_t process_ctx_flush: 1; > }; > uint32_t u32all; > } flags; > @@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev, > const uint32_t *tcp_watch_cntl, > uint32_t flags, > bool trap_en); > - > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev, > + uint64_t process_context_addr); > int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id, > int queue_type, int idx, > struct amdgpu_mes_ctx_data *ctx_data, diff --git > a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > index 77f493262e05..8e55e78fce4e 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > @@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd) > return; > > dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd); > + amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr); > pdd->already_dequeued = true; > } > > diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h > b/drivers/gpu/drm/amd/include/mes_v11_api_def.h > index 1fbfd1aa987e..ec5b9ab67c5e 100644 > --- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h > +++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h > @@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER { > struct { > uint32_t single_memop : 1; /* SQ_DEBUG.single_memop */ > uint32_t single_alu_op : 1; /* SQ_DEBUG.single_alu_op */ > - uint32_t reserved : 30; > + uint32_t reserved : 29; > + uint32_t process_ctx_flush : 1; > }; > uint32_t u32all; > } flags;