[AMD Official Use Only - AMD Internal Distribution Only] -----Original Message----- From: Koenig, Christian <Christian.Koenig@xxxxxxx> Sent: Wednesday, March 12, 2025 4:07 PM To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx> Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks Am 12.03.25 um 08:59 schrieb Jesse.zhang@xxxxxxx: > From: "Jesse.zhang@xxxxxxx" <Jesse.zhang@xxxxxxx> > > Since KFD no longer registers its own callbacks for SDMA resets, and > only KGD uses the reset mechanism, we can simplify the SDMA reset flow by directly calling the ring's `stop_queue` and `start_queue` functions. > This patch removes the dynamic callback mechanism and prepares for its eventual deprecation. > > The current SDMA reset mechanism uses a dynamic callback registration system to allow both KFD and KGD to register pre-reset and post-reset functions. > However, KFD no longer registers its callbacks, and the reset logic is now entirely handled by KGD. > This makes the dynamic callback mechanism unnecessary and adds complexity to the code. > > 1. **Remove Dynamic Callbacks**: > - The `pre_reset` and `post_reset` callback invocations in `amdgpu_sdma_reset_engine` are removed. > - Instead, the ring's `stop_queue` and `start_queue` functions are called directly during the reset process. > > 2. **Add `stop_queue` and `start_queue` to Ring Functions**: > - The `amdgpu_ring_funcs` structure is extended to include `stop_queue` and `start_queue` functions. > - These functions are implemented in the SDMA version-specific code (e.g., `sdma_v4_4_2_stop_queue` and `sdma_v4_4_2_restore_queue`). > > 3. **Prepare for Deprecation of Dynamic Mechanism**: > - By removing the callback invocations, this patch prepares the codebase for the eventual removal of the dynamic callback registration mechanism. > > Signed-off-by: Jesse Zhang <Jesse.Zhang@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 33 ++---------------------- > drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 2 ++ > 3 files changed, 6 insertions(+), 31 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > index b4fd1e17205e..1c52ff92ea26 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > @@ -237,6 +237,8 @@ struct amdgpu_ring_funcs { > void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset); > void (*patch_de)(struct amdgpu_ring *ring, unsigned offset); > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid); > + int (*stop_queue)(struct amdgpu_device *adev, uint32_t instance_id); > + int (*start_queue)(struct amdgpu_device *adev, uint32_t > +instance_id); > void (*emit_cleaner_shader)(struct amdgpu_ring *ring); > bool (*is_guilty)(struct amdgpu_ring *ring); }; diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c > index 39669f8788a7..b1f4a11ad0c9 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c > @@ -534,16 +534,10 @@ void amdgpu_sdma_register_on_reset_callbacks(struct amdgpu_device *adev, struct > * @instance_id: ID of the SDMA engine instance to reset > * @suspend_user_queues: check if suspend user queue. > * > - * This function performs the following steps: > - * 1. Calls all registered pre_reset callbacks to allow KFD and AMDGPU to save their state. > - * 2. Resets the specified SDMA engine instance. > - * 3. Calls all registered post_reset callbacks to allow KFD and AMDGPU to restore their state. > - * > * Returns: 0 on success, or a negative error code on failure. > */ > int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t > instance_id, bool suspend_user_queues) { > - struct sdma_on_reset_funcs *funcs; > int ret = 0; > struct amdgpu_sdma_instance *sdma_instance = &adev->sdma.instance[instance_id];; > struct amdgpu_ring *gfx_ring = &sdma_instance->ring; @@ -571,19 > +565,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id, b > page_sched_stopped = true; > } > > - /* Invoke all registered pre_reset callbacks */ > - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) { > - if (funcs->pre_reset) { > - ret = funcs->pre_reset(adev, instance_id); > - if (ret) { > - dev_err(adev->dev, > - "beforeReset callback failed for instance %u: %d\n", > - instance_id, ret); > - goto exit; > - } > - } > - } > - > + gfx_ring->funcs->stop_queue(adev, instance_id); Yeah that starts to look good. Question here is who is calling amdgpu_sdma_reset_engine()? If this call comes from engine specific code we might not need the start/stop_queue callbacks all together. Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2. Thanks Jesse Regards, Christian. > /* Perform the SDMA reset for the specified instance */ > ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id); > if (ret) { > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id, b > goto exit; > } > > - /* Invoke all registered post_reset callbacks */ > - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) { > - if (funcs->post_reset) { > - ret = funcs->post_reset(adev, instance_id); > - if (ret) { > - dev_err(adev->dev, > - "afterReset callback failed for instance %u: %d\n", > - instance_id, ret); > - goto exit; > - } > - } > - } > + gfx_ring->funcs->start_queue(adev, instance_id); > > exit: > /* Restart the scheduler's work queue for the GFX and page rings > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > index fd34dc138081..c1f7ccff9c4e 100644 > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = { > .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait, > .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper, > .reset = sdma_v4_4_2_reset_queue, > + .stop_queue = sdma_v4_4_2_stop_queue, > + .start_queue = sdma_v4_4_2_restore_queue, > .is_guilty = sdma_v4_4_2_ring_is_guilty, }; >