RE: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

"Kim, Jonathan" <Jonathan.Kim@xxxxxxx> · Fri, 14 Mar 2025 22:30:53 +0000

[Public]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, March 14, 2025 5:46 PM
> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> Cc: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>
> Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by
> removing dynamic callbacks
>
> On Fri, Mar 14, 2025 at 10:43 AM Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> wrote:
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Alex Deucher <alexdeucher@xxxxxxxxx>
> > > Sent: Thursday, March 13, 2025 10:38 AM
> > > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> > > Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-
> > > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>;
> > > Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong
> > > <Jiadong.Zhu@xxxxxxx>
> > > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism
> by
> > > removing dynamic callbacks
> > >
> > > I think as long as the locking is correct, the src shouldn't matter.
> > > You just need to stop the kernel queues (and save state) and evict the
> > > user queues (since HWS is responsible for saving the MQDs of the
> > > non-guilty user queues).  If KFD detected the hang (e.g., queue
> > > eviction fails when called for memory pressure, etc.), we just need to
> > > make sure that it's ok for the sdma reset routine to call evict queues
> > > again even if it was already called (presumably it should exit early
> > > since the queues are already evicted).  If KGD initiates the reset, it
> > > will call into KFD to evict queues.  We just need to make sure the
> > > evict queues function we call just evicts the queues and doesn't also
> > > try and reset.
> >
> > If we're removing the src parameter, we need to be careful we don't end up in a
> double lock scenario in the KFD.
> > i.e. kgd inits reset -> kfd detects hang on kgd reset trigger and calls back to kgd -
> > amdgpu_amdkfd_suspend gets called again but is blocked on previous suspend
> call from original kgd reset (which is holding a bunch of KFD locks) while KFD is
> trying to clean up immediately.
> >
>
> How would this work even with the src parameter?  I think you could
> still get into that case.  E.g., KGD detects a hang and initiates an
> engine reset.  In parallel, KFD tries to unmap all queues for some
> reason and detects a hang, so it tries to reset SDMA.  Assuming there
> is a lock that protects SDMA reset, that should work.  However, it
> requires that the prerequisites on each side don't attempt to reset
> anything.

If the KGD succeeds in suspending the KFD, then KFD will be quiesced and no KFD reset is possible.
If the KFD is trying to reset inside a KGD initiated reset due to a KFD suspend call failure or is resetting in parallel, the current and any future KGD resets calls will be blocked on KFD suspend calls due to KFD locks (assuming suspend is the first action and suspend_user_queues == true).
KFD will then always reset with suspend_user_queues == false (and always hold a device/suspend or reset lock during this time), which avoids the KFD suspend on the KGD callback and avoids deadlocking itself.  Any other KGD future calls should be blocked on their own calls to KFD suspend due to KFD locking until the KFD is done resetting, avoiding any reset deadlocks.

>
> sdma_engine_reset()
> {
>
> KFD pre reset requirements
> 1. unmap all SDMA queues (CP firmware will save non-guilty state in MQDs)
> 2. detect guilty queue
>
> KGD pre reset requirements:
> 1. stop relevant drm schedulers
> 2. detect guilt queue
> 3. save non-guilty queue state
>
> Do engine reset
>
> KGD post reset requirements:
> 1. restore non-guilty queue state
> 2. start relevant drm schedulers
>
> KFD post reset requirements
> 1. map all non-guilty SDMA queues
>
> }
>
> I think what we need on the KFD side, if we don't have it already, is
> a function to just umap queues and update guilty state, but not to
> attempt to reset anything.  Then on the KFD side, in your normal
> flows, you could call this function to unamp queues and update queue
> state, and then after calling that walk through the queue state and
> trigger any resets based on queues flagged as bad.  On the KFD side,

I'd have to look at this more carefully, but it seems like you're suggesting the KFD to defer reset on preemption failure via scheduling similar to GPU resets.
The problem is future map/unmap request in the KFD are user requested so they're unpredictable.
GPU resets could just "flick-away" user queue update request failures subsequent to the initial preemption failure because the device is dead and all HSA processes have to die at some point.
You can't do this with SDMA resets since other processes are still good so the KFD would have to immediately lock itself down (which could be problematic during preemption routines) if it wants to unblock and schedule the reset later.

Jon

> in a normal flow you will have called this unmap and update state
> function and now you have a list of bad queues.  You can then initiate
> an engine reset for the engine the bad queue is on.  This is safe
> because you've already unmapped the queues, so if the unmap queues
> function gets called again from the sdma reset function, it will
> return early as the queues are already unmapped and marked guilty if
> they are.  The engine will reset, the reset sdma reset function will
> clean up the KGD side and then call the KFD map_queues().  Once it
> returns you are done.  If KGD detects the hang, it will call the sdma
> reset function and that will call the KFD unmap queues and update
> state function.  This will update the KFD side state, but not initiate
> any resets.  The engine will then be reset and then the KGD state will
> be restored and finally it will call the KFD map_queues() function to
> remap the non-guilty queues.  At completion both sides should be
> functional again.  I'm not too familiar with the KFD unmap and reset
> flows, but I think they will need to be decoupled if they are
> currently intermixed.
>
> Alex
>
> }
>
>
> > Safest way to remove the parameter seems like to move the KFD
> suspend/restore outside of the common call and have KGD call suspend/resume
> when it's calling the common call itself.
> >
>
>
>
> > Jon
> >
> > >
> > > Alex
> > >
> > > On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie)
> <Jesse.Zhang@xxxxxxx>
> > > wrote:
> > > >
> > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> > > > Sent: Wednesday, March 12, 2025 4:39 PM
> > > > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kim, Jonathan
> > > <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>
> > > > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism
> by
> > > removing dynamic callbacks
> > > >
> > > >
> > > >
> > > > Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
> > > >
> > > > [SNIP9
> > > >
> > > > -
> > > >
> > > > +     gfx_ring->funcs->stop_queue(adev, instance_id);
> > > >
> > > >
> > > >
> > > > Yeah that starts to look good. Question here is who is calling
> > > amdgpu_sdma_reset_engine()?
> > > >
> > > >
> > > >
> > > > If this call comes from engine specific code we might not need the
> > > start/stop_queue callbacks all together.
> > > >
> > > >
> > > >
> > > >     Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and
> > > start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
> > > >
> > > >
> > > > Why would the KFD call this as well? Because it detects an issue with a
> SDMA
> > > user queue  If yes I would rather suggest that the KFD calls the reset function
> of
> > > the paging queue.
> > > >
> > > > Since this reset function is specific to the SDMA HW generation anyway you
> don't
> > > need those extra functions to abstract starting and stopping of the queue for
> each
> > > HW generation.
> > > >
> > > > kfd can't call reset function directly, unless we add a parameter src  to
> distinguish
> > > kfd and kgd in reset function, like this:
> > > >
> > > > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src );
> > > >
> > > > As Alex said in another thread,
> > > >
> > > > We need to distinguish  kfd and kgd  in reset.
> > > >
> > > > If kfd triggers a reset, kgd must save healthy jobs and recover jobs after
> reset.
> > > >
> > > > If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps
> kfd
> > > needs to save its healthy jobs for recovery).
> > > >
> > > >
> > > >
> > > > If we can add a parameter, I am ok for that solution too.
> > > >
> > > >
> > > >
> > > > Additionally:
> > > >
> > > > For sdma6/7, when a queue reset fails, we may need to fall back to an engine
> > > reset for a attempt.
> > > >
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Jesse
> > > >
> > > >
> > > > Regards,
> > > > Christian.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Jesse
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Christian.
> > > >
> > > >
> > > >
> > > >       /* Perform the SDMA reset for the specified instance */
> > > >
> > > >       ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
> > > >
> > > >       if (ret) {
> > > >
> > > > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct
> amdgpu_device
> > > *adev, uint32_t instance_id, b
> > > >
> > > >               goto exit;
> > > >
> > > >       }
> > > >
> > > >
> > > >
> > > > -     /* Invoke all registered post_reset callbacks */
> > > >
> > > > -     list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {
> > > >
> > > > -             if (funcs->post_reset) {
> > > >
> > > > -                     ret = funcs->post_reset(adev, instance_id);
> > > >
> > > > -                     if (ret) {
> > > >
> > > > -                             dev_err(adev->dev,
> > > >
> > > > -                             "afterReset callback failed for instance %u: %d\n",
> > > >
> > > > -                                     instance_id, ret);
> > > >
> > > > -                             goto exit;
> > > >
> > > > -                     }
> > > >
> > > > -             }
> > > >
> > > > -     }
> > > >
> > > > +     gfx_ring->funcs->start_queue(adev, instance_id);
> > > >
> > > >
> > > >
> > > >  exit:
> > > >
> > > >       /* Restart the scheduler's work queue for the GFX and page rings
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > > >
> > > > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > > >
> > > > index fd34dc138081..c1f7ccff9c4e 100644
> > > >
> > > > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > > >
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > > >
> > > > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs
> > > sdma_v4_4_2_ring_funcs = {
> > > >
> > > >       .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
> > > >
> > > >       .emit_reg_write_reg_wait =
> amdgpu_ring_emit_reg_write_reg_wait_helper,
> > > >
> > > >       .reset = sdma_v4_4_2_reset_queue,
> > > >
> > > > +     .stop_queue = sdma_v4_4_2_stop_queue,
> > > >
> > > > +     .start_queue = sdma_v4_4_2_restore_queue,
> > > >
> > > >       .is_guilty = sdma_v4_4_2_ring_is_guilty,  };
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >