Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

Alex Deucher <alexdeucher@xxxxxxxxx> · Fri, 14 Mar 2025 17:46:27 -0400

On Fri, Mar 14, 2025 at 10:43 AM Kim, Jonathan <Jonathan.Kim@xxxxxxx> wrote:
>
> [Public]
>
> > -----Original Message-----
> > From: Alex Deucher <alexdeucher@xxxxxxxxx>
> > Sent: Thursday, March 13, 2025 10:38 AM
> > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> > Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-
> > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>;
> > Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong
> > <Jiadong.Zhu@xxxxxxx>
> > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by
> > removing dynamic callbacks
> >
> > I think as long as the locking is correct, the src shouldn't matter.
> > You just need to stop the kernel queues (and save state) and evict the
> > user queues (since HWS is responsible for saving the MQDs of the
> > non-guilty user queues).  If KFD detected the hang (e.g., queue
> > eviction fails when called for memory pressure, etc.), we just need to
> > make sure that it's ok for the sdma reset routine to call evict queues
> > again even if it was already called (presumably it should exit early
> > since the queues are already evicted).  If KGD initiates the reset, it
> > will call into KFD to evict queues.  We just need to make sure the
> > evict queues function we call just evicts the queues and doesn't also
> > try and reset.
>
> If we're removing the src parameter, we need to be careful we don't end up in a double lock scenario in the KFD.
> i.e. kgd inits reset -> kfd detects hang on kgd reset trigger and calls back to kgd -> amdgpu_amdkfd_suspend gets called again but is blocked on previous suspend call from original kgd reset (which is holding a bunch of KFD locks) while KFD is trying to clean up immediately.
>

How would this work even with the src parameter?  I think you could
still get into that case.  E.g., KGD detects a hang and initiates an
engine reset.  In parallel, KFD tries to unmap all queues for some
reason and detects a hang, so it tries to reset SDMA.  Assuming there
is a lock that protects SDMA reset, that should work.  However, it
requires that the prerequisites on each side don't attempt to reset
anything.

sdma_engine_reset()
{

KFD pre reset requirements
1. unmap all SDMA queues (CP firmware will save non-guilty state in MQDs)
2. detect guilty queue

KGD pre reset requirements:
1. stop relevant drm schedulers
2. detect guilt queue
3. save non-guilty queue state

Do engine reset

KGD post reset requirements:
1. restore non-guilty queue state
2. start relevant drm schedulers

KFD post reset requirements
1. map all non-guilty SDMA queues

}

I think what we need on the KFD side, if we don't have it already, is
a function to just umap queues and update guilty state, but not to
attempt to reset anything.  Then on the KFD side, in your normal
flows, you could call this function to unamp queues and update queue
state, and then after calling that walk through the queue state and
trigger any resets based on queues flagged as bad.  On the KFD side,
in a normal flow you will have called this unmap and update state
function and now you have a list of bad queues.  You can then initiate
an engine reset for the engine the bad queue is on.  This is safe
because you've already unmapped the queues, so if the unmap queues
function gets called again from the sdma reset function, it will
return early as the queues are already unmapped and marked guilty if
they are.  The engine will reset, the reset sdma reset function will
clean up the KGD side and then call the KFD map_queues().  Once it
returns you are done.  If KGD detects the hang, it will call the sdma
reset function and that will call the KFD unmap queues and update
state function.  This will update the KFD side state, but not initiate
any resets.  The engine will then be reset and then the KGD state will
be restored and finally it will call the KFD map_queues() function to
remap the non-guilty queues.  At completion both sides should be
functional again.  I'm not too familiar with the KFD unmap and reset
flows, but I think they will need to be decoupled if they are
currently intermixed.

Alex

}

> Safest way to remove the parameter seems like to move the KFD suspend/restore outside of the common call and have KGD call suspend/resume when it's calling the common call itself.
>

> Jon
>
> >
> > Alex
> >
> > On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> > wrote:
> > >
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > >
> > >
> > >
> > >
> > >
> > > From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> > > Sent: Wednesday, March 12, 2025 4:39 PM
> > > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kim, Jonathan
> > <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>
> > > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by
> > removing dynamic callbacks
> > >
> > >
> > >
> > > Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
> > >
> > > [SNIP9
> > >
> > > -
> > >
> > > +     gfx_ring->funcs->stop_queue(adev, instance_id);
> > >
> > >
> > >
> > > Yeah that starts to look good. Question here is who is calling
> > amdgpu_sdma_reset_engine()?
> > >
> > >
> > >
> > > If this call comes from engine specific code we might not need the
> > start/stop_queue callbacks all together.
> > >
> > >
> > >
> > >     Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and
> > start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
> > >
> > >
> > > Why would the KFD call this as well? Because it detects an issue with a SDMA
> > user queue  If yes I would rather suggest that the KFD calls the reset function of
> > the paging queue.
> > >
> > > Since this reset function is specific to the SDMA HW generation anyway you don't
> > need those extra functions to abstract starting and stopping of the queue for each
> > HW generation.
> > >
> > > kfd can't call reset function directly, unless we add a parameter src  to distinguish
> > kfd and kgd in reset function, like this:
> > >
> > > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src );
> > >
> > > As Alex said in another thread,
> > >
> > > We need to distinguish  kfd and kgd  in reset.
> > >
> > > If kfd triggers a reset, kgd must save healthy jobs and recover jobs after reset.
> > >
> > > If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd
> > needs to save its healthy jobs for recovery).
> > >
> > >
> > >
> > > If we can add a parameter, I am ok for that solution too.
> > >
> > >
> > >
> > > Additionally:
> > >
> > > For sdma6/7, when a queue reset fails, we may need to fall back to an engine
> > reset for a attempt.
> > >
> > >
> > >
> > > Thanks
> > >
> > > Jesse
> > >
> > >
> > > Regards,
> > > Christian.
> > >
> > >
> > >
> > >
> > >
> > >
> > > Thanks
> > >
> > > Jesse
> > >
> > >
> > >
> > > Regards,
> > >
> > > Christian.
> > >
> > >
> > >
> > >       /* Perform the SDMA reset for the specified instance */
> > >
> > >       ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
> > >
> > >       if (ret) {
> > >
> > > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device
> > *adev, uint32_t instance_id, b
> > >
> > >               goto exit;
> > >
> > >       }
> > >
> > >
> > >
> > > -     /* Invoke all registered post_reset callbacks */
> > >
> > > -     list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {
> > >
> > > -             if (funcs->post_reset) {
> > >
> > > -                     ret = funcs->post_reset(adev, instance_id);
> > >
> > > -                     if (ret) {
> > >
> > > -                             dev_err(adev->dev,
> > >
> > > -                             "afterReset callback failed for instance %u: %d\n",
> > >
> > > -                                     instance_id, ret);
> > >
> > > -                             goto exit;
> > >
> > > -                     }
> > >
> > > -             }
> > >
> > > -     }
> > >
> > > +     gfx_ring->funcs->start_queue(adev, instance_id);
> > >
> > >
> > >
> > >  exit:
> > >
> > >       /* Restart the scheduler's work queue for the GFX and page rings
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > >
> > > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > >
> > > index fd34dc138081..c1f7ccff9c4e 100644
> > >
> > > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > >
> > > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > >
> > > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs
> > sdma_v4_4_2_ring_funcs = {
> > >
> > >       .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
> > >
> > >       .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
> > >
> > >       .reset = sdma_v4_4_2_reset_queue,
> > >
> > > +     .stop_queue = sdma_v4_4_2_stop_queue,
> > >
> > > +     .start_queue = sdma_v4_4_2_restore_queue,
> > >
> > >       .is_guilty = sdma_v4_4_2_ring_is_guilty,  };
> > >
> > >
> > >
> > >
> > >
> > >