RE: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Thursday, March 13, 2025 10:38 AM
> To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>;
> Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong
> <Jiadong.Zhu@xxxxxxx>
> Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by
> removing dynamic callbacks
>
> I think as long as the locking is correct, the src shouldn't matter.
> You just need to stop the kernel queues (and save state) and evict the
> user queues (since HWS is responsible for saving the MQDs of the
> non-guilty user queues).  If KFD detected the hang (e.g., queue
> eviction fails when called for memory pressure, etc.), we just need to
> make sure that it's ok for the sdma reset routine to call evict queues
> again even if it was already called (presumably it should exit early
> since the queues are already evicted).  If KGD initiates the reset, it
> will call into KFD to evict queues.  We just need to make sure the
> evict queues function we call just evicts the queues and doesn't also
> try and reset.

If we're removing the src parameter, we need to be careful we don't end up in a double lock scenario in the KFD.
i.e. kgd inits reset -> kfd detects hang on kgd reset trigger and calls back to kgd -> amdgpu_amdkfd_suspend gets called again but is blocked on previous suspend call from original kgd reset (which is holding a bunch of KFD locks) while KFD is trying to clean up immediately.

Safest way to remove the parameter seems like to move the KFD suspend/restore outside of the common call and have KGD call suspend/resume when it's calling the common call itself.

Jon

>
> Alex
>
> On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> wrote:
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >
> >
> >
> >
> >
> > From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> > Sent: Wednesday, March 12, 2025 4:39 PM
> > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kim, Jonathan
> <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>
> > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by
> removing dynamic callbacks
> >
> >
> >
> > Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
> >
> > [SNIP9
> >
> > -
> >
> > +     gfx_ring->funcs->stop_queue(adev, instance_id);
> >
> >
> >
> > Yeah that starts to look good. Question here is who is calling
> amdgpu_sdma_reset_engine()?
> >
> >
> >
> > If this call comes from engine specific code we might not need the
> start/stop_queue callbacks all together.
> >
> >
> >
> >     Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and
> start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
> >
> >
> > Why would the KFD call this as well? Because it detects an issue with a SDMA
> user queue  If yes I would rather suggest that the KFD calls the reset function of
> the paging queue.
> >
> > Since this reset function is specific to the SDMA HW generation anyway you don't
> need those extra functions to abstract starting and stopping of the queue for each
> HW generation.
> >
> > kfd can't call reset function directly, unless we add a parameter src  to distinguish
> kfd and kgd in reset function, like this:
> >
> > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src );
> >
> > As Alex said in another thread,
> >
> > We need to distinguish  kfd and kgd  in reset.
> >
> > If kfd triggers a reset, kgd must save healthy jobs and recover jobs after reset.
> >
> > If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd
> needs to save its healthy jobs for recovery).
> >
> >
> >
> > If we can add a parameter, I am ok for that solution too.
> >
> >
> >
> > Additionally:
> >
> > For sdma6/7, when a queue reset fails, we may need to fall back to an engine
> reset for a attempt.
> >
> >
> >
> > Thanks
> >
> > Jesse
> >
> >
> > Regards,
> > Christian.
> >
> >
> >
> >
> >
> >
> > Thanks
> >
> > Jesse
> >
> >
> >
> > Regards,
> >
> > Christian.
> >
> >
> >
> >       /* Perform the SDMA reset for the specified instance */
> >
> >       ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
> >
> >       if (ret) {
> >
> > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device
> *adev, uint32_t instance_id, b
> >
> >               goto exit;
> >
> >       }
> >
> >
> >
> > -     /* Invoke all registered post_reset callbacks */
> >
> > -     list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {
> >
> > -             if (funcs->post_reset) {
> >
> > -                     ret = funcs->post_reset(adev, instance_id);
> >
> > -                     if (ret) {
> >
> > -                             dev_err(adev->dev,
> >
> > -                             "afterReset callback failed for instance %u: %d\n",
> >
> > -                                     instance_id, ret);
> >
> > -                             goto exit;
> >
> > -                     }
> >
> > -             }
> >
> > -     }
> >
> > +     gfx_ring->funcs->start_queue(adev, instance_id);
> >
> >
> >
> >  exit:
> >
> >       /* Restart the scheduler's work queue for the GFX and page rings
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> >
> > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> >
> > index fd34dc138081..c1f7ccff9c4e 100644
> >
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> >
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> >
> > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs
> sdma_v4_4_2_ring_funcs = {
> >
> >       .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
> >
> >       .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
> >
> >       .reset = sdma_v4_4_2_reset_queue,
> >
> > +     .stop_queue = sdma_v4_4_2_stop_queue,
> >
> > +     .start_queue = sdma_v4_4_2_restore_queue,
> >
> >       .is_guilty = sdma_v4_4_2_ring_is_guilty,  };
> >
> >
> >
> >
> >
> >




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux