[Public] > -----Original Message----- > From: Alex Deucher <alexdeucher@xxxxxxxxx> > Sent: Thursday, March 13, 2025 10:38 AM > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx> > Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; > Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong > <Jiadong.Zhu@xxxxxxx> > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by > removing dynamic callbacks > > I think as long as the locking is correct, the src shouldn't matter. > You just need to stop the kernel queues (and save state) and evict the > user queues (since HWS is responsible for saving the MQDs of the > non-guilty user queues). If KFD detected the hang (e.g., queue > eviction fails when called for memory pressure, etc.), we just need to > make sure that it's ok for the sdma reset routine to call evict queues > again even if it was already called (presumably it should exit early > since the queues are already evicted). If KGD initiates the reset, it > will call into KFD to evict queues. We just need to make sure the > evict queues function we call just evicts the queues and doesn't also > try and reset. If we're removing the src parameter, we need to be careful we don't end up in a double lock scenario in the KFD. i.e. kgd inits reset -> kfd detects hang on kgd reset trigger and calls back to kgd -> amdgpu_amdkfd_suspend gets called again but is blocked on previous suspend call from original kgd reset (which is holding a bunch of KFD locks) while KFD is trying to clean up immediately. Safest way to remove the parameter seems like to move the KFD suspend/restore outside of the common call and have KGD call suspend/resume when it's calling the common call itself. Jon > > Alex > > On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx> > wrote: > > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > > > > > > > > From: Koenig, Christian <Christian.Koenig@xxxxxxx> > > Sent: Wednesday, March 12, 2025 4:39 PM > > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kim, Jonathan > <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx> > > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by > removing dynamic callbacks > > > > > > > > Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie): > > > > [SNIP9 > > > > - > > > > + gfx_ring->funcs->stop_queue(adev, instance_id); > > > > > > > > Yeah that starts to look good. Question here is who is calling > amdgpu_sdma_reset_engine()? > > > > > > > > If this call comes from engine specific code we might not need the > start/stop_queue callbacks all together. > > > > > > > > Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and > start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2. > > > > > > Why would the KFD call this as well? Because it detects an issue with a SDMA > user queue If yes I would rather suggest that the KFD calls the reset function of > the paging queue. > > > > Since this reset function is specific to the SDMA HW generation anyway you don't > need those extra functions to abstract starting and stopping of the queue for each > HW generation. > > > > kfd can't call reset function directly, unless we add a parameter src to distinguish > kfd and kgd in reset function, like this: > > > > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src ); > > > > As Alex said in another thread, > > > > We need to distinguish kfd and kgd in reset. > > > > If kfd triggers a reset, kgd must save healthy jobs and recover jobs after reset. > > > > If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd > needs to save its healthy jobs for recovery). > > > > > > > > If we can add a parameter, I am ok for that solution too. > > > > > > > > Additionally: > > > > For sdma6/7, when a queue reset fails, we may need to fall back to an engine > reset for a attempt. > > > > > > > > Thanks > > > > Jesse > > > > > > Regards, > > Christian. > > > > > > > > > > > > > > Thanks > > > > Jesse > > > > > > > > Regards, > > > > Christian. > > > > > > > > /* Perform the SDMA reset for the specified instance */ > > > > ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id); > > > > if (ret) { > > > > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device > *adev, uint32_t instance_id, b > > > > goto exit; > > > > } > > > > > > > > - /* Invoke all registered post_reset callbacks */ > > > > - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) { > > > > - if (funcs->post_reset) { > > > > - ret = funcs->post_reset(adev, instance_id); > > > > - if (ret) { > > > > - dev_err(adev->dev, > > > > - "afterReset callback failed for instance %u: %d\n", > > > > - instance_id, ret); > > > > - goto exit; > > > > - } > > > > - } > > > > - } > > > > + gfx_ring->funcs->start_queue(adev, instance_id); > > > > > > > > exit: > > > > /* Restart the scheduler's work queue for the GFX and page rings > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > > > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > > > index fd34dc138081..c1f7ccff9c4e 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > > > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > > > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs > sdma_v4_4_2_ring_funcs = { > > > > .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait, > > > > .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper, > > > > .reset = sdma_v4_4_2_reset_queue, > > > > + .stop_queue = sdma_v4_4_2_stop_queue, > > > > + .start_queue = sdma_v4_4_2_restore_queue, > > > > .is_guilty = sdma_v4_4_2_ring_is_guilty, }; > > > > > > > > > > > >