On Wed, Aug 21, 2024 at 5:59 PM Felix Kuehling <felix.kuehling@xxxxxxx> wrote:
>
>
> On 2024-08-20 16:25, Alex Deucher wrote:
> > Pending extended validation.
> >
> > Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 4 ++++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c             | 4 ++++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c           | 6 ++++++
> >  3 files changed, 14 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> > index c63528a4e8941..1254a43ec96b6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> > @@ -1151,6 +1151,10 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct amdgpu_device *adev,
> >       uint32_t low, high;
> >       uint64_t queue_addr = 0;
> >
> > +     if (!adev->debug_exp_resets &&
> > +         !adev->gfx.num_gfx_rings)
> > +             return 0;
> > +
>
> Did you put this in the HW-specific code path intentionally? If you want
> this check to apply to all ASICs, you should put it into
> detect_queue_hang in kfd_device_queue_manager.c. But maybe the extended
> validation is HW-specific.

I only want to apply it to MI parts at this point. We will likely
have a different default on other parts.

Alex

>
> Either way, the patch is
>
> Acked-by: Felix Kuehling <felix.kuehling@xxxxxxx>
>
>
> >       kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
> >       amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index 21089aadbb7b4..8cf5d7925b51c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -7233,6 +7233,10 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
> >       unsigned long flags;
> >       int i, r;
> >
> > +     if (!adev->debug_exp_resets &&
> > +         !adev->gfx.num_gfx_rings)
> > +             return -EINVAL;
> > +
> >       if (amdgpu_sriov_vf(adev))
> >               return -EINVAL;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > index 2067f26d3a9d8..f8649546b9c4c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > @@ -3052,6 +3052,9 @@ static void gfx_v9_4_3_ring_soft_recovery(struct amdgpu_ring *ring,
> >       struct amdgpu_device *adev = ring->adev;
> >       uint32_t value = 0;
> >
> > +     if (!adev->debug_exp_resets)
> > +             return;
> > +
> >       value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
> >       value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
> >       value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
> > @@ -3475,6 +3478,9 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
> >       unsigned long flags;
> >       int r, i;
> >
> > +     if (!adev->debug_exp_resets)
> > +             return -EINVAL;
> > +
> >       if (amdgpu_sriov_vf(adev))
> >               return -EINVAL;
> >
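
The gating added by the quoted hunks can be modeled outside the kernel. What follows is a minimal user-space sketch, not driver code: `struct mock_adev` and the two `mock_*` functions are hypothetical names, and only the `debug_exp_resets` and `num_gfx_rings` fields mirror the patch. It assumes that `num_gfx_rings == 0` is what identifies a compute-only (MI) part, which seems consistent with the stated intent of limiting the check to MI parts.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two amdgpu_device fields the patch tests. */
struct mock_adev {
	bool debug_exp_resets;      /* experimental resets explicitly enabled */
	unsigned int num_gfx_rings; /* assumed to be 0 on compute-only parts */
};

/*
 * Mirrors the condition added to gfx_v9_0_reset_kcq(): the reset is
 * refused only when experimental resets are off AND there are no gfx rings.
 */
static int mock_gfx_v9_0_reset_kcq(const struct mock_adev *adev)
{
	if (!adev->debug_exp_resets && !adev->num_gfx_rings)
		return -EINVAL;
	return 0; /* the real function would continue with the queue reset */
}

/*
 * Mirrors the condition added to gfx_v9_4_3_reset_kcq(): the reset is
 * refused whenever experimental resets are off, regardless of ring count.
 */
static int mock_gfx_v9_4_3_reset_kcq(const struct mock_adev *adev)
{
	if (!adev->debug_exp_resets)
		return -EINVAL;
	return 0;
}
```

Under these assumptions, a part with gfx rings is unaffected by the gfx_v9_0.c check even with `debug_exp_resets` off, while the gfx_v9_4_3.c check refuses the reset on every part until the flag is set.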