[AMD Official Use Only - General]

Hi Andrey,

The problem with status.hang is that it is set in amdgpu_device_ip_check_soft_reset, which is not implemented for nv or gfx10. Those implementations would have to be added properly first.

Another option I considered is to mark status.hang, or to add a flag to amdgpu_gfx, when a job timeout is reported on a gfx/compute ring. That would require some logic to map rings to their IP blocks, so it does not look good either.

Thanks,
Victor

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: Wednesday, July 27, 2022 12:57 AM
To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time

On 2022-07-26 05:40, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Reply inline.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: Tuesday, July 26, 2022 5:18 AM
> To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
> Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time
>
>
> On 2022-07-22 03:34, Victor Zhao wrote:
>> In multi container use case, reset time is important, so skip ring
>> tests and cp halt wait during ip suspending for reset as they are
>> going to fail and cost more time on reset
>
> Why are they failing in this case? Skipping ring tests is not the best idea, as you lose an important indicator of the system's sanity. Is there any way to make them work?
>
> [Victor]: I've seen the gfx ring test fail every time after a gfx engine hang. I thought that was expected, as gfx is in a bad state. Do you know the reason we run ring tests before reset, given that we are going to reset the ASIC anyway?
> Another approach could be to make the skip mode2-only, or to reduce the wait time here.

I dug through the history, and according to commit 'drm/amdgpu:unmap KCQ in gfx hw_fini(v2)' you need the write to the scratch register to confirm completion of the queue unmap operation, so you definitely don't want to just skip it. I agree it is pointless when the ring is hung, but remember that a GPU reset can happen not only for a hung ring but for other reasons (RAS, manual reset, etc.), in which case you probably want to shut down gracefully here. I see we have the adev->ip_blocks[i].status.hang flag, which you could maybe use here instead?

>
>
>> Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 26 +++++++++++++++----------
>>   2 files changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index 222d3d7ea076..f872495ccc3a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -477,7 +477,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
>>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
>>   					   RESET_QUEUES, 0, 0);
>>
>> -	if (adev->gfx.kiq.ring.sched.ready)
>> +	if (adev->gfx.kiq.ring.sched.ready && !amdgpu_in_reset(adev))
>>   		r = amdgpu_ring_test_helper(kiq_ring);
>>   	spin_unlock(&adev->gfx.kiq.ring_lock);
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> index fafbad3cf08d..9ae29023e38f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> @@ -5971,16 +5971,19 @@ static int gfx_v10_0_cp_gfx_enable(struct amdgpu_device *adev, bool enable)
>>   		WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
>>   	}
>>
>> -	for (i = 0; i < adev->usec_timeout; i++) {
>> -		if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> -			break;
>> -		udelay(1);
>> -	}
>> -
>> -	if (i >= adev->usec_timeout)
>> -		DRM_ERROR("failed to %s cp gfx\n", enable ? "unhalt" : "halt");
>> +	if (!amdgpu_in_reset(adev)) {
>> +		for (i = 0; i < adev->usec_timeout; i++) {
>> +			if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> +				break;
>> +			udelay(1);
>> +		}
>>
>> +		if (i >= adev->usec_timeout)
>> +			DRM_ERROR("failed to %s cp gfx\n",
>> +				  enable ? "unhalt" : "halt");
>> +	}
>>   	return 0;
>> +
>>   }
>
> This change has impact beyond the container case, no? We had no issues with this code during regular reset cases, so why would we give up on code that confirms the CP is idle? What is the side effect of skipping this during all GPU resets?
>
> Andrey
>
> [Victor]: I see "failed to halt cp gfx" in regular reset cases as well when doing a gfx hang test using quark. I haven't seen a side effect with Mode1 reset yet, but maybe shortening the wait time would be better?

Same as above, I guess: it would indeed time out for a hung ring, but GPU resets happen not only because of hung rings but for other reasons as well.

Andrey

>
>>
>>   static int gfx_v10_0_cp_gfx_load_pfp_microcode(struct amdgpu_device *adev)
>> @@ -7569,8 +7572,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct amdgpu_device *adev)
>>   	for (i = 0; i < adev->gfx.num_gfx_rings; i++)
>>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
>>   					   PREEMPT_QUEUES, 0, 0);
>> -
>> -	return amdgpu_ring_test_helper(kiq_ring);
>> +	if (!amdgpu_in_reset(adev))
>> +		return amdgpu_ring_test_helper(kiq_ring);
>> +	else
>> +		return 0;
>>   }
>>   #endif
>>
>> @@ -7610,6 +7615,7 @@ static int gfx_v10_0_hw_fini(void *handle)
>>
>>   		return 0;
>>   	}
>> +
>>   	gfx_v10_0_cp_enable(adev, false);
>>   	gfx_v10_0_enable_gui_idle_interrupt(adev, false);
>>