This is a clear fix: after TDR we restore a compute ring's HQD from its MQD, but the MQD only records WPTR_ADDR_LO/HI. So once the HQD is restored, the MEC immediately reads the wptr value through WPTR_ADDR_LO/HI, which points into write-back (WB) memory, and that value is sometimes non-zero (TDR does not clear WB, so it still holds whatever the hung process left there). The MEC then assumes there are commands in the RB (since RPTR != WPTR), which leads to a further hang.

Reviewed-by: Monk Liu <monk.liu@xxxxxxx>

_____________________________________
Monk Liu|GPU Virtualization Team |AMD

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Friday, February 28, 2020 5:20 PM
To: Tao, Yintian <Yintian.Tao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: clean wptr on wb when gpu recovery

Am 28.02.20 um 07:31 schrieb Yintian Tao:
> The TDR will randomly fail due to a compute ring test failure. If the
> compute ring wptr & 0x7ff (ring_buf_mask) is 0x100, then after mapping
> the MQD the compute ring rptr will be synced to 0x100. The ring test
> packet size is also 0x100, so after the invocation of
> amdgpu_ring_commit the CP will not actually process the packet on the
> ring buffer because rptr is equal to wptr.
>
> Signed-off-by: Yintian Tao <yttao@xxxxxxx>

Off hand that looks correct to me, but I can't fully judge whether it will have any negative side effects.

Patch is Acked-by: Christian König <christian.koenig@xxxxxxx> for now.

Monk, according to git you modified that function as well. Could this have any potential negative effect for SRIOV? I don't think so, but better safe than sorry.

Regards,
Christian.
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 1 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 44f00ecea322..5df1a6d45457 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -3508,6 +3508,7 @@ static int gfx_v10_0_kcq_init_queue(struct amdgpu_ring *ring)
>
>  		/* reset ring buffer */
>  		ring->wptr = 0;
> +		atomic64_set((atomic64_t *)&adev->wb.wb[ring->wptr_offs], 0);
>  		amdgpu_ring_clear_ring(ring);
>  	} else {
>  		amdgpu_ring_clear_ring(ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 4135e4126e82..ac22490e8656 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3664,6 +3664,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
>
>  		/* reset ring buffer */
>  		ring->wptr = 0;
> +		atomic64_set((atomic64_t *)&adev->wb.wb[ring->wptr_offs], 0);
>  		amdgpu_ring_clear_ring(ring);
>  	} else {
>  		amdgpu_ring_clear_ring(ring);

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx