Hi Christian, Right, maybe we could learn something from hardware, and find out the root cause. Best Wishes, Emily Deng > -----Original Message----- > From: Koenig, Christian > Sent: Thursday, March 22, 2018 4:54 PM > To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org > Cc: Liu, Monk <Monk.Liu at amd.com> > Subject: Re: [PATCH] drm/amdgpu: give more chance for tlb flush if failed > > > I think the issue reproduce is reliable. > That is the really important thing. > > We had the same issues with bare metal and working around them one by > one didn't helped at all in the long run (e.g. Vega10 was still horrible > unstable). > > The only real solution was finding the root cause and eliminating it. > For this issue as well as the one Monk investigated it turned out that reading > some registers can cause problems when there are concurrent writes going > on. E.g. the register read times out and leads to the write being dropped. > > If those issues prevail even after eliminating all potential causes on your side > we should probably setup a call with some hw people to discuss how to > proceed further. > > Regards, > Christian. > > Am 22.03.2018 um 04:50 schrieb Deng, Emily: > > Hi Christian, > > I agree with that the patch will hide the real problem, it is just a > workaround, I will change the patch as you suggest. > > as the sriov has lots of issues on the staging, maybe we could first > > submit the two workarounds, later, I will spend some time to find out the > root cause. > > I think the issue reproduce is reliable. > > > > Best Wishes, > > Emily Deng > > > > > >> -----Original Message----- > >> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] > >> Sent: Tuesday, March 20, 2018 6:24 PM > >> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org > >> Cc: Liu, Monk <Monk.Liu at amd.com> > >> Subject: Re: [PATCH] drm/amdgpu: give more chance for tlb flush if > >> failed > >> > >> Am 20.03.2018 um 07:29 schrieb Emily Deng: > >>> under SR-IOV sometimes CPU based tlb flush would timeout within the > >>> given 100ms period, instead let it fail and continue we can give it > >>> more chance to repeat the tlb flush on the failed VMHUB > >>> > >>> this could fix the massive "Timeout waiting for VM flush ACK" > >>> error during vk_encoder test. > >> Well that one is a big NAK since it once more just hides the real > >> problem that we sometimes drop register writes. > >> > >> What we did during debugging to avoid the problem is the following: > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> index a70cbc45c4c1..3536d50375fa 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> @@ -338,6 +338,10 @@ static void gmc_v9_0_flush_gpu_tlb(struct > >>> amdgpu_device *adev, > >>>                u32 tmp = gmc_v9_0_get_invalidate_req(vmid); > >>> > >>>                WREG32_NO_KIQ(hub->vm_inv_eng0_req + eng, tmp); > >>> +              while (RREG32_NO_KIQ(hub->vm_inv_eng0_req + eng) != > >>> +tmp) { > >>> +                      DRM_ERROR("Need one more try to write the > >>> VMHUB flush request!"); > >>> +                      WREG32_NO_KIQ(hub->vm_inv_eng0_req + eng, > >>> +tmp); > >>> +              } > >>> > >>>                /* Busy wait for ACK.*/ > >>>                for (j = 0; j < 100; j++) { > >> But that can only be a temporary workaround as well. > >> > >> The question is rather can you reliable reproduce this issue with the > >> vk_encoder test? > >> > >> Thanks, > >> Christian. > >> > >>> Signed-off-by: Monk Liu <Monk.Liu at amd.com> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 24 > +++++++++++++++++++-- > >> --- > >>> 1 file changed, 19 insertions(+), 5 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> index a70cbc4..517712b 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c > >>> @@ -329,13 +329,18 @@ static void gmc_v9_0_flush_gpu_tlb(struct > >> amdgpu_device *adev, > >>> { > >>> /* Use register 17 for GART */ > >>> const unsigned eng = 17; > >>> - unsigned i, j; > >>> + unsigned i, j, loop = 0; > >>> + unsigned flush_done = 0; > >>> + > >>> +retry: > >>> > >>> spin_lock(&adev->gmc.invalidate_lock); > >>> > >>> for (i = 0; i < AMDGPU_MAX_VMHUBS; ++i) { > >>> struct amdgpu_vmhub *hub = &adev->vmhub[i]; > >>> u32 tmp = gmc_v9_0_get_invalidate_req(vmid); > >>> + if (flush_done & (1 << i)) /* this vmhub flushed */ > >>> + continue; > >>> > >>> WREG32_NO_KIQ(hub->vm_inv_eng0_req + eng, tmp); > >>> > >>> @@ -347,8 +352,10 @@ static void gmc_v9_0_flush_gpu_tlb(struct > >> amdgpu_device *adev, > >>> break; > >>> cpu_relax(); > >>> } > >>> - if (j < 100) > >>> + if (j < 100) { > >>> + flush_done |= (1 << i); > >>> continue; > >>> + } > >>> > >>> /* Wait for ACK with a delay.*/ > >>> for (j = 0; j < adev->usec_timeout; j++) { @@ -358,15 +365,22 > >> @@ > >>> static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device *adev, > >>> break; > >>> udelay(1); > >>> } > >>> - if (j < adev->usec_timeout) > >>> + if (j < adev->usec_timeout) { > >>> + flush_done |= (1 << i); > >>> continue; > >>> - > >>> - DRM_ERROR("Timeout waiting for VM flush ACK!\n"); > >>> + } > >>> } > >>> > >>> spin_unlock(&adev->gmc.invalidate_lock); > >>> + if (flush_done != 3) { > >>> + if (loop++ < 3) > >>> + goto retry; > >>> + else > >>> + DRM_ERROR("Timeout waiting for VM flush ACK!\n"); > >>> + } > >>> } > >>> > >>> + > >>> static uint64_t gmc_v9_0_emit_flush_gpu_tlb(struct amdgpu_ring > *ring, > >>> unsigned vmid, uint64_t pd_addr) > >>> {