Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

Alex Deucher <alexdeucher@xxxxxxxxx> · Wed, 13 Dec 2023 14:44:42 -0500

On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
<mario.limonciello@xxxxxxx> wrote:
>
> On 12/13/2023 13:12, Mario Limonciello wrote:
> > On 12/13/2023 13:07, Alex Deucher wrote:
> >> On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
> >> <mario.limonciello@xxxxxxx> wrote:
> >>>
> >>> Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> >>> causes the first MES packet after resume to fail. This packet is
> >>> used to flush the TLB when GART is enabled.
> >>>
> >>> This issue is fixed in newer firmware, but as OEMs may not roll this
> >>> out to the field, introduce a workaround that will retry the flush
> >>> when detecting running on an older firmware and decrease relevant
> >>> error messages to debug while workaround is in use.
> >>>
> >>> Cc: stable@xxxxxxxxxxxxxxx # 6.1+
> >>> Cc: Tim Huang <Tim.Huang@xxxxxxx>
> >>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> >>> Signed-off-by: Mario Limonciello <mario.limonciello@xxxxxxx>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 ++++++++--
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
> >>>   drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 ++++++++++++++++-
> >>>   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++++++--
> >>>   4 files changed, 32 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> index 9ddbf1494326..6ce3f6e6b6de 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> @@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
> >>> amdgpu_device *adev,
> >>>          }
> >>>
> >>>          r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
> >>> -       if (r)
> >>> -               DRM_ERROR("failed to reg_write_reg_wait\n");
> >>> +       if (r) {
> >>> +               const char *msg = "failed to reg_write_reg_wait\n";
> >>> +
> >>> +               if (adev->mes.suspend_workaround)
> >>> +                       DRM_DEBUG(msg);
> >>> +               else
> >>> +                       DRM_ERROR(msg);
> >>> +       }
> >>>
> >>>   error:
> >>>          return r;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> index a27b424ffe00..90f2bba3b12b 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> @@ -135,6 +135,8 @@ struct amdgpu_mes {
> >>>
> >>>          /* ip specific functions */
> >>>          const struct amdgpu_mes_funcs   *funcs;
> >>> +
> >>> +       bool                            suspend_workaround;
> >>>   };
> >>>
> >>>   struct amdgpu_mes_process {
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> index 23d7b548d13f..e810c7bb3156 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> @@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
> >>> amdgpu_device *adev)
> >>>                  false : true;
> >>>
> >>>          adev->mmhub.funcs->set_fault_enable_default(adev, value);
> >>> -       gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +
> >>> +       do {
> >>> +               gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +               adev->mes.suspend_workaround = false;
> >>> +       } while (adev->mes.suspend_workaround);
> >>
> >> Shouldn't this be something like:
> >>
> >>> +       do {
> >>> +               gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +               adev->mes.suspend_workaround = false;
> >>> +               gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +       } while (adev->mes.suspend_workaround);
> >>
> >> If we actually need the flush.  Maybe a better approach would be to
> >> check if we are in s0ix in
> >
> > Ah you're right; I had shifted this around to keep less stateful
> > variables and push them up the stack from when I first made it and that
> > logic is wrong now.
> >
> > I don't think the one you suggested is right either; it's going to apply
> > twice on ASICs that only need it once.
> >
> > I guess pending on what Christian comments on below I'll respin to logic
> > that only calls twice on resume for these ASICs.
>
> One more comment.  Tim and I both did an experiment for this (skipping
> the flush) separately.  The problem isn't the flush itself, rather it's
> the first MES packet after exiting GFXOFF.
>
> So it seems that it pushes off the issue to the next thing which is a
> ring buffer test:
>
> [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.0
> (-110).
> [drm:process_one_work] *ERROR* ib ring test failed (-110).
>
> So maybe a better workaround is a "dummy" command that is only sent for
> the broken firmware that we don't care about the outcome and discard errors.
>
> Then the workaround doesn't need to get as entangled with correct flow.

Yeah. Something like that seems cleaner.  Just a question of where to
put it since we skip GC and MES for s0ix.  Probably somewhere in
gmc_v11_0_resume() before gmc_v11_0_gart_enable().  Maybe add a new
mes callback.

Alex

>
> >
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c in gmc_v11_0_flush_gpu_tlb():
> >> index 23d7b548d13f..bd6d9953a80e 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >> @@ -227,7 +227,8 @@ static void gmc_v11_0_flush_gpu_tlb(struct
> >> amdgpu_device *adev, uint32_t vmid,
> >>           * Directly use kiq to do the vm invalidation instead
> >>           */
> >>          if ((adev->gfx.kiq[0].ring.sched.ready ||
> >> adev->mes.ring.sched.ready) &&
> >> -           (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
> >> +           (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) ||
> >> +           !adev->in_s0ix) {
> >>                  amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack,
> >> inv_req,
> >>                                  1 << vmid, GET_INST(GC, 0));
> >>                  return;
> >>
> >> @Christian Koenig is this logic correct?
> >>
> >>          /* For SRIOV run time, driver shouldn't access the register
> >> through MMIO
> >>           * Directly use kiq to do the vm invalidation instead
> >>           */
> >>          if ((adev->gfx.kiq[0].ring.sched.ready ||
> >> adev->mes.ring.sched.ready) &&
> >>              (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
> >>                  amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack,
> >> inv_req,
> >>                                  1 << vmid, GET_INST(GC, 0));
> >>                  return;
> >>          }
> >>
> >> We basically always use the MES with that logic.  If that is the case,
> >> we should just drop the rest of that function.  Shouldn't we only use
> >> KIQ or MES for SR-IOV?  gmc v10 has similar logic which also seems
> >> wrong.
> >>
> >> Alex
> >>
> >>
> >>>
> >>>          DRM_INFO("PCIE GART of %uM enabled (table at 0x%016llX).\n",
> >>>                   (unsigned int)(adev->gmc.gart_size >> 20),
> >>> @@ -960,6 +964,17 @@ static int gmc_v11_0_resume(void *handle)
> >>>          int r;
> >>>          struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >>>
> >>> +       switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> >>> +       case IP_VERSION(13, 0, 4):
> >>> +       case IP_VERSION(13, 0, 11):
> >>> +               /* avoid problems with first TLB flush after resume */
> >>> +               if ((adev->pm.fw_version & 0x00FFFFFF) < 0x004c4900)
> >>> +                       adev->mes.suspend_workaround = adev->in_s0ix;
> >>> +               break;
> >>> +       default:
> >>> +               break;
> >>> +       }
> >>> +
> >>>          r = gmc_v11_0_hw_init(adev);
> >>>          if (r)
> >>>                  return r;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> >>> b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> >>> index 4dfec56e1b7f..84ab8c611e5e 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> >>> @@ -137,8 +137,12 @@ static int
> >>> mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
> >>>          r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq,
> >>>                        timeout);
> >>>          if (r < 1) {
> >>> -               DRM_ERROR("MES failed to response msg=%d\n",
> >>> -                         x_pkt->header.opcode);
> >>> +               if (mes->suspend_workaround)
> >>> +                       DRM_DEBUG("MES failed to response msg=%d\n",
> >>> +                                 x_pkt->header.opcode);
> >>> +               else
> >>> +                       DRM_ERROR("MES failed to response msg=%d\n",
> >>> +                                 x_pkt->header.opcode);
> >>>
> >>>                  while (halt_if_hws_hang)
> >>>                          schedule();
> >>> --
> >>> 2.34.1
> >>>
> >
>