On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello <mario.limonciello@xxxxxxx> wrote: > > On 12/13/2023 13:12, Mario Limonciello wrote: > > On 12/13/2023 13:07, Alex Deucher wrote: > >> On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello > >> <mario.limonciello@xxxxxxx> wrote: > >>> > >>> Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that > >>> causes the first MES packet after resume to fail. This packet is > >>> used to flush the TLB when GART is enabled. > >>> > >>> This issue is fixed in newer firmware, but as OEMs may not roll this > >>> out to the field, introduce a workaround that will retry the flush > >>> when detecting running on an older firmware and decrease relevant > >>> error messages to debug while workaround is in use. > >>> > >>> Cc: stable@xxxxxxxxxxxxxxx # 6.1+ > >>> Cc: Tim Huang <Tim.Huang@xxxxxxx> > >>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045 > >>> Signed-off-by: Mario Limonciello <mario.limonciello@xxxxxxx> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 ++++++++-- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 2 ++ > >>> drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 17 ++++++++++++++++- > >>> drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 8 ++++++-- > >>> 4 files changed, 32 insertions(+), 5 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > >>> index 9ddbf1494326..6ce3f6e6b6de 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > >>> @@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct > >>> amdgpu_device *adev, > >>> } > >>> > >>> r = adev->mes.funcs->misc_op(&adev->mes, &op_input); > >>> - if (r) > >>> - DRM_ERROR("failed to reg_write_reg_wait\n"); > >>> + if (r) { > >>> + const char *msg = "failed to reg_write_reg_wait\n"; > >>> + > >>> + if (adev->mes.suspend_workaround) > >>> + DRM_DEBUG(msg); > >>> + else > >>> + DRM_ERROR(msg); > >>> + } > >>> > >>> error: > >>> return r; > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > >>> index a27b424ffe00..90f2bba3b12b 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > >>> @@ -135,6 +135,8 @@ struct amdgpu_mes { > >>> > >>> /* ip specific functions */ > >>> const struct amdgpu_mes_funcs *funcs; > >>> + > >>> + bool suspend_workaround; > >>> }; > >>> > >>> struct amdgpu_mes_process { > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > >>> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > >>> index 23d7b548d13f..e810c7bb3156 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > >>> @@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct > >>> amdgpu_device *adev) > >>> false : true; > >>> > >>> adev->mmhub.funcs->set_fault_enable_default(adev, value); > >>> - gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0); > >>> + > >>> + do { > >>> + gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0); > >>> + adev->mes.suspend_workaround = false; > >>> + } while (adev->mes.suspend_workaround); > >> > >> Shouldn't this be something like: > >> > >>> + do { > >>> + gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0); > >>> + adev->mes.suspend_workaround = false; > >>> + gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0); > >>> + } while (adev->mes.suspend_workaround); > >> > >> If we actually need the flush. Maybe a better approach would be to > >> check if we are in s0ix in > > > > Ah you're right; I had shifted this around to keep less stateful > > variables and push them up the stack from when I first made it and that > > logic is wrong now. > > > > I don't think the one you suggested is right either; it's going to apply > > twice on ASICs that only need it once. > > > > I guess pending on what Christian comments on below I'll respin to logic > > that only calls twice on resume for these ASICs. > > One more comment. Tim and I both did an experiment for this (skipping > the flush) separately. The problem isn't the flush itself, rather it's > the first MES packet after exiting GFXOFF. > > So it seems that it pushes off the issue to the next thing which is a > ring buffer test: > > [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.0 > (-110). > [drm:process_one_work] *ERROR* ib ring test failed (-110). > > So maybe a better workaround is a "dummy" command that is only sent for > the broken firmware that we don't care about the outcome and discard errors. > > Then the workaround doesn't need to get as entangled with correct flow. Yeah. Something like that seems cleaner. Just a question of where to put it since we skip GC and MES for s0ix. Probably somewhere in gmc_v11_0_resume() before gmc_v11_0_gart_enable(). Maybe add a new mes callback. Alex > > > > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > >> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c in gmc_v11_0_flush_gpu_tlb(): > >> index 23d7b548d13f..bd6d9953a80e 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > >> @@ -227,7 +227,8 @@ static void gmc_v11_0_flush_gpu_tlb(struct > >> amdgpu_device *adev, uint32_t vmid, > >> * Directly use kiq to do the vm invalidation instead > >> */ > >> if ((adev->gfx.kiq[0].ring.sched.ready || > >> adev->mes.ring.sched.ready) && > >> - (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) { > >> + (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) || > >> + !adev->in_s0ix) { > >> amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, > >> inv_req, > >> 1 << vmid, GET_INST(GC, 0)); > >> return; > >> > >> @Christian Koenig is this logic correct? > >> > >> /* For SRIOV run time, driver shouldn't access the register > >> through MMIO > >> * Directly use kiq to do the vm invalidation instead > >> */ > >> if ((adev->gfx.kiq[0].ring.sched.ready || > >> adev->mes.ring.sched.ready) && > >> (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) { > >> amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, > >> inv_req, > >> 1 << vmid, GET_INST(GC, 0)); > >> return; > >> } > >> > >> We basically always use the MES with that logic. If that is the case, > >> we should just drop the rest of that function. Shouldn't we only use > >> KIQ or MES for SR-IOV? gmc v10 has similar logic which also seems > >> wrong. > >> > >> Alex > >> > >> > >>> > >>> DRM_INFO("PCIE GART of %uM enabled (table at 0x%016llX).\n", > >>> (unsigned int)(adev->gmc.gart_size >> 20), > >>> @@ -960,6 +964,17 @@ static int gmc_v11_0_resume(void *handle) > >>> int r; > >>> struct amdgpu_device *adev = (struct amdgpu_device *)handle; > >>> > >>> + switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) { > >>> + case IP_VERSION(13, 0, 4): > >>> + case IP_VERSION(13, 0, 11): > >>> + /* avoid problems with first TLB flush after resume */ > >>> + if ((adev->pm.fw_version & 0x00FFFFFF) < 0x004c4900) > >>> + adev->mes.suspend_workaround = adev->in_s0ix; > >>> + break; > >>> + default: > >>> + break; > >>> + } > >>> + > >>> r = gmc_v11_0_hw_init(adev); > >>> if (r) > >>> return r; > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c > >>> b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c > >>> index 4dfec56e1b7f..84ab8c611e5e 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c > >>> @@ -137,8 +137,12 @@ static int > >>> mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes, > >>> r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq, > >>> timeout); > >>> if (r < 1) { > >>> - DRM_ERROR("MES failed to response msg=%d\n", > >>> - x_pkt->header.opcode); > >>> + if (mes->suspend_workaround) > >>> + DRM_DEBUG("MES failed to response msg=%d\n", > >>> + x_pkt->header.opcode); > >>> + else > >>> + DRM_ERROR("MES failed to response msg=%d\n", > >>> + x_pkt->header.opcode); > >>> > >>> while (halt_if_hws_hang) > >>> schedule(); > >>> -- > >>> 2.34.1 > >>> > > >