Re: [PATCH] drm/i915/gt: Temporarily force MTL into uncached mode

On Tue, Oct 10, 2023 at 06:17:27PM +0200, Andi Shyti wrote:
> Hi Matt,
> 
> > > > > FIXME: CAT errors are cropping up on MTL.  This removes them,
> > > > > but the real root cause must still be diagnosed.
> > > > 
> > > > Do you have a link to specific IGT test(s) that illustrate the CAT
> > > > errors so that we can ensure that they now appear fixed in CI?
> > > 
> > > this one:
> > > 
> > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@xxxxxxxxxxxxxx
> > > 
> > > Andi
> > 
> > Wait, now I'm confused.  That's a failure caused by a different patch
> > series (one that we won't be moving forward with).  The live@hugepages
> > test is always passing on drm-tip today:
> > https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@xxxxxxxxxxxxxx
> 
> yes, true, but that patch allows us to move forward with the
> testing and hit the CAT error.
> 
> (it was the most reachable link I found :))
> 
> > Is there a test that's giving CAT errors on drm-tip itself (even
> > sporadically) that we can monitor to see the impact of Jonathan's patch
> > here?
> 
> Otherwise this one:
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13667/re-mtlp-3/igt@gem_exec_fence@xxxxxxxxxxxxx#dmesg-warnings11

Okay, looks like this is a pretty sporadic failure:

        https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_fence@parallel@xxxxxxxxx

so we'll need to monitor this for quite a while to make sure it's truly
gone.  Assuming you've done enough local test cycles to confirm that
this definitely avoids the CAT errors,

Acked-by: Matt Roper <matthew.d.roper@xxxxxxxxx>

as a short-term mitigation while we debug further.  We still need to
continue searching for a proper fix and/or drive this through the
hardware team and get them to document this as a new official workaround
for some kind of cache coherency problem.

BTW, it would also be good to have a patch that adds explicit handling
for GuC action 0x6000 (GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR) so
that we'll at least have more meaningful error output if/when this is
encountered in the future.
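
As a rough illustration of the kind of handling suggested above, here is a
standalone userspace sketch, not actual i915 code. Only the action ID 0x6000
and its name come from this mail; the function handle_g2h_action(), its
signature, and the assumption that the first payload dword identifies the
offending context are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Action ID and name as quoted in the mail above. */
#define GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR 0x6000

/*
 * Hypothetical dispatcher for GuC-to-host notifications.
 * Writes a human-readable message into 'out' and returns 0 if the
 * action was recognised, -1 otherwise.
 *
 * ASSUMPTION: payload[0], when present, identifies the context that
 * triggered the CAT error. The real message layout may differ.
 */
static int handle_g2h_action(uint32_t action, const uint32_t *payload,
			     uint32_t len, char *out, size_t outsz)
{
	switch (action) {
	case GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR:
		if (payload && len >= 1)
			snprintf(out, outsz,
				 "GuC reported memory CAT error, context 0x%x",
				 (unsigned int)payload[0]);
		else
			snprintf(out, outsz,
				 "GuC reported memory CAT error (no payload)");
		return 0;
	default:
		/* Unknown actions still get logged, just less usefully. */
		snprintf(out, outsz, "unhandled G2H action 0x%x",
			 (unsigned int)action);
		return -1;
	}
}
```

The point of the explicit case is only that a CAT error would be reported by
name instead of falling through to a generic "unhandled action" message.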


Matt

> 
> Andi

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation


