Re: [PATCH] drm/i915/gt: Temporarily force MTL into uncached mode

On 10/10/2023 09:44, Matt Roper wrote:
On Tue, Oct 10, 2023 at 05:42:28PM +0100, Tvrtko Ursulin wrote:
On 10/10/2023 17:17, Andi Shyti wrote:
Hi Matt,

FIXME: CAT errors are cropping up on MTL.  This removes them,
but the real root cause must still be diagnosed.
Do you have a link to specific IGT test(s) that illustrate the CAT
errors so that we can ensure that they now appear fixed in CI?
this one:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124599v1/bat-mtlp-8/igt@i915_selftest@live@xxxxxxxxxxxxxx

Andi
Wait, now I'm confused.  That's a failure caused by a different patch
series (one that we won't be moving forward with).  The live@hugepages
test is always passing on drm-tip today:
https://intel-gfx-ci.01.org/tree/drm-tip/igt@i915_selftest@live@xxxxxxxxxxxxxx
yes, true, but that patch allows us to move forward with the
testing and hit the CAT error.

(it was the most reachable link I found :))

Is there a test that's giving CAT errors on drm-tip itself (even
sporadically) that we can monitor to see the impact of Jonathan's patch
here?
Otherwise this one:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13667/re-mtlp-3/igt@gem_exec_fence@xxxxxxxxxxxxx#dmesg-warnings11
Parachuting in on a tangent - please do not mix up CAT and CT errors. CAT, for me at least, means CATastrophic faults reported over the CT channel, such as GuC page faults, IIRC.

For CT errors, maybe the GuC folks can shed some light on what they mean.
0x6000 is GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR so this actually
is a CAT error, delivered via the CT channel.
The history is that catastrophic memory errors (CAT is an abbreviation, not an acronym) are never meant to happen in the upstream driver, because we map all invalid addresses to a scratch page and silently hide such accesses. Hence there has been pushback on adding support for an error channel which is officially impossible to hit. The problem is that we keep hitting it due to hardware and/or software bugs.

Because there is no official support for handling this notification, the CT layer reports it as an unexpected notification and barfs. As far as the CT layer is concerned, it is a corrupted packet from GuC. And thus the error reporting looks totally weird for what is just an illegal address access from some random part of the GPU. And note that it is very unlikely that GuC itself caused the page fault. It is much more plausible that it came from an engine/EU/batch buffer instruction. Although, as noted, the fundamental cause is believed to be broken page table updates due to cache coherency issues.
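
To illustrate the point about the reporting, here is a rough standalone sketch (not the actual i915 CT code) of how a dispatcher that knows about action 0x6000 could label it as a memory CAT error instead of falling through to the "unexpected notification" path. Only the action value and its name come from this thread; the enum, the function names and the message strings are hypothetical and exist purely for illustration.

```c
#include <stdint.h>

/* From this thread: action 0x6000 is the memory CAT error notification. */
#define GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR 0x6000u

/* Hypothetical classification results, for illustration only. */
enum g2h_result {
	G2H_MEMORY_CAT_ERROR,	/* recognised: illegal address access on the GPU */
	G2H_UNEXPECTED,		/* unknown action: looks like a corrupted packet */
};

/* Hypothetical dispatcher; the real CT layer is structured differently. */
static enum g2h_result classify_g2h_action(uint32_t action)
{
	switch (action) {
	case GUC_ACTION_GUC2HOST_NOTIFY_MEMORY_CAT_ERROR:
		return G2H_MEMORY_CAT_ERROR;
	default:
		/* Without a dedicated case, 0x6000 would end up here too. */
		return G2H_UNEXPECTED;
	}
}

/* Map the classification to a log message (message text is made up). */
static const char *describe_g2h_action(uint32_t action)
{
	switch (classify_g2h_action(action)) {
	case G2H_MEMORY_CAT_ERROR:
		return "memory CAT error (illegal address access)";
	default:
		return "unexpected G2H notification";
	}
}
```

With something along these lines, describe_g2h_action(0x6000) would name the CAT error directly, while any action without its own case would still be reported as unexpected.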

John.



Matt

Regards,

Tvrtko



