Re: [PATCH] drm/i915: fix SFC reset flow

Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx> · Tue, 17 Sep 2019 13:53:11 -0700






On 9/17/2019 12:49 PM, Chris Wilson wrote:
Quoting Daniele Ceraolo Spurio (2019-09-17 20:45:02)

On 9/17/2019 11:57 AM, Chris Wilson wrote:
Quoting Daniele Ceraolo Spurio (2019-09-17 19:36:35)
On 9/17/2019 12:57 AM, Chris Wilson wrote:
Quoting Daniele Ceraolo Spurio (2019-09-16 22:41:04)
Our assumption that the we can ask the HW to lock the SFC even if not
currently in use does not match the HW commitment. The expectation from
the HW is that SW will not try to lock the SFC if the engine is not
using it and if we do that the behavior is undefined; on ICL the HW
ends up to returning the ack and ignoring our lock request, but this is
not guaranteed and we shouldn't expect it going forward.

Reported-by: Owen Zhang <owen.zhang@xxxxxxxxx>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
---
@@ -366,10 +368,13 @@ static u32 gen11_lock_sfc(struct intel_engine_cs *engine)
                                            sfc_forced_lock_ack_bit,
                                            sfc_forced_lock_ack_bit,
                                            1000, 0, NULL)) {
-               DRM_DEBUG_DRIVER("Wait for SFC forced lock ack failed\n");
+               /* did we race the unlock? */
+               if (intel_uncore_read_fw(uncore, sfc_usage) & sfc_usage_bit)
+                       DRM_ERROR("Wait for SFC forced lock ack failed\n");
What's our plan if this *ERROR* is ever triggered?

If it remains do nothing and check the logs on death, then it remains
just a debug splat. If there is a plan to actually do something to
handle the error, do it!
-Chris
AFAIU the only thing we can do is escalate to full gpu reset. However,
the probability of this failing should be next to non-existent (only one
engine can use the SFC at any time so there is no lock contention), so
I'm not convinced the fallback is worth the effort. The error is still
useful IMO to catch unexpected behavior on new platforms, as it happened
in this case with the media team reporting seeing this message on gen12
with the previous behavior. This said, I'm happy to add the extra logic
if you believe it is worth it.
We've see this message on every icl run!
-Chris
I've never noticed it, which tests are hitting it? My understanding from
what the HW team said is that on ICL the ack will always come back (even
if it is not part of the "official" SW/HW interface) and the HW tweak
that stops that is a gen12 change. Something else might be wrong is this
is firing off in our ICL CI, also because I don't think we have any test
case that actually uses the SFC, do we?
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6911/fi-icl-u2/igt@i915_selftest@live_hangcheck.html

All icl, live_hangcheck or live_reset, for as long as I can remember.
-Chris

Thanks. I'm going to check with the HW team to see what their 
recommended timeout value is, in case ours is too short. It could also 
be that even on ICL the ack is not always returned if the SFC is not 
actually in use.

Daniele

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx