On 29/11/2022 21:12, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>
Engine resets are supposed to never happen. But in the case when one
Engine resets or engine reset failures? Hopefully the latter.
does (due to unknwon reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies on such a situation, it
is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.
Signed-off-by: John Harrison <John.C.Harrison@xxxxxxxxx>
---
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0a42f1807f52c..c82730804a1c4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
guc->submission_state.reset_fail_mask = 0;
spin_unlock_irqrestore(&guc->submission_state.lock, flags);
- if (likely(reset_fail_mask))
+ if (likely(reset_fail_mask)) {
+ struct intel_engine_cs *engine;
+ enum intel_engine_id id;
+
+ /*
+ * GuC is toast at this point - it dead loops after sending the failed
+ * reset notification. So need to manually determine the guilty context.
+ * Note that it should be safe/reliable to do this here because the GuC
+ * is toast and will not be scheduling behind the KMD's back.
+ */
+ for_each_engine_masked(engine, gt, reset_fail_mask, id)
+ intel_guc_find_hung_context(engine);
+
intel_gt_handle_error(gt, reset_fail_mask,
I915_ERROR_CAPTURE,
"GuC failed to reset engine mask=0x%x\n",
reset_fail_mask);
If GuC is defined by ABI contract to be dead, should the flow be
attempting to do a full GPU reset here, or maybe it happens somewhere
else as a consequence anyway? (In which case is the engine reset here
even needed?)
Regards,
Tvrtko
+ }
}
int intel_guc_engine_failure_process_msg(struct intel_guc *guc,