On Tue, Nov 29, 2022 at 01:12:53PM -0800, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>

Engine resets are supposed to never happen. But in the case when one
does (due to unknown reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies on such a situation, it
is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.

Signed-off-by: John Harrison <John.C.Harrison@xxxxxxxxx>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0a42f1807f52c..c82730804a1c4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);

-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be safe/reliable to do this here because the GuC
+		 * is toast and will not be scheduling behind the KMD's back.
+		 */

Is it defined by the KMD-GuC interface that, following a failed reset notification, GuC will always dead-loop or otherwise not schedule anything (even on other engines) until the KMD takes some action? If so, what action should the KMD take?

Regards,
Umesh

+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
 				      "GuC failed to reset engine mask=0x%x\n",
 				      reset_fail_mask);
+	}
 }

 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
--
2.37.3
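
For readers unfamiliar with the pattern, the masked loop added by the patch can be pictured with a standalone sketch. This is not the i915 implementation: the engine IDs, find_hung_context() and search_failed_engines() below are hypothetical stand-ins, and the sketch only assumes that each set bit in reset_fail_mask names one engine whose reset failed.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical engine IDs; the real driver uses enum intel_engine_id. */
enum engine_id { ENGINE_RCS0, ENGINE_BCS0, ENGINE_VCS0, ENGINE_VECS0, NUM_ENGINES };

/*
 * Hypothetical stand-in for intel_guc_find_hung_context(). It only records
 * which engine was searched; the real function inspects the engine's
 * contexts and requests.
 */
static void find_hung_context(enum engine_id id, uint32_t *visited)
{
	*visited |= UINT32_C(1) << id;
	printf("searching engine %d for a hung context\n", (int)id);
}

/*
 * Visit every known engine whose bit is set in reset_fail_mask, mirroring
 * the shape of the masked loop in the patch. Bits that do not correspond
 * to a known engine are ignored. Returns the number of engines searched.
 */
static unsigned int search_failed_engines(uint32_t reset_fail_mask, uint32_t *visited)
{
	unsigned int count = 0;
	int id;

	*visited = 0;
	for (id = 0; id < NUM_ENGINES; id++) {
		if (!(reset_fail_mask & (UINT32_C(1) << id)))
			continue;
		find_hung_context((enum engine_id)id, visited);
		count++;
	}
	return count;
}
```

The sketch searches every flagged engine before any error handling would run, which matches the ordering in the patch: the search happens first, then intel_gt_handle_error() is called once for the whole mask.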