On Mon, Dec 20, 2021 at 03:00:53PM +0000, Tvrtko Ursulin wrote:
>
> On 17/12/2021 16:22, Matthew Brost wrote:
> > On Fri, Dec 17, 2021 at 12:15:53PM +0000, Tvrtko Ursulin wrote:
> > >
> > > On 14/12/2021 15:07, Tvrtko Ursulin wrote:
> > > > From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > >
> > > > Log engine resets done by the GuC firmware in the similar way it is done
> > > > by the execlists backend.
> > > >
> > > > This way we have notion of where the hangs are before the GuC gains
> > > > support for proper error capture.
> > >
> > > Ping - any interest to log this info?
> > >
> > > All there currently is a non-descriptive "[drm] GPU HANG: ecode
> > > 12:0:00000000".
> > >
> >
> > Yea, this could be helpful. One suggestion below.
> >
> > > Also, will GuC be reporting the reason for the engine reset at any point?
> > >
> >
> > We are working on the error state capture, presumably the registers will
> > give a clue what caused the hang.
> >
> > As for the GuC providing a reason, that isn't defined in the interface
> > but that is decent idea to provide a hint in G2H what the issue was. Let
> > me run that by the i915 GuC developers / GuC firmware team and see what
> > they think.
> >
> > > Regards,
> > >
> > > Tvrtko
> > >
> > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > Cc: John Harrison <John.C.Harrison@xxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 +++++++++++-
> > > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 97311119da6f..51512123dc1a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -11,6 +11,7 @@
> > > >  #include "gt/intel_context.h"
> > > >  #include "gt/intel_engine_pm.h"
> > > >  #include "gt/intel_engine_heartbeat.h"
> > > > +#include "gt/intel_engine_user.h"
> > > >  #include "gt/intel_gpu_commands.h"
> > > >  #include "gt/intel_gt.h"
> > > >  #include "gt/intel_gt_clock_utils.h"
> > > > @@ -3934,9 +3935,18 @@ static void capture_error_state(struct intel_guc *guc,
> > > >  {
> > > >  	struct intel_gt *gt = guc_to_gt(guc);
> > > >  	struct drm_i915_private *i915 = gt->i915;
> > > > -	struct intel_engine_cs *engine = __context_to_physical_engine(ce);
> > > > +	struct intel_engine_cs *engine = ce->engine;
> > > >  	intel_wakeref_t wakeref;
> > > > +	if (intel_engine_is_virtual(engine)) {
> > > > +		drm_notice(&i915->drm, "%s class, engines 0x%x; GuC engine reset\n",
> > > > +			   intel_engine_class_repr(engine->class),
> > > > +			   engine->mask);
> > > > +		engine = guc_virtual_get_sibling(engine, 0);
> > > > +	} else {
> > > > +		drm_notice(&i915->drm, "%s GuC engine reset\n", engine->name);
> >
> > Probably include the guc_id of the context too then?
>
> Is the guc id stable and useful on its own - who would be the user?
>

Technically not stable, but in practice it is. A user could be
correlating the context that was reset with a GuC log.

More debug info is typically better.

Matt

> Regards,
>
> Tvrtko
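
For illustration only, a userspace sketch of how the two messages from the
patch might read with the guc_id appended. The mock_* types, the guc_id
field and the "guc_id=%u" suffix are assumptions made up for this example;
they are not the real i915 structures or a format agreed in this thread.
In the driver the strings would go through drm_notice() instead of
snprintf().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the i915 types; NOT the real kernel structs. */
struct mock_engine {
	const char *name;       /* physical engine name, e.g. "rcs0" */
	const char *class_name; /* what intel_engine_class_repr() would return */
	uint32_t mask;          /* bitmask of physical engines */
	int is_virtual;         /* non-zero for a virtual engine */
};

struct mock_context {
	const struct mock_engine *engine;
	uint16_t guc_id;        /* assumed: GuC id of the context that was reset */
};

/*
 * Format the reset notice into buf, mirroring the two branches of the
 * patch and appending the guc_id as suggested in review.
 * Returns the snprintf() result.
 */
static int format_guc_reset_msg(char *buf, size_t len,
				const struct mock_context *ce)
{
	const struct mock_engine *engine;

	assert(buf && ce && ce->engine);
	engine = ce->engine;

	if (engine->is_virtual)
		return snprintf(buf, len,
				"%s class, engines 0x%x; GuC engine reset, guc_id=%u",
				engine->class_name,
				(unsigned int)engine->mask,
				(unsigned int)ce->guc_id);

	return snprintf(buf, len, "%s GuC engine reset, guc_id=%u",
			engine->name, (unsigned int)ce->guc_id);
}
```

With the id in the message, a reset line in dmesg could be matched against
entries for the same context id in a GuC log, which is the use case
mentioned above.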