On Thu, Dec 16, 2021 at 09:52:26AM +0000, Tvrtko Ursulin wrote: > > > On 09/12/2021 12:01, Tejas Upadhyay wrote: > > Most code paths does full reset with preparing all > > engines for reset except below two : > > > > 1. Single engine reset needs to prepare engines for > > reset based on its reset domain. In __intel_engine > > _reset_bh is a place needs loop over to do engine > > prepare for all engines which are in same reset > > domain before triggering reset. > > > > 2. enable_error_interrupt() in drivers/gpu/drm/i915/ > > gt/intel_execlists_submission.c needs similar change. > > > > whenever there is full reset done, engine prepare for > > all engines are already being called right now before > > actual reset triggered, except above two scenario > > seeking single engine reset. > > > > Note: Requirement of this change is occurred recently > > because whenever engine does reset, all engines in > > same reset domain gets reset and in case engine goes > > for reset before stopping CS or applying required W/A, > > there are high chances of hang/crash. reset_prepare_ > > engine takes care of it. > > On the trivial level, please wrap your commit messages to standard 75 wide. > See Documentation/process/submitting-patches.rst. > > > > > Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@xxxxxxxxx> > > --- > > drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 9 +++++++++ > > drivers/gpu/drm/i915/gt/intel_reset.c | 12 ++++++++++-- > > drivers/gpu/drm/i915/gt/intel_reset.h | 1 + > > 3 files changed, 20 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > index a69df5e9e77a..668e7ba5b254 100644 > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > @@ -2806,6 +2806,15 @@ static void enable_error_interrupt(struct intel_engine_cs *engine) > > drm_err(&engine->i915->drm, > > "engine '%s' resumed still in error: %08x\n", > > engine->name, status); > > + if (engine->reset_domain) { > > + struct intel_engine_cs *nengine; > > + enum intel_engine_id id; > > + > > + for_each_engine(nengine, engine->gt, id) > > + if (nengine->reset_domain == > > + engine->reset_domain) > > + reset_prepare_engine(nengine); > > + } > > __intel_gt_reset(engine->gt, engine->mask); > > } > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c > > index 63199f0550e6..454d6ab1d9f4 100644 > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c > > @@ -705,7 +705,7 @@ int intel_reset_guc(struct intel_gt *gt) > > * Ensure irq handler finishes, and not run again. > > * Also return the active request so that we only search for it once. > > */ > > -static void reset_prepare_engine(struct intel_engine_cs *engine) > > +void reset_prepare_engine(struct intel_engine_cs *engine) > > { > > /* > > * During the reset sequence, we must prevent the engine from > > @@ -1167,7 +1167,15 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg) > > if (!intel_engine_pm_get_if_awake(engine)) > > return 0; > > - reset_prepare_engine(engine); > > + if (engine->reset_domain) { > > Can it be zero? > > > + struct intel_engine_cs *nengine; > > + enum intel_engine_id id; > > + > > + for_each_engine(nengine, gt, id) > > + if (nengine->reset_domain == > > + engine->reset_domain) > > + reset_prepare_engine(nengine); > > Having glanced over the discussion about the workaround you are trying to > implement a few times, this is what I think. > > First of all I think you are definitely missing the engine resume side of > things. Consider this entry point: > > execlists_submission_tasklet > execlists_reset > intel_engine_reset > __intel_engine_reset_bh > > AFAICT you can halt more than one engine but you will not resume them all. > What I mean here is that __intel_engine_reset, intel_engine_resume, > reset_finish_engine and even the engine pm management end up being called > for only one engine. Also absence of locking via I915_RESET_ENGINE in gt > reset flags. > > Secondly, the question on whether it is acceptable to corrupt the state for > the other engine seems not to have a 100% clear answer. Can this be > confirmed? Because in the ideal world you would really need to preempt to > idle the other engines from the engine group and only then would be able to > proceed. we shouldn't do the reset without checking if it is idle or preempting to idle. and handle the FORCE_WAKE case as well... > > If that is considered a problem for later, with the first step being adding > of a simpler workaround to prevent a more serious lockup, then it may be > acceptable with some tweaks. > > I suspect there isn't a proper IGT for this otherwise you would have noticed > the failure to resume the coupled engines? So I think next step is to write > that IGT and then make sure code is robust. > > Sufficient testing will probably drive the implementation. > > You may for instance need to lock all engines for reset starting from > execlists_reset() and you may end up uncovering locking/ordering challenges > as you go. and everything else that Tvrtko said... we cannot go with this as is without ensuring that we have a good flow with the scheduler. > > Regards, > > Tvrtko > > > + } > > if (msg) > > drm_notice(&engine->i915->drm, > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h > > index adc734e67387..7abd5d49f0e5 100644 > > --- a/drivers/gpu/drm/i915/gt/intel_reset.h > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.h > > @@ -28,6 +28,7 @@ void intel_gt_handle_error(struct intel_gt *gt, > > const char *fmt, ...); > > #define I915_ERROR_CAPTURE BIT(0) > > +void reset_prepare_engine(struct intel_engine_cs *engine); > > void intel_gt_reset(struct intel_gt *gt, > > intel_engine_mask_t stalled_mask, > > const char *reason); > >