On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote: > From: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx> > > This is a preparatory patch which modifies error handler to do per engine > hang recovery. The actual patch which implements this sequence follows > later in the series. The aim is to prepare existing recovery function to > adapt to this new function where applicable (which fails at this point > because core implementation is lacking) and continue recovery using legacy > full gpu reset. > > A helper function is also added to query the availability of engine > reset. > > The error events behaviour that are used to notify user of reset are > adapted to engine reset such that it doesn't break users listening to these > events. In legacy we report an error event, a reset event before resetting > the gpu and a reset done event marking the completion of reset. The same > behaviour is adapted but reset event is only dispatched once even when > multiple engines are hung. Finally once reset is complete we send reset > done event as usual. > > Note that this implementation of engine reset is for i915 directly > submitting to the ELSP, where the driver manages the hang detection, > recovery and resubmission. With GuC submission these tasks are shared > between driver and firmware; i915 will still responsible for detecting a > hang, and when it does it will have to request GuC to reset that Engine and > remind the firmware about the outstanding submissions. This will be > added in different patch. > > v2: rebase, advertise engine reset availability in platform definition, > add note about GuC submission. > v3: s/*engine_reset*/*reset_engine*/. (Chris) > Handle reset as 2 level resets, by first going to engine only and fall > backing to full/chip reset as needed, i.e. reset_engine will need the > struct_mutex. > v4: Pass the engine mask to i915_reset. (Chris) > v5: Rebase, update selftests. > v6: Rebase, prepare for mutex-less reset engine. > v7: Pass reset_engine mask as a function parameter, and iterate over the > engine mask for reset_engine. (Chris) > > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > Signed-off-by: Ian Lister <ian.lister@xxxxxxxxx> > Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx> > Signed-off-by: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx> > Signed-off-by: Michel Thierry <michel.thierry@xxxxxxxxx> > --- > drivers/gpu/drm/i915/i915_drv.c | 15 +++++++++++++++ > drivers/gpu/drm/i915/i915_drv.h | 3 +++ > drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++++++++++--- > drivers/gpu/drm/i915/i915_pci.c | 5 ++++- > drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++ > 5 files changed, 63 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c > index c7d68e789642..48c8b69d9bde 100644 > --- a/drivers/gpu/drm/i915/i915_drv.c > +++ b/drivers/gpu/drm/i915/i915_drv.c > @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv) > if (!test_bit(I915_RESET_HANDOFF, &error->flags)) > return; > > + DRM_DEBUG_DRIVER("resetting chip\n"); This is redundant since we have a "Resetting chip" already here. Just kill it. > + > /* Clear any previous failed attempts at recovery. Time to try again. */ > if (!i915_gem_unset_wedged(dev_priv)) > goto wakeup; > @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv) > goto finish; > } > > +/** > + * i915_reset_engine - reset GPU engine to recover from a hang > + * @engine: engine to reset > + * > + * Reset a specific GPU engine. Useful if a hang is detected. > + * Returns zero on successful reset or otherwise an error code. > + */ > +int i915_reset_engine(struct intel_engine_cs *engine) > +{ > + /* FIXME: replace me with engine reset sequence */ > + return -ENODEV; > +} > + > static int i915_pm_suspend(struct device *kdev) > { > struct pci_dev *pdev = to_pci_dev(kdev); > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > index e06af46f5a57..ab7e68626c49 100644 > --- a/drivers/gpu/drm/i915/i915_drv.h > +++ b/drivers/gpu/drm/i915/i915_drv.h > @@ -814,6 +814,7 @@ struct intel_csr { > func(has_ddi); \ > func(has_decoupled_mmio); \ > func(has_dp_mst); \ > + func(has_reset_engine); \ > func(has_fbc); \ > func(has_fpga_dbg); \ > func(has_full_ppgtt); \ > @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev); > extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask); > extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv); > extern void i915_reset(struct drm_i915_private *dev_priv); > +extern int i915_reset_engine(struct intel_engine_cs *engine); > +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); > extern int intel_guc_reset(struct drm_i915_private *dev_priv); > extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); > extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index fd97fe00cd0d..3a59ef1367ec 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) > /** > * i915_reset_and_wakeup - do process context error handling work > * @dev_priv: i915 device private > + * @engine_mask: engine(s) hung - for reset-engine only. > * > * Fire an error uevent so userspace can see that a hang or error > * was detected. > */ > -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) > +static void > +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask) > { > struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj; > char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; > @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) > > kobject_uevent_env(kobj, KOBJ_CHANGE, error_event); > > - DRM_DEBUG_DRIVER("resetting chip\n"); > + /* > + * This event needs to be sent before performing gpu reset. When > + * engine resets are supported we iterate through all engines and > + * reset hung engines individually. To keep the event dispatch > + * mechanism consistent with full gpu reset, this is only sent once > + * even when multiple engines are hung. It is also safe to move this > + * here because when we are in this function, we will definitely > + * perform gpu reset. > + */ > kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event); > > + /* try engine reset first, and continue if fails; look mom, no mutex! */ > + if (intel_has_reset_engine(dev_priv)) { > + struct intel_engine_cs *engine; > + unsigned int tmp; > + > + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { > + if (i915_reset_engine(engine) == 0) > + engine_mask &= ~intel_engine_flag(engine); > + } > + > + if (engine_mask) > + DRM_WARN("per-engine reset failed, promoting to full gpu reset\n"); > + else > + goto finish; This will look nicer if we did just try per-engine reset and then quitely promote (it's not that quiet as we do get logging) to global. for_each_engine_masked() {} if (!engine_mask) > + } > + > intel_prepare_reset(dev_priv); > > set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags); > @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) > kobject_uevent_env(kobj, > KOBJ_CHANGE, reset_done_event); > > +finish: > /* > * Note: The wake_up also serves as a memory barrier so that > * waiters see the updated value of the dev_priv->gpu_error. > @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv, > &dev_priv->gpu_error.flags)) > goto out; > > - i915_reset_and_wakeup(dev_priv); > + i915_reset_and_wakeup(dev_priv, engine_mask); ? You don't need to wakeup the struct_mutex so we don't need this after per-engine resets. Time to split up i915_reset_and_wakeup(), because we certainly shouldn't be calling intel_finish_reset() without first calling intel_prepare_reset(). Which is right here in my tree... > +/* > + * When GuC submission is enabled, GuC manages ELSP and can initiate the > + * engine reset too. For now, fall back to full GPU reset if it is enabled. > + */ > +bool intel_has_reset_engine(struct drm_i915_private *dev_priv) > +{ > + return (dev_priv->info.has_reset_engine && > + !dev_priv->guc.execbuf_client && > + i915.reset == 2); i915.reset >= 2 -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx