> -----Original Message----- > From: Intel-gfx [mailto:intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf > Of Chris Wilson > Sent: Wednesday, November 29, 2017 4:06 PM > To: intel-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx> > Subject: [PATCH v2] drm/i915: Taint (TAINT_DIE) the kernel if the > GPU reset fails > > History tells us that if we cannot reset the GPU now, we never will. This then > impacts everything that is run subsequently. On failing the reset, we mark > the driver as wedged, trying to prevent further execution on the GPU, > forcing userspace to fallback to using the CPU to update its framebuffers and > let the user know what happened. > > We also want to go one step further and add a taint to the kernel so that any > subsequent faults can be traced back to this failure. This is important for igt, > where if the GPU/driver fails we want to reboot and restart testing rather > than continue on into oblivion. I am OK with this if the comment deciding what IGT wants to do when this state is collected is removed. I just want to make it clear that the tainting feature is separated from what that should be done when the system is tainted. > > TAINT_DIE is colloquially known as "system on fire", which seems > appropriate for unresponsive hardware. > > v2: Also taint if the recovery fails (again history shows us that is typically > fatal). > > References: https://bugs.freedesktop.org/show_bug.cgi?id=103514 > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx> > Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx> > --- > drivers/gpu/drm/i915/i915_drv.c | 28 ++++++++++++++++++++++------ > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_drv.c > b/drivers/gpu/drm/i915/i915_drv.c index 696d5cdf2779..eb90ddac7f8b > 100644 > --- a/drivers/gpu/drm/i915/i915_drv.c > +++ b/drivers/gpu/drm/i915/i915_drv.c > @@ -1897,18 +1897,21 @@ void i915_reset(struct drm_i915_private *i915, > unsigned int flags) > disable_irq(i915->drm.irq); > ret = i915_gem_reset_prepare(i915); > if (ret) { > - DRM_ERROR("GPU recovery failed\n"); > + dev_err(i915->drm.dev, "GPU recovery > failed\n"); > intel_gpu_reset(i915, ALL_ENGINES); > - goto error; > + goto taint; > } > > ret = intel_gpu_reset(i915, ALL_ENGINES); > if (ret) { > - if (ret != -ENODEV) > - DRM_ERROR("Failed to reset chip: > %i\n", ret); > - else > + if (ret != -ENODEV) { > + dev_err(i915->drm.dev, > + "Failed to reset > chip: %i\n", ret); > + goto taint; > + } else { > DRM_DEBUG_DRIVER("GPU reset > disabled\n"); > - goto error; > + goto error; > + } > } > > i915_gem_reset(i915); > @@ -1951,6 +1954,19 @@ void i915_reset(struct drm_i915_private *i915, > unsigned int flags) > wake_up_bit(&error->flags, I915_RESET_HANDOFF); > return; > > +taint: > + /* > + * History tells us that if we cannot reset the GPU now, we > + * never will. This then impacts everything that is run > + * subsequently. On failing the reset, we mark the driver > + * as wedged, preventing further execution on the GPU. > + * We also want to go one step further and add a taint to the > + * kernel so that any subsequent faults can be traced back to > + * this failure. This is important for igt, where if the > + * GPU/driver fails we want to reboot and restart testing > + * rather than continue on into oblivion. > + */ > + add_taint(TAINT_DIE, LOCKDEP_STILL_OK); > error: > i915_gem_set_wedged(i915); > i915_gem_retire_requests(i915); > -- > 2.15.0 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@xxxxxxxxxxxxxxxxxxxxx > https://lists.freedesktop.org/mailman/listinfo/intel-gfx _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx