Re: [PATCH v2] drm/i915: Taint (TAINT_DIE) the kernel if the GPU reset fails

Quoting Joonas Lahtinen (2017-12-04 13:41:11)
> On Wed, 2017-11-29 at 14:05 +0000, Chris Wilson wrote:
> > History tells us that if we cannot reset the GPU now, we never will. This
> > then impacts everything that is run subsequently. On failing the reset,
> > we mark the driver as wedged, trying to prevent further execution on the
> > GPU, forcing userspace to fall back to using the CPU to update its
> > framebuffers and let the user know what happened.
> > 
> > We also want to go one step further and add a taint to the kernel so that
> > any subsequent faults can be traced back to this failure. This is
> > important for igt, where if the GPU/driver fails we want to reboot and
> > restart testing rather than continue on into oblivion.
> > 
> > TAINT_DIE is colloquially known as "system on fire", which seems
> > appropriate for unresponsive hardware.
> > 
> > v2: Also taint if the recovery fails (again history shows us that is
> > typically fatal).
> > 
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=103514
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> > Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx>
> 
> <SNIP>
> 
> > @@ -1951,6 +1954,19 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
> >       wake_up_bit(&error->flags, I915_RESET_HANDOFF);
> >       return;
> >  
> > +taint:
> > +     /*
> > +      * History tells us that if we cannot reset the GPU now, we
> > +      * never will. This then impacts everything that is run
> > +      * subsequently. On failing the reset, we mark the driver
> > +      * as wedged, preventing further execution on the GPU.
> > +      * We also want to go one step further and add a taint to the
> > +      * kernel so that any subsequent faults can be traced back to
> > +      * this failure. This is important for igt, where if the
> > +      * GPU/driver fails we want to reboot and restart testing
> > +      * rather than continue on into oblivion.
> > +      */
> 
> As Marta mentioned too, how igt works on a given day is a bit too
> volatile to document in the kernel comments.

More to the point, CI now implements the described response to
TAINT_DIE, without which this patch would be pointless (userspace sees
the wedged state and either handles it or dies; CI sees the wedged
state as a challenge).
-Chris
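
For readers unfamiliar with how a test runner could act on the taint,
here is a minimal userspace sketch. It is not igt's or CI's actual
implementation. It assumes the taint mask is read as a single decimal
number from /proc/sys/kernel/tainted and that TAINT_DIE is bit 7
(value 128) of that mask. The helper names taint_has_die(),
read_taint_mask() and ci_should_reboot() are hypothetical, and
treating an unreadable file as "not tainted" is an arbitrary choice
made for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Bit position of TAINT_DIE in the kernel taint mask (bit 7, value 128). */
#define TAINT_DIE_BIT 7

/* Return true if the given taint mask has the TAINT_DIE bit set. */
static bool taint_has_die(unsigned long mask)
{
	return (mask >> TAINT_DIE_BIT) & 1UL;
}

/*
 * Read a taint mask from a file holding one decimal number, such as
 * /proc/sys/kernel/tainted. Returns 0 on success, -1 on failure.
 */
static int read_taint_mask(const char *path, unsigned long *mask)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = fscanf(f, "%lu", mask) == 1;
	fclose(f);

	return ok ? 0 : -1;
}

/*
 * Hypothetical policy hook for a test runner: reboot before running
 * further tests if the kernel carries TAINT_DIE.
 */
static bool ci_should_reboot(const char *path)
{
	unsigned long mask;

	/* Assumption: if the mask cannot be read, treat it as untainted. */
	if (read_taint_mask(path, &mask))
		return false;

	return taint_has_die(mask);
}
```

A runner could call ci_should_reboot("/proc/sys/kernel/tainted")
between tests and restart the machine when it returns true, instead of
continuing on a wedged GPU.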
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



