On Wed, Jun 17, 2015 at 05:07:31PM +0200, Daniel Vetter wrote:
> On Wed, Jun 17, 2015 at 03:16:10PM +0100, Chris Wilson wrote:
> > We have gone far off topic.
> >
> > The question is how we want __i915_wait_request() to handle a wedged
> > GPU.
> >
> > It currently reports EIO, and my argument is that it is wrong wrt the
> > semantics of the wait completion and that no caller actually cares
> > about EIO from __i915_wait_request().
> >
> > * Correction: one caller cares!
> >
> > If we regard a wedged GPU (and in the short term a reset is equally
> > terminal to an outstanding request) then the GPU can no longer be
> > accessing that request and the wait can be safely completed. Imo it
> > is correct to return 0 in all circumstances. (Reset pending needs to
> > return -EAGAIN if we need to back off, but for the lockless consumers
> > we can just ignore the reset notification.)
> >
> > That is, set-domain, mmioflip and modesetting do not care if the
> > request succeeded, just that it completed.
> >
> > Throttle() has an -EIO in its ABI for reporting a wedged GPU - this
> > is used by X to detect when the GPU is unusable prior to use, e.g.
> > when waking up, and also during its periodic flushes.
> >
> > Overlay reports -EIO when turning on and hanging the GPU. To be fair,
> > it can equally report that failure the very next time it touches the
> > ring.
> >
> > Execbuf itself doesn't rely on wait request reporting EIO, just that
> > we report EIO prior to submitting work to a dead GPU/context. Execbuf
> > uses wait_request via two paths, syncing to an old request on another
> > ring and flushing requests from the ringbuffer to make room for new
> > commands. This is the tricky part, the only instance where we rely on
> > aborting after waiting but before further operation (we won't even
> > notice a dead GPU prior to starting a request and running out of
> > space otherwise). Since it is the only instance, we can move the
> > terminal detection of a dead GPU from the wait request into
> > ring_wait_for_space(). This is in keeping with the ethos that we do
> > not report -EIO until we attempt to access the GPU.
>
> Ok, following up with my side of the irc discussion we've had. I agree
> that there are only 2 places where we must report an EIO if the gpu is
> terminally wedged:
> - throttle
> - execbuf
>
> How that's done doesn't matter, and when it's racy wrt concurrent gpu
> deaths that also doesn't matter, i.e. we don't need wait_request to
> return EIO immediately as long as we check terminally_wedged somewhere
> in these ioctls.
>
> My main concern is that if we remove the EIO from wait_request we'll
> accidentally also remove the EIO from execbuf. And we've had kernels
> where the only EIO left was the wait_request from ring_begin ...

My plan has been to do an is-wedged check on allocating a request
(that's guaranteed to be the first action before writing into a ring),
and then to double-check that no events have taken place before
submitting that request (i.e. on finishing the block). To supplement
that, we do the explicit is-wedged check after waiting for ring space.

> But if we add a small igt to manually wedge the gpu through debugfs
> and then check that throttle/execbuf do EIO that risk is averted and
> I'd be ok with eating EIO from wait_request with extreme prejudice.
> Since indeed we still have trouble with EIO at least temporarily
> totally wrecking modeset ioctls and other things that really always
> should work.
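Roughly what I have in mind - a sketch only, assuming we can leave the
GPU terminally wedged by poking the i915_wedged debugfs file with
recovery disabled (e.g. booting with i915.reset=0); the paths, minor
numbers and skip handling below are placeholders, not the final igt:

/* gem_wedged_throttle.c - sketch: wedge the GPU, expect -EIO from throttle.
 * Build against libdrm (pkg-config --cflags --libs libdrm).
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <xf86drm.h>
#include <i915_drm.h>

int main(void)
{
	/* Device and debugfs minor numbers assumed to be 0. */
	int drm_fd = open("/dev/dri/card0", O_RDWR);
	int dbg_fd = open("/sys/kernel/debug/dri/0/i915_wedged", O_WRONLY);
	int ret;

	if (drm_fd < 0 || dbg_fd < 0) {
		perror("open");
		return 77; /* skip if we can't reach the device or debugfs */
	}

	/* Trigger a hang by hand; with recovery disabled the GPU stays
	 * terminally wedged instead of being reset. */
	if (write(dbg_fd, "1", 1) != 1) {
		perror("write i915_wedged");
		return 77;
	}
	close(dbg_fd);

	/* Throttle must now report the dead GPU as -EIO. */
	ret = drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_THROTTLE, NULL);
	if (ret == 0 || errno != EIO) {
		fprintf(stderr, "throttle: ret=%d errno=%d (%s), expected EIO\n",
			ret, errno, strerror(errno));
		close(drm_fd);
		return 1;
	}

	close(drm_fd);
	return 0;
}

The same wedge-then-check pattern would extend to execbuf: submit a
trivial batch after wedging and verify that it also fails with -EIO.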
Well, I've successfully wedged the GPU, broken the driver, but verified
that throttle reports -EIO. Just needs a little TLC.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre