On Wed, Jun 17, 2015 at 05:07:31PM +0200, Daniel Vetter wrote:
> On Wed, Jun 17, 2015 at 03:16:10PM +0100, Chris Wilson wrote:
> > We have gone far off topic.
> >
> > The question is how we want __i915_wait_request() to handle a wedged
> > GPU.
> >
> > It currently reports EIO, and my argument is that it is wrong wrt the
> > semantics of the wait completion and that no caller actually cares
> > about EIO from __i915_wait_request().
> >
> > * Correction: one caller cares!
> >
> > If we regard a wedged GPU (and in the short term a reset is equally
> > terminal to an outstanding request) then the GPU can no longer be
> > accessing that request and the wait can be safely completed. Imo it
> > is correct to return 0 in all circumstances. (Reset pending needs to
> > return -EAGAIN if we need to back off, but for the lockless consumers
> > we can just ignore the reset notification.)
> >
> > That is, set-domain, mmioflip and modesetting do not care if the
> > request succeeded, just that it completed.
> >
> > Throttle() has an -EIO in its ABI for reporting a wedged GPU - this
> > is used by X to detect when the GPU is unusable prior to use, e.g.
> > when waking up, and also during its periodic flushes.
> >
> > Overlay reports -EIO when turning on and hanging the GPU. To be fair,
> > it can equally report that failure the very next time it touches the
> > ring.
> >
> > Execbuf itself doesn't rely on wait request reporting EIO, just that
> > we report EIO prior to submitting work to a dead GPU/context. Execbuf
> > uses wait_request via two paths, syncing to an old request on another
> > ring and flushing requests from the ringbuffer to make room for new
> > commands. This is the tricky part, the only instance where we rely on
> > aborting after waiting but before further operation (we won't even
> > notice a dead GPU prior to starting a request and running out of
> > space otherwise). Since it is the only instance, we can move the
> > terminal detection of a dead GPU from the wait request into
> > ring_wait_for_space(). This is in keeping with the ethos that we do
> > not report -EIO until we attempt to access the GPU.
>
> Ok, following up with my side of the irc discussion we've had. I agree
> that there are only 2 places where we must report an EIO if the gpu is
> terminally wedged:
> - throttle
> - execbuf
>
> How that's done doesn't matter, and when it's racy wrt concurrent gpu
> deaths that also doesn't matter, i.e. we don't need wait_request to
> return EIO immediately as long as we check terminally_wedged somewhere
> in these ioctls.
>
> My main concern is that if we remove the EIO from wait_request we'll
> accidentally also remove the EIO from execbuf. And we've had kernels
> where the only EIO left was the wait_request from ring_begin ...

My plan has been to do an is-wedged check on allocating a request
(that's guaranteed to be the first action before writing into a ring),
and then to double-check that no events have taken place before
submitting that request (i.e. on finishing the block). To supplement
that, we do the explicit is-wedged check after waiting for ring space.

> But if we add a small igt to manually wedge the gpu through debugfs
> and then check that throttle/execbuf do EIO that risk is averted and
> I'd be ok with eating EIO from wait_request with extreme prejudice.
> Since indeed we still have trouble with EIO at least temporarily
> totally wrecking modeset ioctls and other things that really always
> should work.
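Roughly what I have in mind - a sketch only, assuming we can leave the
GPU terminally wedged by poking the i915_wedged debugfs file with
recovery disabled (e.g. booting with i915.reset=0); the paths, minor
numbers and skip handling below are placeholders, not the final igt:

/* gem_wedged_throttle.c - sketch: wedge the GPU, expect -EIO from throttle.
 * Build against libdrm (pkg-config --cflags --libs libdrm).
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <xf86drm.h>
#include <i915_drm.h>

int main(void)
{
	/* Device and debugfs minor numbers assumed to be 0. */
	int drm_fd = open("/dev/dri/card0", O_RDWR);
	int dbg_fd = open("/sys/kernel/debug/dri/0/i915_wedged", O_WRONLY);
	int ret;

	if (drm_fd < 0 || dbg_fd < 0) {
		perror("open");
		return 77; /* skip if we can't reach the device or debugfs */
	}

	/* Trigger a hang by hand; with recovery disabled the GPU stays
	 * terminally wedged instead of being reset. */
	if (write(dbg_fd, "1", 1) != 1) {
		perror("write i915_wedged");
		return 77;
	}
	close(dbg_fd);

	/* Throttle must now report the dead GPU as -EIO. */
	ret = drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_THROTTLE, NULL);
	if (ret == 0 || errno != EIO) {
		fprintf(stderr, "throttle: ret=%d errno=%d (%s), expected EIO\n",
			ret, errno, strerror(errno));
		close(drm_fd);
		return 1;
	}

	close(drm_fd);
	return 0;
}

The same wedge-then-check pattern would extend to execbuf: submit a
trivial batch after wedging and verify that it also fails with -EIO.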
Well, I've successfully wedged the GPU, broken the driver, but verified
that throttle reports -EIO. Just needs a little TLC.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre