On Wed, Jun 17, 2015 at 03:16:10PM +0100, Chris Wilson wrote: > We have gone far off topic. > > The question is how we want __i915_wait_request() to handle a wedged > GPU. > > It currently reports EIO, and my argument is that is wrong wrt to the > semantics of the wait completion and that no caller actually cares about > EIO from __i915_wait_request(). > > * Correction: one caller cares! > > If we regard a wedged GPU (and in the short term a reset is equally > terminal to an outstanding request) then the GPU can no longer be > accesing thta request and the wait can be safely completed. Imo it is > correct to return 0 in all circumstances. (Reset pending needs to return > -EAGAIN if we need to backoff, but for the lockless consumers we can > just ignore the reset notification. > > That is set-domain, mmioflip, modesetting do not care if the request > succeeded, just that it completed. > > Throttle() has an -EIO in its ABI for reporting a wedged GPU - this is > used by X to detect when the GPU is unusable prior to use, e.g. when > waking up, and also during its periodic flushes. > > Overlay reports -EIO when turning on and hanging the GPU. To be fair, it > can equally report that failure the very next time it touches the ring. > > Execbuf itself doesn't rely on wait request reporting EIO, just that we > report EIO prior to submitting work o a dead GPU/context. Execbuf uses > wait_request via two paths, syncing to an old request on another ring > and for flushing requests from the ringbuffer to make room for new > commands. This is the tricky part, the only instance where we rely on > aborting after waiting but before further operation - we won't even > notice a dead GPU prior to starting a request and running out of space > otherwise). Since it is the only instance, we can move the terminal > detection of a dead GPU from the wait request into the > ring_wait_for_space(). This is in keeping with the ethos that we do not > report -EIO until we attempt to access the GPU. Ok, following up with my side of the irc discussion we've had. I agree that there's only 2 places where we must report an EIO if the gpu is terminally wedge: - throttle - execbuf How that's done doesn't matter, and when it's racy wrt concurrent gpu deaths that also doens't matter, i.e. we don't need wait_request to EIO immediately as long as we check terminally_wedged somewhere in these ioctls. My main concern is that if we remove the EIO from wait_request we'll accidentally also remove the EIO from execbuf. And we've had kernels where the only EIO left was the wait_request from ring_begin ... But if we add a small igt to manually wedge the gpu through debugfs and then check that throttle/execbuf do EIO that risk is averted and I'd be ok with eating EIO from wait_request with extreme prejudice. Since indeed we still have trouble with EIO at least temporarily totally wreaking modeset ioclts and other things that really always should work. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx