On Fri, May 24, 2013 at 10:03:14PM +0100, Chris Wilson wrote:
> On Fri, May 24, 2013 at 09:29:32PM +0200, Daniel Vetter wrote:
> > Chris Wilson noticed that since
> >
> > commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
> > Author: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Date: Thu Nov 15 17:17:22 2012 +0100
> >
> >     drm/i915: clear up wedged transitions
> >
> > X can again get -EIO when it does not expect it. And even worse score
> > a SIGBUS when accessing gtt mmaps. The established ABI is that we
> > _only_ return an -EIO from execbuf - all other ioctls should just
> > work. And since the reset code moves all bos out of gpu domains and
> > clears out all the last_seqno/ring tracking there really shouldn't be
> > any reason for non-execbuf code to ever touch the hw and see an -EIO.
> >
> > After some extensive discussions we've noticed that these spurious -EIO
> > are caused by i915_gem_wait_for_error:
> >
> > http://www.mail-archive.com/intel-gfx at lists.freedesktop.org/msg20540.html
> >
> > That is easy to fix by returning 0 instead of -EIO, since grabbing the
> > dev->struct_mutex does not yet mean that we actually want to touch the
> > hw. And so there is no reason at all to fail with -EIO.
> >
> > But that's not the entire story, since often (at least it's easily
> > googleable) dmesg indicates that the reset fails and we declare the
> > gpu wedged. Then, quite a bit later X wakes up with the "Timed out
> > waiting for the gpu reset to complete" DRM_ERROR message in
> > wait_for_error and brings down the desktop with an -EIO/SIGBUS.
> >
> > So clearly we're missing a wakeup somewhere, since the gpu reset just
> > doesn't take 10 seconds to complete. And indeed we do handle the
> > terminally wedged state wrong.
> >
> > Fix this all up.
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Cc: Damien Lespiau <damien.lespiau at intel.com>
> > Cc: stable at vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
>
> Definite woosh. I feel silly for missing that.
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>

Merged to -fixes, thanks for the review.

> I still think there is a risk for the non-blocking wait to return an
> EIO and papering it over is the simplest approach. The chance that
> anyone will ever hit it is minimal, and fortunately an EIO should never
> actually cause an application with adequate error handling to crash, so
> something that we can discuss at leisure.

Yeah, now that we have a less hand-wavey explanation for those -EIO we
can forget about the reset timeout until the next user screams about X
dying untimely ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
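
For anyone reading along in the archive, the behaviour described in the
commit message can be modelled roughly as below. This is only a userspace
sketch, not the patch itself: struct gpu_error_model and every function
name in it are made up for illustration, and -EAGAIN merely stands in for
the place where the real code would sleep on a wait queue. The two points
it tries to capture are that a waiter has nothing left to wait for once the
gpu is terminally wedged, and that only the execbuf path reports -EIO.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the reset state; not the driver's real structure. */
struct gpu_error_model {
	bool reset_in_progress;	/* a reset was queued and has not finished */
	bool terminally_wedged;	/* the reset failed, the gpu is declared dead */
};

/* A waiter may stop once no reset is pending or the gpu is dead for good. */
bool model_wait_may_exit(const struct gpu_error_model *e)
{
	return !e->reset_in_progress || e->terminally_wedged;
}

/*
 * Model of the wait done before taking the mutex. It returns 0 when there
 * is nothing to wait for, wedged or not, because taking the lock does not
 * yet mean touching the hw. -EAGAIN marks where real code would sleep.
 */
int model_wait_for_error(const struct gpu_error_model *e)
{
	if (model_wait_may_exit(e))
		return 0;
	return -EAGAIN;
}

/* Only the execbuf path is allowed to report a dead gpu with -EIO. */
int model_execbuf(const struct gpu_error_model *e)
{
	int ret = model_wait_for_error(e);

	if (ret)
		return ret;
	if (e->terminally_wedged)
		return -EIO;
	return 0;
}

/* Every other ioctl just works once the wait is over. */
int model_other_ioctl(const struct gpu_error_model *e)
{
	return model_wait_for_error(e);
}
```

With a wedged model, model_other_ioctl() returns 0 while model_execbuf()
returns -EIO, which is the ABI split the commit message asks for. The
missing wakeup itself is not modelled here, since that needs a real wait
queue and a second thread.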