On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> If reset fails, the GPU is declared wedged. This ideally should never
> happen, but very rarely it does. After the GPU is declared wedged, we
> must allow userspace to continue to use its mapping of bo in order to
> recover its data (and in some cases in order for memory management to
> continue unabated). Obviously after the GPU is wedged, no bo are
> currently accessed by the GPU and so we can complete any waits or domain
> transitions away from the GPU. Currently, we fail this essential task
> and instead report EIO and send a SIGBUS to the affected process -
> causing major loss of data (by killing X or compiz).
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>

So I've read through the reset code again and I still don't see how
wait_rendering can ever give us -EIO once the gpu is dead. So all the
-EIO eating after wait_rendering looks really suspicious to me.

Now the other thing is i915_gem_object_wait_rendering, that thing loves
to throw an -EIO at us. And on a quick check your patch misses the one
in set_domain_ioctl. We probably need to do the same with
sw_finish_ioctl. So what about an i915_mutex_lock_interruptible_no_EIO
or similar to explicitly annotate the few places where we don't want to
hear about a dead gpu?

And if the chances of us breaking bo waiting are too high we can always
add a few crazy igts which manually wedge the gpu to test them and
ensure they all work.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
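
To make the _no_EIO idea concrete, here is a rough userspace mock of what such a wrapper might look like. Nothing in it is real i915 code: struct mock_gpu, mock_wait_for_error, mock_take_lock and both lock functions are hypothetical names invented for illustration, and the assumption that the normal lock path checks for a wedged gpu before taking the lock is a simplification of this mock, not a statement about the driver.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for driver state; not the real i915 structures. */
struct mock_gpu {
    bool wedged;      /* set once a reset has failed */
    bool interrupted; /* simulates a pending signal for the caller */
    int lock_depth;   /* incremented each time the mock lock is taken */
};

/* Assumption of this mock: a wedged gpu is reported as -EIO. */
static int mock_wait_for_error(const struct mock_gpu *gpu)
{
    return gpu->wedged ? -EIO : 0;
}

/* Simulated interruptible lock: fails with -EINTR if a signal is pending. */
static int mock_take_lock(struct mock_gpu *gpu)
{
    if (gpu->interrupted)
        return -EINTR;
    gpu->lock_depth++;
    return 0;
}

/* Normal path: any error, including -EIO, is passed to the caller. */
static int mock_lock_interruptible(struct mock_gpu *gpu)
{
    int ret = mock_wait_for_error(gpu);

    if (ret)
        return ret;
    return mock_take_lock(gpu);
}

/*
 * Annotated path for the few callers that must keep working on a dead
 * gpu: -EIO is swallowed, every other error is still reported.
 */
static int mock_lock_interruptible_no_eio(struct mock_gpu *gpu)
{
    int ret = mock_wait_for_error(gpu);

    if (ret && ret != -EIO)
        return ret;
    return mock_take_lock(gpu);
}
```

The point of a separate entry point rather than scattered "if (ret == -EIO) ret = 0" checks is that the callers which tolerate a dead gpu become greppable, while interruption by a signal is still reported as usual.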