Re: [PATCH] drm/i915: Fix spurious -EIO/SIGBUS on wedged gpus

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Fri, 24 May 2013 22:03:14 +0100

On Fri, May 24, 2013 at 09:29:32PM +0200, Daniel Vetter wrote:
> Chris Wilson noticed that since
> 
> commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
> Author: Daniel Vetter <daniel.vetter@xxxxxxxx>
> Date:   Thu Nov 15 17:17:22 2012 +0100
> 
>     drm/i915: clear up wedged transitions
> 
> X can again get -EIO when it does not expect it. And, even worse, it
> can score a SIGBUS when accessing gtt mmaps. The established ABI is that we
> _only_ return an -EIO from execbuf - all other ioctls should just
> work. And since the reset code moves all bos out of gpu domains and
> clears out all the last_seqno/ring tracking there really shouldn't be
> any reason for non-execbuf code to ever touch the hw and see an -EIO.
> 
> After some extensive discussions we've noticed that these spurious -EIO
> are caused by i915_gem_wait_for_error:
> 
> http://www.mail-archive.com/intel-gfx@xxxxxxxxxxxxxxxxxxxxx/msg20540.html
> 
> That is easy to fix by returning 0 instead of -EIO, since grabbing the
> dev->struct_mutex does not yet mean that we actually want to touch the
> hw. And so there is no reason at all to fail with -EIO.
> 
> But that's not the entire story, since often (at least it's easily
> googleable) dmesg indicates that the reset fails and we declare the
> gpu wedged. Then, quite a bit later X wakes up with the "Timed out
> waiting for the gpu reset to complete" DRM_ERROR message in
> wait_for_error and brings down the desktop with an -EIO/SIGBUS.
> 
> So clearly we're missing a wakeup somewhere, since the gpu reset just
> doesn't take 10 seconds to complete. And indeed we do handle the
> terminally wedged state wrong.
> 
> Fix this all up.
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> Cc: Damien Lespiau <damien.lespiau@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxx>

Definite woosh. I feel silly for missing that.
Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

I still think there is a risk of the non-blocking wait returning an
EIO, and papering over it is the simplest approach. The chance that
anyone will ever hit it is minimal, and fortunately an EIO should never
actually cause an application with adequate error handling to crash, so
it is something that we can discuss at leisure.
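
For illustration only, a rough sketch of what "adequate error handling"
could look like on the userspace side. Every name below (gem_call_fn,
gem_call_checked, the fake_* helpers) is hypothetical and not taken from
the patch or from any real driver; the only borrowed idea is the
EINTR/EAGAIN restart loop, which is what libdrm's drmIoctl() does.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an ioctl-style call: returns 0 on success,
 * or -1 with errno set, following the ioctl(2) convention. */
typedef int (*gem_call_fn)(void *arg);

enum gem_result { GEM_OK, GEM_GPU_HUNG, GEM_FAILED };

/* Wrap a call so that EIO is reported to the caller instead of being
 * treated as fatal. */
static enum gem_result gem_call_checked(gem_call_fn call, void *arg)
{
	int ret;

	/* Restart if the call was interrupted, as drmIoctl() does. */
	do {
		ret = call(arg);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

	if (ret == 0)
		return GEM_OK;

	/* EIO: the gpu is hung or wedged. Let the caller pick a fallback
	 * rather than bringing the whole application down. */
	if (errno == EIO)
		return GEM_GPU_HUNG;

	return GEM_FAILED;
}

/* Fake calls, purely to exercise the wrapper without real hardware. */
static int fake_ok_call(void *arg)
{
	(void)arg;
	return 0;
}

static int fake_wedged_call(void *arg)
{
	(void)arg;
	errno = EIO;
	return -1;
}

/* Fails with EINTR *arg times, then succeeds. */
static int fake_interrupted_call(void *arg)
{
	int *left = arg;

	if (*left > 0) {
		(*left)--;
		errno = EINTR;
		return -1;
	}
	return 0;
}
```

A caller would then branch on GEM_GPU_HUNG (say, by dropping to a software
path) rather than asserting that the call succeeded. Note this only covers
ioctl return values; it does nothing for a SIGBUS on a gtt mmap.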
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre