[PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

chris at chris-wilson.co.uk (Chris Wilson) · Wed, 1 May 2013 12:06:15 +0100

On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > If reset fails, the GPU is declared wedged. This ideally should never
> > happen, but very rarely it does. After the GPU is declared wedged, we
> > must allow userspace to continue to use its mapping of bo in order to
> > recover its data (and in some cases in order for memory management to
> > continue unabated). Obviously after the GPU is wedged, no bo are
> > currently accessed by the GPU and so we can complete any waits or domain
> > transitions away from the GPU. Currently, we fail this essential task
> > and instead report EIO and send a SIGBUS to the affected process -
> > causing major loss of data (by killing X or compiz).
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> 
> So I've read again through the reset code and I still don't see how
> wait_rendering can ever gives us -EIO once the gpu is dead. So all the
> -EIO eating after wait_rendering looks really suspicious to me.

That's always been a layer of defense against the driver. Alternatively,
throw we can throw a warn into the wait as we shouldn't enter there with
a seqno and a wedged GPU.
 
> Now the other thing is i915_gem_object_wait_
> rendering, that thing loves to throw an -EIO at us. And on a quick
> check your patch misses the one in set_domain_ioctl. We probably need
> to do the same with sw_finish_ioctl. So what about a
> i915_mutex_lock_interruptible_no_EIO or similar to explictily annotate
> the few places we don't want to hear about a dead gpu?

In fact, it turns out that we hardly ever want
i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
that EIO was only ever returned when we tried to issue a command on the
GPU whilst it was terminally wedged - in order to preserve data
integrity.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2bd8d7a..f243e32 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 
        /* GPU is already declared terminally dead, give up. */
        if (i915_terminally_wedged(error))
-               return -EIO;
+               return 0;
 
        /*
         * Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
                                               10*HZ);
        if (ret == 0) {
                DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-               return -EIO;
-       } else if (ret < 0) {
-               return ret;
-       }
+               atomic_set(&error->reset_counter, I915_WEDGED);
+       } else if (ret > 0)
+               ret = 0;
 #undef EXIT_COND
 
-       return 0;
+       return ret;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)

 
> And if the chances of us breaking bo waiting are too high we can
> always add a few crazy igts which manually wedge the gpu to test them
> and ensure they all work.

-- 
Chris Wilson, Intel Open Source Technology Centre