On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote: > On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote: > > If reset fails, the GPU is declared wedged. This ideally should never > > happen, but very rarely it does. After the GPU is declared wedged, we > > must allow userspace to continue to use its mapping of bo in order to > > recover its data (and in some cases in order for memory management to > > continue unabated). Obviously after the GPU is wedged, no bo are > > currently accessed by the GPU and so we can complete any waits or domain > > transitions away from the GPU. Currently, we fail this essential task > > and instead report EIO and send a SIGBUS to the affected process - > > causing major loss of data (by killing X or compiz). > > > > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921 > > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073 > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk> > > So I've read again through the reset code and I still don't see how > wait_rendering can ever gives us -EIO once the gpu is dead. So all the > -EIO eating after wait_rendering looks really suspicious to me. That's always been a layer of defense against the driver. Alternatively, throw we can throw a warn into the wait as we shouldn't enter there with a seqno and a wedged GPU. > Now the other thing is i915_gem_object_wait_ > rendering, that thing loves to throw an -EIO at us. And on a quick > check your patch misses the one in set_domain_ioctl. We probably need > to do the same with sw_finish_ioctl. So what about a > i915_mutex_lock_interruptible_no_EIO or similar to explictily annotate > the few places we don't want to hear about a dead gpu? In fact, it turns out that we hardly ever want i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was that EIO was only ever returned when we tried to issue a command on the GPU whilst it was terminally wedged - in order to preserve data integrity. diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 2bd8d7a..f243e32 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error) /* GPU is already declared terminally dead, give up. */ if (i915_terminally_wedged(error)) - return -EIO; + return 0; /* * Only wait 10 seconds for the gpu reset to complete to avoid hanging @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error) 10*HZ); if (ret == 0) { DRM_ERROR("Timed out waiting for the gpu reset to complete\n"); - return -EIO; - } else if (ret < 0) { - return ret; - } + atomic_set(&error->reset_counter, I915_WEDGED); + } else if (ret > 0) + ret = 0; #undef EXIT_COND - return 0; + return ret; } int i915_mutex_lock_interruptible(struct drm_device *dev) > And if the chances of us breaking bo waiting are too high we can > always add a few crazy igts which manually wedge the gpu to test them > and ensure they all work. -- Chris Wilson, Intel Open Source Technology Centre