If reset fails, the GPU is declared wedged. This ideally should never
happen, but very rarely it does. After the GPU is declared wedged, we
must allow userspace to continue to use its mapping of bo in order to
recover its data (and in some cases in order for memory management to
continue unabated). Obviously after the GPU is wedged, no bo are
currently accessed by the GPU and so we can complete any waits or domain
transitions away from the GPU. Currently, we fail this essential task
and instead report EIO and send a SIGBUS to the affected process -
causing major loss of data (by killing X or compiz).

Fixes regression from
commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
Author: Daniel Vetter <daniel.vetter@xxxxxxxx>
Date:   Thu Nov 15 17:17:22 2012 +0100

    drm/i915: clear up wedged transitions

v2: Add comments.

References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
Cc: Damien Lespiau <damien.lespiau@xxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
---
 drivers/gpu/drm/i915/i915_gem.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 44da25e..ac05845 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -95,9 +95,17 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 	if (EXIT_COND)
 		return 0;
 
-	/* GPU is already declared terminally dead, give up. */
+	/* GPU is already declared terminally dead, nothing to wait for.
+	 * Return and let the ioctl continue. If we bail out here, then
+	 * we report EIO back to userspace (or worse SIGBUS through a
+	 * pagefault) when the caller is not necessarily interacting with
+	 * the device but is instead performing memory management. If the
+	 * application does instead want (or requires) to submit a GPU
+	 * command, then we will report the hung GPU (EIO) when we try
+	 * to acquire space on the ring.
+	 */
 	if (i915_terminally_wedged(error))
-		return -EIO;
+		return 0;
 
 	/*
 	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +117,17 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 					       10*HZ);
 	if (ret == 0) {
 		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-		return -EIO;
-	} else if (ret < 0) {
-		return ret;
-	}
+		/* The impossible happened, mark the device as terminally
+		 * wedged so that we fail quicker next time. If the reset
+		 * does eventually complete, the terminally wedged status
+		 * will be confirmed, or the counter reset.
+		 */
+		atomic_set(&error->reset_counter, I915_WEDGED);
+	} else if (ret > 0)
+		ret = 0;
 #undef EXIT_COND
 
-	return 0;
+	return ret;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)
@@ -1211,10 +1223,13 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 
 	/* Try to flush the object off the GPU without holding the lock.
 	 * We will repeat the flush holding the lock in the normal manner
-	 * to catch cases where we are gazumped.
+	 * to catch cases where we are gazumped. Also because it is unlocked,
+	 * it is possible for a spurious GPU hang to occur whilst we wait.
+	 * In that event, just continue on and see if it confirmed by the
+	 * locked wait.
 	 */
 	ret = i915_gem_object_wait_rendering__nonblocking(obj, !write_domain);
-	if (ret)
+	if (ret && ret != -EIO)
 		goto unref;
 
 	if (read_domains & I915_GEM_DOMAIN_GTT) {
-- 
1.7.10.4
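
For readers who want to trace the new return-value handling outside the kernel, the following is a rough userspace sketch of the post-patch control flow, not the driver code itself. Every name prefixed with `sketch_` or `SKETCH_` is hypothetical, the sentinel value is an arbitrary stand-in for I915_WEDGED, and the `wait_ret` parameter is assumed to model the result of the interruptible timed wait (0 on timeout, negative when interrupted, positive when the reset completed in time). The early EXIT_COND check is left out for brevity.

```c
#include <errno.h>

/* Hypothetical stand-in for the driver's I915_WEDGED sentinel. */
#define SKETCH_WEDGED (-1)

struct sketch_gpu_error {
	int reset_counter; /* SKETCH_WEDGED once the GPU is terminally dead */
};

static int sketch_terminally_wedged(const struct sketch_gpu_error *error)
{
	return error->reset_counter == SKETCH_WEDGED;
}

/*
 * Models i915_gem_wait_for_error() after the patch.
 * wait_ret: 0 = timed out, <0 = interrupted, >0 = reset completed in time.
 */
static int sketch_wait_for_error(struct sketch_gpu_error *error, long wait_ret)
{
	/* Already wedged: nothing to wait for, let the caller continue. */
	if (sketch_terminally_wedged(error))
		return 0;

	if (wait_ret == 0) {
		/* Timeout: mark wedged so later callers return immediately. */
		error->reset_counter = SKETCH_WEDGED;
		return 0;
	}
	if (wait_ret > 0)
		return 0; /* Reset finished; remaining time is not an error. */

	return (int)wait_ret; /* Interrupted: propagate the negative errno. */
}

/*
 * Models the new check in the set-domain ioctl: -EIO from the unlocked
 * wait is tolerated so that the locked wait can confirm it.
 */
static int sketch_should_abort(int ret)
{
	return ret && ret != -EIO;
}
```

Under these assumptions, only an interrupted wait still yields a non-zero return from the wait helper; a timeout or an already wedged GPU lets the caller proceed, and EIO would surface later when a command is actually submitted.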