On Mon, Jun 08, 2015 at 06:03:23PM +0100, Tomas Elf wrote: > @@ -1239,11 +1271,17 @@ int __i915_wait_request(struct drm_i915_gem_request *req, > > /* We need to check whether any gpu reset happened in between > * the caller grabbing the seqno and now ... */ > - if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) { > + reset_in_progress = > + i915_gem_check_wedge(ring->dev->dev_private, NULL, interruptible); > + > + if ((reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) || > + reset_in_progress) { > + > /* ... but upgrade the -EAGAIN to an -EIO if the gpu > * is truely gone. */ > - ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible); > - if (ret == 0) > + if (reset_in_progress) > + ret = reset_in_progress; > + else > ret = -EAGAIN; I've had a bit more time to digest, anything that touches this piece of code makes me recoil in horror as, as it currently stands, it is already a buggy piece of code that returns an error in cases where the caller cannot possibly tolerate an error (and in circumstances where it is not actually an error at all). This leads to unfortunates WARNs and SIGBUS which I have been long arguing against and trying to get fixed for about 5 years. Obviously we need the EAGAIN in order to do the struct_mutex backoff in the caller, but there are a class of non-blocking __i915_wait_request users that can just wait until the reset is resolved. Having that support would simplify a number of cases. What scares me most though is the possiblity of an EIO being reported here, those are almost entirely wrong (if the GPU is hung, the request should be aborted and any waits automatically completed). I'd like to be sure that TDR doesn't make EIO handling any worse... -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx