On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>> >> > must allow userspace to continue to use its mapping of bo in order to
>> >> > recover its data (and in some cases in order for memory management to
>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>> >> > currently accessed by the GPU and so we can complete any waits or domain
>> >> > transitions away from the GPU. Currently, we fail this essential task
>> >> > and instead report EIO and send a SIGBUS to the affected process -
>> >> > causing major loss of data (by killing X or compiz).
>> >> >
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> >>
>> >> So I've read again through the reset code and I still don't see how
>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> >> -EIO eating after wait_rendering looks really suspicious to me.
>> >
>> > That's always been a layer of defense against the driver. Alternatively,
>> > we can throw a WARN into the wait as we shouldn't enter there with
>> > a seqno and a wedged GPU.
>>
>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>> outstanding request while the gpu is wedged (ignoring dangerous changes
>> to the wedged state through debugfs).
>
> Except it is only true for a locked wait. :(

Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
least nonblocking) waits. We'd get that by simply dropping all the
check_wedge calls from the wait functions. That would leave us with a
check_wedge in throttle and intel_ring_begin, which I think are the two
places we really want that.

>> >> Now the other thing is i915_gem_object_wait_rendering, that thing
>> >> loves to throw an -EIO at us. And on a quick check your patch misses
>> >> the one in set_domain_ioctl. We probably need to do the same with
>> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>> >> or similar to explicitly annotate the few places we don't want to
>> >> hear about a dead gpu?
>> >
>> > In fact, it turns out that we hardly ever want
>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>> > that EIO was only ever returned when we tried to issue a command on the
>> > GPU whilst it was terminally wedged - in order to preserve data
>> > integrity.
>>
>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>> userspace never has any idea that something is amiss ...
>
> It will catch the EIO as soon as it tries to emit the command, but
> adding the explicit check will save a ton of work in reserving buffers.

Right, I'd forgotten about the check in intel_ring_begin.
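To make the wait-side idea above concrete, here is roughly what I have
in mind - just a sketch, not against any particular tree: the helper
name is made up, I'm assuming the i915_reset_in_progress()/
i915_terminally_wedged() helpers from the reset_counter rework, and
I've used -EAGAIN for the retry case since that is what drmIoctl
already restarts on.

static int
wait_check_wedge_sketch(struct i915_gpu_error *error, bool locked)
{
	/* Terminally wedged: no request can still be outstanding, so
	 * complete the wait instead of reporting -EIO. A locked wait
	 * should never even get here with a live seqno, hence scream.
	 */
	if (i915_terminally_wedged(error)) {
		WARN_ON(locked);
		return 0;
	}

	/* A reset is still in flight: bounce back to userspace so the
	 * ioctl gets restarted once the reset (or wedging) has settled.
	 */
	if (i915_reset_in_progress(error))
		return -EAGAIN;

	return 0;
}

With that the only places left handing out -EIO would be throttle and
intel_ring_begin, which is where we actually want it.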
>> Otherwise I think this approach would also make sense, and feels more
>> future-proof. Applications that want real robustness simply need to
>> query the kernel through the new interface whether anything ugly has
>> happened to them. And if we want more synchronous feedback we can
>> always improve wait/set_domain_ioctl to check whether the given bo
>> suffered through a little calamity.
>
> The only explicit point we have to remember is to preserve the
> throttle() semantics of reporting EIO when wedged.

I think plugging the leak in execbuf is still good to avoid queueing up
tons of crap (since we really don't want to do that). But I think with
the -EIO check in intel_ring_begin we are covered.

>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> > index 2bd8d7a..f243e32 100644
>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >
>> >         /* GPU is already declared terminally dead, give up. */
>> >         if (i915_terminally_wedged(error))
>> > -               return -EIO;
>> > +               return 0;
>> >
>> >         /*
>> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >                                                10*HZ);
>> >         if (ret == 0) {
>> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>> > -               return -EIO;
>> > -       } else if (ret < 0) {
>> > -               return ret;
>> > -       }
>> > +               atomic_set(&error->reset_counter, I915_WEDGED);
>> > +       } else if (ret > 0)
>> > +               ret = 0;
>>
>> Less convinced about this hunk here. I think the right approach would
>> be to simply kill the timeout. Unlucky users can always get out of
>> here with a signal, and it's imo more robust if we have all relevant
>> timeouts for the reset process in the reset work. Separate patch
>> though. Also we now have quite good coverage of all the reset handling
>> in igt, so I don't think we need to go overboard in defending against
>> our own logic screw-ups ... Reset really should work nowadays.
>
> Apart from #63291 saying otherwise (iirc)... I think several layers of
> paranoia over handling GPU and driver hangs are justified.

The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
with our accounting. But in general I'd really rather have an angry user
with a stuck-process backtrace than try to paper over our own
shortcomings with hacks. And at least for the gpu hang vs.
pageflip/set_base we should now be fairly well-covered with tests. And I
haven't yet seen a report indicating that there's another issue looming
... I guess an obnoxious WARN with a return 0; would also work for the
timeout case. But I prefer to just ditch the timeout (rough sketch of
what I mean appended below).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
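Rough sketch of ditching the timeout - not against any particular tree,
and assuming the wait stays on error->reset_queue with the existing
i915_reset_in_progress()/i915_terminally_wedged() exit condition:

static int i915_gem_wait_for_error(struct i915_gpu_error *error)
{
	int ret;

	/* GPU is already declared terminally dead, give up. */
	if (i915_terminally_wedged(error))
		return 0;

	/* No timeout here: the reset worker owns all the timeouts for
	 * the recovery path, and an unlucky caller can still get out of
	 * this wait with a signal.
	 */
	ret = wait_event_interruptible(error->reset_queue,
				       !i915_reset_in_progress(error) ||
				       i915_terminally_wedged(error));
	if (ret)
		return ret;

	return 0;
}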