On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>>> >> > must allow userspace to continue to use its mapping of bo in order to
>>> >> > recover its data (and in some cases in order for memory management to
>>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>>> >> > currently accessed by the GPU and so we can complete any waits or domain
>>> >> > transitions away from the GPU. Currently, we fail this essential task
>>> >> > and instead report EIO and send a SIGBUS to the affected process -
>>> >> > causing major loss of data (by killing X or compiz).
>>> >> >
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>>> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> >>
>>> >> So I've read again through the reset code and I still don't see how
>>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>>> >> -EIO eating after wait_rendering looks really suspicious to me.
>>> >
>>> > That's always been a layer of defense against the driver. Alternatively,
>>> > we can throw a warn into the wait as we shouldn't enter there with
>>> > a seqno and a wedged GPU.
>>>
>>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>>> outstanding request while the gpu is wedged (ignoring dangerous changes
>>> to the wedged state through debugfs).
>>
>> Except it is only true for a locked wait. :(
>
> Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
> least nonblocking) waits. We'd get that by simply dropping all the
> check_wedge calls from the wait functions. That would leave us with a
> check_wedge in throttle and intel_ring_begin, which I think are the
> two places we really want that.
>
>>> >> Now the other thing is i915_gem_object_wait_rendering, that thing
>>> >> loves to throw an -EIO at us. And on a quick check your patch misses
>>> >> the one in set_domain_ioctl. We probably need to do the same with
>>> >> sw_finish_ioctl. So what about an i915_mutex_lock_interruptible_no_EIO
>>> >> or similar to explicitly annotate the few places we don't want to hear
>>> >> about a dead gpu?
>>> >
>>> > In fact, it turns out that we hardly ever want
>>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>>> > that EIO was only ever returned when we tried to issue a command on the
>>> > GPU whilst it was terminally wedged - in order to preserve data
>>> > integrity.
>>>
>>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>>> userspace never has any idea that something is amiss ...
>>
>> It will catch the EIO as soon as it tries to emit the command, but
>> adding the explicit check will save a ton of work in reserving buffers.
>
> Right, I've forgotten about the check in intel_ring_begin.
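
Adding the explicit check would only be a handful of lines near the top
of execbuf anyway. A minimal sketch of what I have in mind - the exact
spot in i915_gem_do_execbuffer and the dev_priv local are from memory
rather than from the tree:

	/* Report a terminally wedged GPU upfront, before we do all the
	 * work of reserving and relocating buffers for nothing -
	 * intel_ring_begin would catch it anyway, just much later. */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return -EIO;
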
>>> Otherwise I think this approach would also make sense, and feels more
>>> future-proof. Applications that want real robustness simply need to
>>> query the kernel through the new interface whether anything ugly has
>>> happened to them. And if we want more synchronous feedback we can
>>> always improve wait/set_domain_ioctl to check whether the given bo
>>> suffered through a little calamity.
>>
>> The only explicit point we have to remember is to preserve the
>> throttle() semantics of reporting EIO when wedged.
>
> I think plugging the leak in execbuf is still good to avoid queueing
> up tons of crap (since we really don't want to do that). But I think
> with the -EIO check in intel_ring_begin we are covered.
>
>>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> > index 2bd8d7a..f243e32 100644
>>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >
>>> >  	/* GPU is already declared terminally dead, give up. */
>>> >  	if (i915_terminally_wedged(error))
>>> > -		return -EIO;
>>> > +		return 0;
>>> >
>>> >  	/*
>>> >  	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >  					       10*HZ);
>>> >  	if (ret == 0) {
>>> >  		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>>> > -		return -EIO;
>>> > -	} else if (ret < 0) {
>>> > -		return ret;
>>> > -	}
>>> > +		atomic_set(&error->reset_counter, I915_WEDGED);
>>> > +	} else if (ret > 0)
>>> > +		ret = 0;
>>>
>>> Less convinced about this hunk here. I think the right approach would
>>> be to simply kill the timeout. Unlucky users can always get out of
>>> here with a signal, and it's imo more robust if we have all relevant
>>> timeouts for the reset process in the reset work. Separate patch
>>> though. Also we now have quite good coverage of all the reset handling
>>> in igt, so I don't think we need to go overboard in defending against
>>> our own logic screw-ups ... Reset really should work nowadays.
>>
>> Apart from #63921 saying otherwise (iirc)... I think several layers of
>> paranoia over handling GPU and driver hangs are justified.
>
> The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> with our accounting. But in general I really prefer an angry user with
> a stuck-process backtrace to trying to paper over our own shortcomings
> with hacks. And at least for the gpu hang vs. pageflip/set_base we
> should now be fairly well-covered with tests. And I haven't yet seen a
> report indicating that there's another issue looming ...
>
> I guess an obnoxious WARN with a return 0; would also work for the
> timeout case. But I prefer to just ditch the timeout.

Poke ... I still think setting wedged here is a bit risky. And maybe
add a comment why wait_for_error eats the -EIO (and why
intel_ring_begin does not eat the -EIO)?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
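
PS: "Just ditch the timeout" would boil down to the hunk below - again
only a sketch from memory against the current code (i.e. without your
patch), so the context lines might not match the tree exactly:

-	ret = wait_event_interruptible_timeout(error->reset_queue,
-					       !i915_reset_in_progress(error),
-					       10*HZ);
-	if (ret == 0) {
-		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-		return -EIO;
-	} else if (ret < 0) {
-		return ret;
-	}
-
-	return 0;
+	/* All the timeouts for the reset process live in the reset work
+	 * itself, and an unlucky caller stuck here can always get out
+	 * with a signal (-ERESTARTSYS), so no need to give up after 10s. */
+	return wait_event_interruptible(error->reset_queue,
+					!i915_reset_in_progress(error));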