On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>>> >> > must allow userspace to continue to use its mapping of bo in order to
>>> >> > recover its data (and in some cases in order for memory management to
>>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>>> >> > currently accessed by the GPU and so we can complete any waits or domain
>>> >> > transitions away from the GPU. Currently, we fail this essential task
>>> >> > and instead report EIO and send a SIGBUS to the affected process -
>>> >> > causing major loss of data (by killing X or compiz).
>>> >> >
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>>> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> >>
>>> >> So I've read again through the reset code and I still don't see how
>>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>>> >> -EIO eating after wait_rendering looks really suspicious to me.
>>> >
>>> > That's always been a layer of defense against the driver. Alternatively,
>>> > we can throw a warn into the wait as we shouldn't enter there with
>>> > a seqno and a wedged GPU.
>>>
>>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>>> outstanding request while the gpu is wedged (ignoring dangerous changes
>>> to the wedged state through debugfs).
>>
>> Except it is only true for a locked wait. :(
>
> Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
> least nonblocking) waits. We'd get that by simply dropping all the
> check_wedge calls from the wait functions. That would leave us with a
> check_wedge in throttle and intel_ring_begin, which I think are the
> two places we really want that.
>
>>> >> Now the other thing is i915_gem_object_wait_rendering, that thing
>>> >> loves to throw an -EIO at us. And on a quick check your patch misses
>>> >> the one in set_domain_ioctl. We probably need to do the same with
>>> >> sw_finish_ioctl. So what about an i915_mutex_lock_interruptible_no_EIO
>>> >> or similar to explicitly annotate the few places we don't want to hear
>>> >> about a dead gpu?
>>> >
>>> > In fact, it turns out that we hardly ever want
>>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>>> > that EIO was only ever returned when we tried to issue a command on the
>>> > GPU whilst it was terminally wedged - in order to preserve data
>>> > integrity.
>>>
>>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>>> userspace never has any idea that something is amiss ...
>>
>> It will catch the EIO as soon as it tries to emit the command, but
>> adding the explicit check will save a ton of work in reserving buffers.
>
> Right, I've forgotten about the check in intel_ring_begin.
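
Adding the explicit check would only be a handful of lines near the top
of execbuf anyway. A minimal sketch of what I have in mind - the exact
spot in i915_gem_do_execbuffer and the dev_priv local are from memory
rather than from the tree:

	/* Report a terminally wedged GPU upfront, before we do all the
	 * work of reserving and relocating buffers for nothing -
	 * intel_ring_begin would catch it anyway, just much later. */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return -EIO;
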
>>> Otherwise I think this approach would also make sense, and feels more
>>> future-proof. Applications that want real robustness simply need to
>>> query the kernel through the new interface whether anything ugly has
>>> happened to them. And if we want more synchronous feedback we can
>>> always improve wait/set_domain_ioctl to check whether the given bo
>>> suffered through a little calamity.
>>
>> The only explicit point we have to remember is to preserve the
>> throttle() semantics of reporting EIO when wedged.
>
> I think plugging the leak in execbuf is still good to avoid queueing
> up tons of crap (since we really don't want to do that). But I think
> with the -EIO check in intel_ring_begin we are covered.
>
>>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> > index 2bd8d7a..f243e32 100644
>>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >
>>> >  	/* GPU is already declared terminally dead, give up. */
>>> >  	if (i915_terminally_wedged(error))
>>> > -		return -EIO;
>>> > +		return 0;
>>> >
>>> >  	/*
>>> >  	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >  					       10*HZ);
>>> >  	if (ret == 0) {
>>> >  		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>>> > -		return -EIO;
>>> > -	} else if (ret < 0) {
>>> > -		return ret;
>>> > -	}
>>> > +		atomic_set(&error->reset_counter, I915_WEDGED);
>>> > +	} else if (ret > 0)
>>> > +		ret = 0;
>>>
>>> Less convinced about this hunk here. I think the right approach would
>>> be to simply kill the timeout. Unlucky users can always get out of
>>> here with a signal, and it's imo more robust if we have all relevant
>>> timeouts for the reset process in the reset work. Separate patch
>>> though. Also we now have quite good coverage of all the reset handling
>>> in igt, so I don't think we need to go overboard in defending against
>>> our own logic screw-ups ... Reset really should work nowadays.
>>
>> Apart from #63921 saying otherwise (iirc)... I think several layers of
>> paranoia over handling GPU and driver hangs are justified.
>
> The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> with our accounting. But in general I really prefer an angry user with
> a stuck-process backtrace to trying to paper over our own shortcomings
> with hacks. And at least for the gpu hang vs. pageflip/set_base we
> should now be fairly well-covered with tests. And I haven't yet seen a
> report indicating that there's another issue looming ...
>
> I guess an obnoxious WARN with a return 0; would also work for the
> timeout case. But I prefer to just ditch the timeout.

Poke ... I still think setting wedged here is a bit risky. And maybe
add a comment why wait_for_error eats the -EIO (and why
intel_ring_begin does not eat the -EIO)?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
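
PS: "Just ditch the timeout" would boil down to the hunk below - again
only a sketch from memory against the current code (i.e. without your
patch), so the context lines might not match the tree exactly:

-	ret = wait_event_interruptible_timeout(error->reset_queue,
-					       !i915_reset_in_progress(error),
-					       10*HZ);
-	if (ret == 0) {
-		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-		return -EIO;
-	} else if (ret < 0) {
-		return ret;
-	}
-
-	return 0;
+	/* All the timeouts for the reset process live in the reset work
+	 * itself, and an unlucky caller stuck here can always get out
+	 * with a signal (-ERESTARTSYS), so no need to give up after 10s. */
+	return wait_event_interruptible(error->reset_queue,
+					!i915_reset_in_progress(error));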