On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>> >> > must allow userspace to continue to use its mapping of bo in order to
>> >> > recover its data (and in some cases in order for memory management to
>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>> >> > currently accessed by the GPU and so we can complete any waits or domain
>> >> > transitions away from the GPU. Currently, we fail this essential task
>> >> > and instead report EIO and send a SIGBUS to the affected process -
>> >> > causing major loss of data (by killing X or compiz).
>> >> >
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> >>
>> >> So I've read again through the reset code and I still don't see how
>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> >> -EIO eating after wait_rendering looks really suspicious to me.
>> >
>> > That's always been a layer of defense against the driver. Alternatively,
>> > we can throw a WARN into the wait as we shouldn't enter there with
>> > a seqno and a wedged GPU.
>>
>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>> outstanding request while the gpu is wedged (ignoring dangerous changes
>> to the wedged state through debugfs).
>
> Except it is only true for a locked wait. :(

Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
least nonblocking) waits. We'd get that by simply dropping all the
check_wedge calls from the wait functions. That would leave us with a
check_wedge in throttle and intel_ring_begin, which I think are the two
places we really want that.

>> >> Now the other thing is i915_gem_object_wait_rendering, that thing
>> >> loves to throw an -EIO at us. And on a quick check your patch misses
>> >> the one in set_domain_ioctl. We probably need to do the same with
>> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>> >> or similar to explicitly annotate the few places we don't want to
>> >> hear about a dead gpu?
>> >
>> > In fact, it turns out that we hardly ever want
>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>> > that EIO was only ever returned when we tried to issue a command on the
>> > GPU whilst it was terminally wedged - in order to preserve data
>> > integrity.
>>
>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>> userspace never has any idea that something is amiss ...
>
> It will catch the EIO as soon as it tries to emit the command, but
> adding the explicit check will save a ton of work in reserving buffers.

Right, I'd forgotten about the check in intel_ring_begin.
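To make the wait-side idea above concrete, here is roughly what I have
in mind - just a sketch, not against any particular tree: the helper
name is made up, I'm assuming the i915_reset_in_progress()/
i915_terminally_wedged() helpers from the reset_counter rework, and
I've used -EAGAIN for the retry case since that is what drmIoctl
already restarts on.

static int
wait_check_wedge_sketch(struct i915_gpu_error *error, bool locked)
{
	/* Terminally wedged: no request can still be outstanding, so
	 * complete the wait instead of reporting -EIO. A locked wait
	 * should never even get here with a live seqno, hence scream.
	 */
	if (i915_terminally_wedged(error)) {
		WARN_ON(locked);
		return 0;
	}

	/* A reset is still in flight: bounce back to userspace so the
	 * ioctl gets restarted once the reset (or wedging) has settled.
	 */
	if (i915_reset_in_progress(error))
		return -EAGAIN;

	return 0;
}

With that the only places left handing out -EIO would be throttle and
intel_ring_begin, which is where we actually want it.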
>> Otherwise I think this approach would also make sense, and feels more
>> future-proof. Applications that want real robustness simply need to
>> query the kernel through the new interface whether anything ugly has
>> happened to them. And if we want more synchronous feedback we can
>> always improve wait/set_domain_ioctl to check whether the given bo
>> suffered through a little calamity.
>
> The only explicit point we have to remember is to preserve the
> throttle() semantics of reporting EIO when wedged.

I think plugging the leak in execbuf is still good to avoid queueing up
tons of crap (since we really don't want to do that). But I think with
the -EIO check in intel_ring_begin we are covered.

>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> > index 2bd8d7a..f243e32 100644
>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >
>> >         /* GPU is already declared terminally dead, give up. */
>> >         if (i915_terminally_wedged(error))
>> > -               return -EIO;
>> > +               return 0;
>> >
>> >         /*
>> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >                                                10*HZ);
>> >         if (ret == 0) {
>> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>> > -               return -EIO;
>> > -       } else if (ret < 0) {
>> > -               return ret;
>> > -       }
>> > +               atomic_set(&error->reset_counter, I915_WEDGED);
>> > +       } else if (ret > 0)
>> > +               ret = 0;
>>
>> Less convinced about this hunk here. I think the right approach would
>> be to simply kill the timeout. Unlucky users can always get out of
>> here with a signal, and it's imo more robust if we have all relevant
>> timeouts for the reset process in the reset work. Separate patch
>> though. Also we now have quite good coverage of all the reset handling
>> in igt, so I don't think we need to go overboard in defending against
>> our own logic screw-ups ... Reset really should work nowadays.
>
> Apart from #63291 saying otherwise (iirc)... I think several layers of
> paranoia over handling GPU and driver hangs are justified.

The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
with our accounting. But in general I'd really rather have an angry user
with a stuck-process backtrace than try to paper over our own
shortcomings with hacks. And at least for the gpu hang vs.
pageflip/set_base we should now be fairly well-covered with tests. And I
haven't yet seen a report indicating that there's another issue looming
... I guess an obnoxious WARN with a return 0; would also work for the
timeout case. But I prefer to just ditch the timeout (rough sketch of
what I mean appended below).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
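Rough sketch of ditching the timeout - not against any particular tree,
and assuming the wait stays on error->reset_queue with the existing
i915_reset_in_progress()/i915_terminally_wedged() exit condition:

static int i915_gem_wait_for_error(struct i915_gpu_error *error)
{
	int ret;

	/* GPU is already declared terminally dead, give up. */
	if (i915_terminally_wedged(error))
		return 0;

	/* No timeout here: the reset worker owns all the timeouts for
	 * the recovery path, and an unlucky caller can still get out of
	 * this wait with a signal.
	 */
	ret = wait_event_interruptible(error->reset_queue,
				       !i915_reset_in_progress(error) ||
				       i915_terminally_wedged(error));
	if (ret)
		return ret;

	return 0;
}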