Re: [RFC] drm/i915: Don't reset on preemptible workloads

"Bartminski, Jakub" <jakub.bartminski@xxxxxxxxx> · Thu, 2 Aug 2018 08:56:10 +0000

On Wed, 2018-08-01 at 15:22 +0100, Chris Wilson wrote:
> Quoting Jakub Bartmiński (2018-08-01 14:56:11)
> > [...]
> > +       /* We have already tried preempting, but the hardware did
> > not react */
> > +       if (engine->hangcheck.try_preempt)
> > +               return false;
> > +
> > +       active_request = i915_gem_find_active_request(engine);
> 
> No ownership here of rq. You either need a new function to return a
> reference, or careful application of rcu.

This is a fair point, I'm assuming without taking ownership here we may
end up with a nasty race?

> > [...]
> > +               /*
> > +                * If the workload is preemptible but its context
> > was closed
> > +                * we force the engine to skip its execution
> > instead.
> > +                */
> > +               if (i915_gem_context_is_closed(active_context))
> > +                       skip_seqno = true;
> 
> Nope. Even if the context is closed, its results may be depended upon
> by others; both GPU and userspace.

While there may be dependencies, we currently break these dependencies
anyway by resetting the entire engine, don't we? Couldn't we signal an
error for the skipped requests as if we reseted the engine and continue
on with the execution of the remaining requests? If we don't reset/skip
we are left with a request that potentially spins forever, with blocked
dependent requests, and since it's basically the halting problem we
have to draw the line somewhere.

> [...]
> So the gist of this is to excuse lowest priority user contexts from
> hangcheck, do we want to go one step further and ask userspace to opt
> in explicitly? 

Making it explicit opt-in was the idea here, doing it by setting the
lowest priority was just a rough way of accomplishing that.

> Though the argument that as long as its preemptible (and isolated) it
> is not acting as DoS so we don't need to enforce a timeout. That also
> suggests that we need not reset until the request is blocking others
> or userspace.

So if I'm understanding correctly you are suggesting that if we have
preemption enabled we could drop the hangcheck entirely, and just reset
if preemption fails?

> As discussed that seems reasonable, but I think this does need to
> actually trigger preemption.
> -Chris

I'm not sure what you mean right here by "actually trigger preemption",
since this patch does trigger the preemption, could you clarify?

As I've mentioned before, one other problem with this patch is that we
can't have more than one "background" request working before the first
one finishes. While this could be solved at hangcheck by putting the
background requests in the back of the background-priority queue
instead of at the front during preemption, this breaks if the
background requests are dependent on each other. These dependencies
could be resolved in the tasklet (only in the case hangcheck was called
so it wouldn't slow us down too much, they could also be resolved in
the hangcheck itself but the timeline could change in between the
hangcheck and the tasklet, which would make doing it correctly more
complicated I think).
- Jakub
Attachment:
smime.p7s

Description: S/MIME cryptographic signature
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx