Re: [PATCH 14/15] drm/i915/guc: Keep the execbuf client allocated across reset

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Mon, 28 Nov 2016 16:01:47 +0000

On Mon, Nov 28, 2016 at 03:44:41PM +0000, Tvrtko Ursulin wrote:
> 
> On 28/11/2016 14:11, Chris Wilson wrote:
> >On Mon, Nov 28, 2016 at 01:49:03PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 25/11/2016 09:30, Chris Wilson wrote:
> >>>In order to avoid some complexity in trying to reconstruct the
> >>>workqueues across reset, remember them instead. The issue comes when we
> >>>have to handle a reset between request allocation and submission, the
> >>>request has reserved space in the wq, but is not in any list so we fail
> >>>to restore the reserved space. By keeping the execbuf client intact
> >>>across the reset, we also keep the reservations.
> >>
> >>I lost track a bit on why do we need to reserve the space at request
> >>creation time? Is it not becoming a bit cumbersome?
> >
> >It is very, very hard to handle a failure. We have to be careful not to
> >alter global state prior to executing the request, or at least
> >submitting the request, which we are currently not. Since we can't
> >unwind the global state changes, that imposes a point-of-no-return on
> >request construction after which, the request must be submitted. (It is
> >possible though to detect when a request doesn't make any global state
> >changes and drop the request on add.) As the reservation may fail, we
> >have to do that early.
> 
> We couldn't just not do any of the ring buffer writing at execbuf
> time, just add it to the appropriate timeline and do all of that
> later, when the scheduler decides it is time to actually submit it?

It's the fence/request tracking that is the hard part, which ties into
the active tracking required to hold the GTT reservation for the request
(among many other external factors).

> >>>@@ -1484,6 +1483,9 @@ int i915_guc_submission_init(struct drm_i915_private *dev_priv)
> >>>	struct intel_guc *guc = &dev_priv->guc;
> >>>	struct i915_vma *vma;
> >>>
> >>>+	if (!HAS_GUC_SCHED(dev_priv))
> >>>+		return 0;
> >>
> >>Why did you have to add this hunk? I think this function does not
> >>get called unless there is a GuC.
> >
> >I too thought that it would not called without a guc.
> 
> But it is or what?

Yes. It was clobbering hw state on gen3-gen5 (CI had an ilk fail, I had
a pnv fail).

[snip]

> Patch itself looked fine to me. It was just complexity which made me
> wonder if we couldn't have taken a different route. But that was
> long time ago so not relevant to this patch anyway.

It's still a valid question. I think the current no-fail request
handling is the simpler approach atm, but it is almost certain that some
point we may need a new scheme (either for greater concurrency, late
changes to the requests, something even more wacky). I'd just like to
fix the last few remaining bugs in the state handling (legacy switch
context) before embarking on a new journey.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx