Re: [PATCH v3] drm/i915/execlists: Reclaim the hanging virtual request

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 21 Jan 2020 17:43:37 +0000

On 21/01/2020 17:32, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2020-01-21 17:19:52)

On 21/01/2020 14:07, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2020-01-21 13:55:29)

On 21/01/2020 13:04, Chris Wilson wrote:
+             GEM_BUG_ON(!reset_in_progress(&engine->execlists));
+
+             /*
+              * An unsubmitted request along a virtual engine will
+              * remain on the active (this) engine until we are able
+              * to process the context switch away (and so mark the
+              * context as no longer in flight). That cannot have happened
+              * yet, otherwise we would not be hanging!
+              */
+             spin_lock_irqsave(&ve->base.active.lock, flags);
+             GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
+             GEM_BUG_ON(ve->request != rq);
+             ve->request = NULL;
+             spin_unlock_irqrestore(&ve->base.active.lock, flags);
+
+             rq->engine = engine;

Lets see I understand this... tasklet has been disabled and ring paused.
But we find an uncompleted request in the ELSP context, with rq->engine
== virtual engine. Therefore this cannot be the first request on this
timeline but has to be later.

Not quite.

engine->execlists.active[] tracks the HW, it get's updated only upon
receiving HW acks (or we reset).

So if execlists_active()->engine == virtual, it can only mean that the
inflight _hanging_ request has already been unsubmitted by an earlier
preemption in execlists_dequeue(), but that preemption has not yet been
processed by the HW. (Hence the preemption-reset underway.)

Now while we coalesce the requests for a context into a single ELSP[]
slot, and only record the last request submitted for a context, we have
to walk back along that context's timeline to find the earliest
incomplete request and blame the hang upon it.

For a virtual engine, it's much simpler as there is only ever one
request in flight, but I don't think that has any impact here other
than that we only need to repair the single unsubmitted request that was
returned to the virtual engine.

One which has been put on the runqueue but
not yet submitted to hw. (Because one at a time.) Or it has been
unsubmitted by __unwind_incomplete_request already. In the former case
why move it to the physical engine? Also in the latter actually, it
would overwrite rq->engine with the physical one.

Yes. For incomplete preemption event, the request is *still* on this
engine and has not been released (rq->context->inflight == engine, so it
cannot be submitted to any other engine, until after we acknowledge the
context has been saved and is no longer being accessed by HW.) It is
legal for us to process the hanging request along this engine; we have a
suboptimal decision to return the request to the same engine after the
reset, but since we have replaced the hanging payload, the request is a
mere signaling placeholder (and I do not think will overly burden the
system and negatively impact other virtual engines).

What if the request in elsp actually completed in the meantime eg.
preemption timeout was a false positive?

In execlists_capture we do:

         cap->rq = execlists_active(&engine->execlists);

This gets a completed request, then we do:

         cap->rq = active_request(cap->rq->context->timeline, cap->rq);

This walks along the virtual timeline and finds a next virtual request.
It then binds this request to a physical engine and sets ve->request to
NULL.

If we miss the completion event, then active_request() returns the
original request and we blame it for a having a 650ms preemption-off
shader with a 640ms preemption timeout.

I am thinking of this sequence of interleaved events:

	preempt_timeout
				tasklet_disable
				ring_pause
				execlist_active
	seqno write visible
				active_request - walks the tl to next
				execlist_hold
				schedule_worker
				tasklet_enable
	process_csb completed

This is not possible? Seqno write happening needs only to be roughly 
there since as long as tasklet has been disabled execlist->active 
remains fixed.

Then on unhold ve->submit_notify is called which sets ve->request to
this request but the rq->engine points to the physical engine.

We don't call ve->submit_notify() on unhold, we put it back into our
local priority queue. Keeping ownership of the request on the local
engine seemed to the easiest way to keep track of the locking, and
re-submitting the guilty request on the same engine should not be an
issue.

True, I am jumping between different things and have confused this bit.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx