On 21/01/2020 10:09, Chris Wilson wrote:
If we encounter a hang on a virtual engine, by the time we process the
hang the request may already have been moved back to the virtual engine
(we process the hang on the physical engine). We need to reclaim the
request from the virtual engine so that the locking is consistent and
local to the real engine on which we will hold the request for error
state capturing.
Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
Testcase: igt/gem_exec_balancer/hang
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
---
drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 3a30767ff0c4..a0acf1898c1e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine)
cap->rq = active_request(cap->rq->context->timeline, cap->rq);
GEM_BUG_ON(!cap->rq);
+ if (cap->rq->engine != engine) { /* preempted virtual engine */
+ struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
+ unsigned long flags;
+
+ /*
+ * An unsubmitted request along a virtual engine will
+ * remain on the active (this) engine until we are able
+ * to process the context switch away (and so mark the
+ * context as no longer in flight). That cannot have happened
+ * yet, otherwise we would not be hanging!
+ */
+ spin_lock_irqsave(&ve->base.active.lock, flags);
+ GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
+ GEM_BUG_ON(ve->request != cap->rq);
+ ve->request = NULL;
+ spin_unlock_irqrestore(&ve->base.active.lock, flags);
+
+ cap->rq->engine = engine;
+ }
Conceptually I think this belongs in execlists_hold, not capture, since
hold is what carries the responsibility for holding the request
correctly (see the sketch below).

Then also, this is all about a race with __unwind_incomplete_requests,
yes? If so, it would need to be done under the engine->active.lock, or
the unwind can still sneak in between now and execlists_hold.
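
Something along these lines is what I have in mind, purely as a sketch:
the execlists_hold() shape and the __execlists_hold() helper here are
from memory, so treat the surroundings as an assumption; the reclaim
block in the middle is the point.

static void execlists_hold(struct intel_engine_cs *engine,
			   struct i915_request *rq)
{
	spin_lock_irq(&engine->active.lock);

	if (rq->engine != engine) { /* preempted from a virtual engine */
		struct virtual_engine *ve = to_virtual_engine(rq->engine);

		/*
		 * Reclaim the request for the physical engine inside the
		 * same engine->active.lock critical section as the hold,
		 * so that __unwind_incomplete_requests() cannot move it
		 * back to the virtual engine in between.
		 */
		spin_lock(&ve->base.active.lock);
		GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
		GEM_BUG_ON(ve->request != rq);
		ve->request = NULL;
		spin_unlock(&ve->base.active.lock);

		rq->engine = engine;
	}

	__execlists_hold(rq);

	spin_unlock_irq(&engine->active.lock);
}

That way engine->active.lock is held across both the reclaim and the
hold, so the unwind cannot run in the window between them.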
Regards,
Tvrtko
+
/*
* Remove the request from the execlists queue, and take ownership
* of the request. We pass it to our worker who will _slowly_ compress
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx