Re: [PATCH 04/19] drm/i915/gt: Unlock engine-pm after queuing the kernel context switch

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 19 Nov 2019 14:35:04 +0000

On 18/11/2019 23:02, Chris Wilson wrote:
In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
the backend"), I erroneously concluded that we last modify the engine
inside __i915_request_commit() meaning that we could enable concurrent
submission for userspace as we enqueued this request. However, this
falls into a trap with other users of the engine->kernel_context waking
up and submitting their request before the idle-switch is queued, with
the result that the kernel_context is executed out-of-sequence most
likely upsetting the GPU and certainly ourselves when we try to retire
the out-of-sequence requests.

As such we need to hold onto the effective engine->kernel_context mutex
lock (via the engine pm mutex proxy) until we have finish queuing the
request to the engine.

v2: Serialise against concurrent intel_gt_retire_requests()

Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
---
  drivers/gpu/drm/i915/gt/intel_engine_pm.c | 15 +++++++++------
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 3c0f490ff2c7..722d3eec226e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -75,6 +75,7 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
  
  static bool switch_to_kernel_context(struct intel_engine_cs *engine)
  {
+	struct intel_context *ce = engine->kernel_context;
  	struct i915_request *rq;
  	unsigned long flags;
  	bool result = true;
@@ -99,15 +100,13 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
  	 * retiring the last request, thus all rings should be empty and
  	 * all timelines idle.
  	 */
-	flags = __timeline_mark_lock(engine->kernel_context);
+	flags = __timeline_mark_lock(ce);
  
-	rq = __i915_request_create(engine->kernel_context, GFP_NOWAIT);
+	rq = __i915_request_create(ce, GFP_NOWAIT);
  	if (IS_ERR(rq))
  		/* Context switch failed, hope for the best! Maybe reset? */
  		goto out_unlock;
  
-	intel_timeline_enter(i915_request_timeline(rq));
-
  	/* Check again on the next retirement. */
  	engine->wakeref_serial = engine->serial + 1;
  	i915_request_add_active_barriers(rq);
@@ -116,13 +115,17 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
  	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
  	__i915_request_commit(rq);
  
+	__i915_request_queue(rq, NULL);
+
  	/* Release our exclusive hold on the engine */
  	__intel_wakeref_defer_park(&engine->wakeref);
-	__i915_request_queue(rq, NULL);
+
+	/* And finally expose our selfselves to intel_gt_retire_requests() 

ourselves

*/
+	intel_timeline_enter(ce->timeline);

I haven't really managed to follow this.

What are these other clients which can queue requests and what in this 
block prevents them doing so?

The change seems to be moving the queuing to before 
__intel_wakeref_defer_park and intel_timeline_enter to last. Wakeref 
defer extends the engine lifetime until the submitted rq is retired. But 
how is that consider "unlocking"?

Regards,

Tvrtko

  
  	result = false;
  out_unlock:
-	__timeline_mark_unlock(engine->kernel_context, flags);
+	__timeline_mark_unlock(ce, flags);
  	return result;
  }
  

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx