Re: [PATCH] drm/i915: Prevent bonded requests from overtaking each other on preemption

On 20/09/2019 09:36, Chris Wilson wrote:
Force bonded requests to run on distinct engines so that they cannot be
shuffled onto the same engine where timeslicing will reverse the order.
A bonded request will often wait on a semaphore signaled by its master,
creating an implicit dependency -- if we ignore that implicit dependency
and allow the bonded request to run on the same engine and before its
master, we will cause a GPU hang.

We can prevent this inversion by restricting which engines we allow
ourselves to jump to upon preemption, i.e. baking in the arrangement
established at first execution. (We should also consider capturing the
implicit dependency using i915_sched_add_dependency(), but first we need
to think about the constraints that requires on the execution/retirement
ordering.)

Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
References: ee1136908e9b ("drm/i915/execlists: Virtual engine bonding")
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
---
  drivers/gpu/drm/i915/gt/intel_lrc.c | 19 +++++++++++--------
  1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index a99166a2d2eb..7920649e4d87 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -3755,18 +3755,21 @@ static void
  virtual_bond_execute(struct i915_request *rq, struct dma_fence *signal)
  {
  	struct virtual_engine *ve = to_virtual_engine(rq->engine);
+	intel_engine_mask_t allowed, exec;
  	struct ve_bond *bond;
 
  	bond = virtual_find_bond(ve, to_request(signal)->engine);
-	if (bond) {
-		intel_engine_mask_t old, new, cmp;
+	if (!bond)
+		return;
 
-		cmp = READ_ONCE(rq->execution_mask);
-		do {
-			old = cmp;
-			new = cmp & bond->sibling_mask;
-		} while ((cmp = cmpxchg(&rq->execution_mask, old, new)) != old);
-	}
+	/* Restrict the bonded request to run on only the slaved engines */
+	allowed = bond->sibling_mask & ~to_request(signal)->engine->mask;

Hmm... isn't it a miss at the uapi level that we allow the master to be mentioned in the list of bonds? That's the only scenario in which this line does anything, I think. So should we just forbid this setup at the uapi level?

+	exec = READ_ONCE(rq->execution_mask);
+	while (!try_cmpxchg(&rq->execution_mask, &exec, exec & allowed))
+		;
+
+	/* Prevent the master from being re-run on the slaved engines */
+	to_request(signal)->execution_mask &= ~allowed;

This sounds unfortunate for future scheduling. There shouldn't be a fundamental reason why the next execution of the master couldn't be on an engine which can also be a slave. So if we have:

master
  .veng=vcs0,vcs1
slave
  .veng=vcs0,vcs1
  .bond(master=vcs0, mask=vcs1)
  .bond(master=vcs1, mask=vcs0)

This should be an allowed setup, but with this change it would fix the master to only one of the options.

Is the real problem that after preemption for timeslicing and subsequent re-submit we miss some hooks to re-evaluate the bonded relationship?

I guess it would be hard to do any peeking from one submission tasklet to another (on different engines) to check whether one of the pair is already executing again, and so to pick the other end correctly?

I think in practical terms this works for media, since they are not setting it up like my sketch shows. So it could be just fine in practice for current users.

Regards,

Tvrtko

  }
  struct intel_context *

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
