Quoting Chris Wilson (2020-01-21 11:33:36) > Quoting Tvrtko Ursulin (2020-01-21 11:20:36) > > > > On 21/01/2020 10:09, Chris Wilson wrote: > > > If we encounter a hang on a virtual engine, as we process the hang the > > > request may already have been moved back to the virtual engine (we are > > > processing the hang on the physical engine). We need to reclaim the > > > request from the virtual engine so that the locking is consistent and > > > local to the real engine on which we will hold the request for error > > > state capturing. > > > > > > Fixes: 748317386afb ("drm/i915/execlists: Offline error capture") > > > Testcase: igt/gem_exec_balancer/hang > > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx> > > > --- > > > drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++ > > > 1 file changed, 20 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c > > > index 3a30767ff0c4..a0acf1898c1e 100644 > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c > > > @@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine) > > > cap->rq = active_request(cap->rq->context->timeline, cap->rq); > > > GEM_BUG_ON(!cap->rq); > > > > > > + if (cap->rq->engine != engine) { /* preempted virtual engine */ > > > + struct virtual_engine *ve = to_virtual_engine(cap->rq->engine); > > > + unsigned long flags; > > > + > > > + /* > > > + * An unsubmitted request along a virtual engine will > > > + * remain on the active (this) engine until we are able > > > + * to process the context switch away (and so mark the > > > + * context as no longer in flight). That cannot have happened > > > + * yet, otherwise we would not be hanging! > > > + */ > > > + spin_lock_irqsave(&ve->base.active.lock, flags); > > > + GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine); > > > + GEM_BUG_ON(ve->request != cap->rq); > > > + ve->request = NULL; > > > + spin_unlock_irqrestore(&ve->base.active.lock, flags); > > > + > > > + cap->rq->engine = engine; > > > + } > > > > Conceptually this I think belongs in execlists_hold, not capture. Since > > hold has the responsibility to hold correctly. > > Yeah, I didn't put it into execlists_hold() as thought that would be > clearer without it, and that this was more a side-effect of > process_csb() vs __unwind_incomplete_requests(), and didn't want to mix > up with the question of engine->active.lock > > Agreed that execlists_capture() does not describe it well, so probably > shouldn't be called directly from here. > > > Then also, this is all about a race with __unwind_incomplete_requests, > > yes? If so would need to be done under the engine->active.lock or this > > can still happen between now and execlists_hold. > > The serialisation here is on process_csb(), since we have prevented the > tasklet being called and own the reset on the engine, we know that > execlists_schedule_out() can't be called on this engine, and so we own > the virtual request. It's not protected by the engine->active.lock. > > So happy with moving it to before the lock in execlists_hold()? Bleugh. Put it in execlists_hold() suggests adding something like static void execlists_hold(struct intel_engine_cs *engine, struct i915_request *rq) { if (cap->rq->engine != engine) { /* preempted virtual engine */ struct virtual_engine *ve = to_virtual_engine(cap->rq->engine); unsigned long flags; /* foo */ GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id, &engine->gt->reset.flags); /* * An unsubmitted request along a virtual engine will * remain on the active (this) engine until we are able * to process the context switch away (and so mark the * context as no longer in flight). That cannot have happened * yet, otherwise we would not be hanging! */ spin_lock_irqsave(&ve->base.active.lock, flags); GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine); GEM_BUG_ON(ve->request != cap->rq); ve->request = NULL; spin_unlock_irqrestore(&ve->base.active.lock, flags); cap->rq->engine = engine; } spin_lock_irq(&engine->active.lock); /* * Transfer this request onto the hold queue to prevent it * being resumbitted to HW (and potentially completed) before we have * released it. Since we may have already submitted following * requests, we need to remove those as well. */ GEM_BUG_ON(i915_request_on_hold(rq)); GEM_BUG_ON(rq->engine != engine); __execlists_hold(rq); spin_unlock_irq(&engine->active.lock); } Writing down the serialisation rules does suggest that execlists_hold() is not as generic as before. (On the other hand, doing so will help with writing the selftest for this bug, and engine->suspend() is a completely different kettle of fish that plays games with engine->active.hold directly.) -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx