Quoting Mika Kuoppala (2019-02-08 14:47:13) > Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > > > If we haven't even begun executing the payload of the stalled request, > > then we should not claim that its userspace context was guilty of > > submitting a hanging batch. > > > > v2: Check for context corruption before trying to restart. > > v3: Preserve semaphores on skipping requests (need to keep the timelines > > intact). > > > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > --- > > drivers/gpu/drm/i915/intel_lrc.c | 42 +++++++++++++++++-- > > drivers/gpu/drm/i915/selftests/igt_spinner.c | 9 +++- > > .../gpu/drm/i915/selftests/intel_hangcheck.c | 6 +++ > > 3 files changed, 53 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > > index 5e98fd79bd9d..e3134a635926 100644 > > --- a/drivers/gpu/drm/i915/intel_lrc.c > > +++ b/drivers/gpu/drm/i915/intel_lrc.c > > @@ -1387,6 +1387,10 @@ static int gen8_emit_init_breadcrumb(struct i915_request *rq) > > *cs++ = rq->fence.seqno - 1; > > > > intel_ring_advance(rq, cs); > > + > > + /* Record the updated position of the request's payload */ > > + rq->infix = intel_ring_offset(rq, cs); > > + > > return 0; > > } > > > > @@ -1878,6 +1882,23 @@ static void execlists_reset_prepare(struct intel_engine_cs *engine) > > spin_unlock_irqrestore(&engine->timeline.lock, flags); > > } > > > > +static bool lrc_regs_ok(const struct i915_request *rq) > > +{ > > + const struct intel_ring *ring = rq->ring; > > + const u32 *regs = rq->hw_context->lrc_reg_state; > > + > > + /* Quick spot check for the common signs of context corruption */ > > + > > + if (regs[CTX_RING_BUFFER_CONTROL + 1] != > > + (RING_CTL_SIZE(ring->size) | RING_VALID)) > > + return false; > > You been noticing this with ctx corruption? Well now > thinking about it, we have had reports where on init, > on some trouble, the valid vanishes. Yes, it's why we've been copying the default context over guilty for a long time (pretty much since live_hangcheck became a thing iirc). > > + > > + if (regs[CTX_RING_BUFFER_START + 1] != i915_ggtt_offset(ring->vma)) > > + return false; > > + > > Seen this on bugzilla reports too. Are there more > in your sleeve or is this a compromise on complexity > and performance? Checking on a sane actual head too? No, I can't remember seeing this, just loosing CTL. But I do recall that at one point it seemed the whole reg state was zero, but that is foggy memory. > The heuristics of it bothers me some as we will > get false positives. They cannot be false positives! If we restore to a batch setup like this, it will hang -- which is why we explicitly reset them. > So in effect, when we get one, we just move ahead > after an extra reset as we got it all wrong? Yup. The context is corrupt, we replace it with a sane one and hope nobody notices. Mesa does notice though.... > > + return true; > > +} > > + > > static void execlists_reset(struct intel_engine_cs *engine, bool stalled) > > { > > struct intel_engine_execlists * const execlists = &engine->execlists; > > @@ -1912,6 +1933,21 @@ static void execlists_reset(struct intel_engine_cs *engine, bool stalled) > > if (!rq) > > goto out_unlock; > > > > + /* > > + * If this request hasn't started yet, e.g. it is waiting on a > > + * semaphore, we need to avoid skipping the request or else we > > + * break the signaling chain. However, if the context is corrupt > > + * the request will not restart and we will be stuck with a wedged > > + * device. It is quite often the case that if we issue a reset > > + * while the GPU is loading the context image, that context image > > + * becomes corrupt. > > + * > > + * Otherwise, if we have not started yet, the request should replay > > + * perfectly and we do not need to flag the result as being erroneous. > > + */ > > + if (!i915_request_started(rq) && lrc_regs_ok(rq)) > > + goto out_unlock; > > + > > /* > > * If the request was innocent, we leave the request in the ELSP > > * and will try to replay it on restarting. The context image may > > @@ -1924,7 +1960,7 @@ static void execlists_reset(struct intel_engine_cs *engine, bool stalled) > > * image back to the expected values to skip over the guilty request. > > */ > > i915_reset_request(rq, stalled); > > - if (!stalled) > > + if (!stalled && lrc_regs_ok(rq)) > > Extend the ctx validator usage for on guilty engines, well why not. We forcibly reset anyway. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx