Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> Quoting Mika Kuoppala (2019-02-08 14:47:13)
>> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
>>
>> > If we haven't even begun executing the payload of the stalled request,
>> > then we should not claim that its userspace context was guilty of
>> > submitting a hanging batch.
>> >
>> > v2: Check for context corruption before trying to restart.
>> > v3: Preserve semaphores on skipping requests (need to keep the timelines
>> > intact).
>> >
>> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
>> > ---
>> >  drivers/gpu/drm/i915/intel_lrc.c              | 42 +++++++++++++++++--
>> >  drivers/gpu/drm/i915/selftests/igt_spinner.c  |  9 +++-
>> >  .../gpu/drm/i915/selftests/intel_hangcheck.c  |  6 +++
>> >  3 files changed, 53 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> > index 5e98fd79bd9d..e3134a635926 100644
>> > --- a/drivers/gpu/drm/i915/intel_lrc.c
>> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> > @@ -1387,6 +1387,10 @@ static int gen8_emit_init_breadcrumb(struct i915_request *rq)
>> >  	*cs++ = rq->fence.seqno - 1;
>> >
>> >  	intel_ring_advance(rq, cs);
>> > +
>> > +	/* Record the updated position of the request's payload */
>> > +	rq->infix = intel_ring_offset(rq, cs);
>> > +
>> >  	return 0;
>> >  }
>> >
>> > @@ -1878,6 +1882,23 @@ static void execlists_reset_prepare(struct intel_engine_cs *engine)
>> >  	spin_unlock_irqrestore(&engine->timeline.lock, flags);
>> >  }
>> >
>> > +static bool lrc_regs_ok(const struct i915_request *rq)
>> > +{
>> > +	const struct intel_ring *ring = rq->ring;
>> > +	const u32 *regs = rq->hw_context->lrc_reg_state;
>> > +
>> > +	/* Quick spot check for the common signs of context corruption */
>> > +
>> > +	if (regs[CTX_RING_BUFFER_CONTROL + 1] !=
>> > +	    (RING_CTL_SIZE(ring->size) | RING_VALID))
>> > +		return false;
>>
>> You been noticing this with ctx corruption? Well now
>> thinking about it, we have had reports where on init,
>> on some trouble, the valid vanishes.
>
> Yes, it's why we've been copying the default context over guilty for a
> long time (pretty much since live_hangcheck became a thing iirc).
>
>> > +
>> > +	if (regs[CTX_RING_BUFFER_START + 1] != i915_ggtt_offset(ring->vma))
>> > +		return false;
>> > +
>>
>> Seen this on bugzilla reports too. Are there more
>> in your sleeve or is this a compromise on complexity
>> and performance? Checking on a sane actual head too?
>
> No, I can't remember seeing this, just losing CTL. But I do recall that
> at one point it seemed the whole reg state was zero, but that is foggy
> memory.

I might mix this with the failed init where the start was
bogus/stale. Looked like after the hang, the hardware just
swapped to a previous ctx without any provocation.

Regardless of the reason, this will guard the restart.

>
>> The heuristics of it bothers me some as we will
>> get false positives.
>
> They cannot be false positives! If we restore to a batch setup like
> this, it will hang -- which is why we explicitly reset them.
>
>> So in effect, when we get one, we just move ahead
>> after an extra reset as we got it all wrong?
>
> Yup. The context is corrupt, we replace it with a sane one and hope
> nobody notices. Mesa does notice though....

We tell them about our shortcomings, if they choose to listen.

Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
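
For anyone following the lrc_regs_ok() discussion without the driver
source at hand, here is a minimal standalone sketch of the same kind of
spot check. It is not the i915 code: every sketch_/SKETCH_ name, the
register indices, the RING_VALID bit position and the size encoding are
assumptions made up for illustration. It only shows the shape of the
idea: compare the saved ring control and ring start values against what
the ring says they must be, and rewrite them if they do not match.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names and values below are hypothetical, not the i915 definitions. */
#define SKETCH_CTX_RING_BUFFER_START   0x08 /* assumed index of the LRI offset */
#define SKETCH_CTX_RING_BUFFER_CONTROL 0x0a /* assumed index of the LRI offset */
#define SKETCH_NUM_REGS                0x10
#define SKETCH_RING_VALID              0x1u
/* Assumed encoding: ring size in bytes minus one 4 KiB page. */
#define SKETCH_RING_CTL_SIZE(size)     ((uint32_t)(size) - 4096u)

struct sketch_ring {
	uint32_t size;        /* ring size in bytes */
	uint32_t ggtt_offset; /* address the ring is expected to start at */
};

/* Write the values a sane saved context would hold for this ring. */
static void sketch_init_regs(uint32_t *regs, const struct sketch_ring *ring)
{
	regs[SKETCH_CTX_RING_BUFFER_START + 1] = ring->ggtt_offset;
	regs[SKETCH_CTX_RING_BUFFER_CONTROL + 1] =
		SKETCH_RING_CTL_SIZE(ring->size) | SKETCH_RING_VALID;
}

/* Quick spot check: do the saved values still describe this ring? */
static bool sketch_regs_ok(const uint32_t *regs, const struct sketch_ring *ring)
{
	if (regs[SKETCH_CTX_RING_BUFFER_CONTROL + 1] !=
	    (SKETCH_RING_CTL_SIZE(ring->size) | SKETCH_RING_VALID))
		return false;

	if (regs[SKETCH_CTX_RING_BUFFER_START + 1] != ring->ggtt_offset)
		return false;

	return true;
}

/* If the check fails, fall back to known-good values before restarting. */
static bool sketch_guard_restart(uint32_t *regs, const struct sketch_ring *ring)
{
	if (sketch_regs_ok(regs, ring))
		return true;  /* state looks sane, restart as-is */

	sketch_init_regs(regs, ring);
	return false;         /* state was replaced */
}
```

In this sketch a mismatch is never a false positive in the sense argued
above: the comparison is against the only values the ring could
legitimately have, so any difference means the saved state cannot be
restored as it stands.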