On Fri, Feb 19, 2016 at 04:09:03PM +0200, Mika Kuoppala wrote: > If we have runaway head moving out of allocated address space, > that space is mapped to point into scratch page. The content of scratch > page is is zero (MI_NOOP). This leads to actual head proceeding > unhindered towards the end of the address space and with with 64 bit > vmas it is a long walk. > > We could inspect ppgtts to see if acthd is on valid space. But > that would need a lock over active vma list and we have tried very > hard to keep hangcheck lockfree. > > Take note of our current global highest vma address, when objects > are added to active list. And check against this address when > hangcheck is run. This is more coarse than per ppgtt inspection, > but it should work of finding runaway head. > > Testcase: igt/drv_hangman/ppgtt_walk > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx> > Signed-off-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxx> > --- > drivers/gpu/drm/i915/i915_debugfs.c | 2 ++ > drivers/gpu/drm/i915/i915_gem.c | 4 ++++ > drivers/gpu/drm/i915/i915_irq.c | 7 +++++++ > drivers/gpu/drm/i915/intel_ringbuffer.h | 1 + > 4 files changed, 14 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c > index 9e19cf0e7075..b6735ae32997 100644 > --- a/drivers/gpu/drm/i915/i915_debugfs.c > +++ b/drivers/gpu/drm/i915/i915_debugfs.c > @@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused) > (long long)acthd[i]); > seq_printf(m, "\tmax ACTHD = 0x%08llx\n", > (long long)ring->hangcheck.max_acthd); > + seq_printf(m, "\tmax vma = 0x%08llx\n", > + (long long)ring->hangcheck.max_active_vma); > seq_printf(m, "\tscore = %d\n", ring->hangcheck.score); > seq_printf(m, "\taction = %d\n", ring->hangcheck.action); > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index f68f34606f2f..331b7a3d4206 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma, > list_move_tail(&obj->ring_list[ring->id], &ring->active_list); > i915_gem_request_assign(&obj->last_read_req[ring->id], req); > > + if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma) > + ring->hangcheck.max_active_vma = > + vma->node.start + vma->node.size; No. > + > list_move_tail(&vma->mm_list, &vma->vm->active_list); > } > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index 25a89373df63..e59817328971 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd) > sizeof(ring->hangcheck.instdone)); > > if (acthd > ring->hangcheck.max_acthd) { > + u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma); > + > ring->hangcheck.max_acthd = acthd; > + > + if (max_vma && acthd > max_vma) > + return HANGCHECK_HUNG; > + > return HANGCHECK_ACTIVE; The issue is that the active-loop detection completely nullifies the DoS detection, what I thought I sent in Dec/Jan was a tweak to always increment for loops: commit 7ea8bea0738d6a2449f1819d8ee6efd402ae7a6c Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> Date: Fri Jan 8 18:07:39 2016 +0000 drm/i915/hangcheck: Prevent long walks across full-ppgtt With full-ppgtt, it takes the GPU an eon to traverse the entire 256PiB address space, so causing a loop to be detected. Under the current scheme, if ACTHD walks off the end of a batch buffer and into an empty address space, we "never" detect the hang. If we always increment the score as the ACTHD is progressing then we will eventually timeout (after ~60s without advancing onto a new batch). To counter act this, increase the amount we reduce the score for good batches, so that only a series of almost-bad batches trigger a full reset. Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 8e449772b1bb..4fc4fbe4b798 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2961,21 +2961,15 @@ static enum intel_engine_hangcheck_action head_stuck(struct intel_engine_cs *ring, u64 acthd) { if (acthd != ring->hangcheck.acthd) { - /* Clear subunit states on head movement */ memset(ring->hangcheck.instdone, 0, sizeof(ring->hangcheck.instdone)); - if (acthd > ring->hangcheck.max_acthd) { - ring->hangcheck.max_acthd = acthd; - return HANGCHECK_ACTIVE; - } - return HANGCHECK_ACTIVE_LOOP; } if (!subunits_stuck(ring)) - return HANGCHECK_ACTIVE; + return HANGCHECK_ACTIVE_LOOP; return HANGCHECK_HUNG; } @@ -3142,7 +3136,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work) * attempts across multiple batches. */ if (ring->hangcheck.score > 0) - ring->hangcheck.score--; + ring->hangcheck.score -= HUNG; + if (ring->hangcheck.score < 0) + ring->hangcheck.score = 0; /* Clear head and subunit states on seqno movement */ ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0; -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx