On Thu, Jan 10, 2013 at 02:23:37AM +0000, Chris Wilson wrote: > Once we thought we got semaphores working, we disabled kicking the ring > if hangcheck fired whilst waiting upon a ring as it was doing more harm > than good: > > commit 4e0e90dcb8a7df1229c69e30abebb59b0b3c2a1f > Author: Daniel Vetter <daniel.vetter at ffwll.ch> > Date: Wed Dec 14 13:56:58 2011 +0100 > > drm/i915: kicking rings stuck on semaphores considered harmful > > However, life is never that easy and semaphores are still causing > problems whereby the value written by one ring (bcs) is not being > propagated to the waiter (rcs). Thus the waiter never wakes up and we > declare the GPU hung, which often has unfortunate consequences, even if > we successfully reset the GPU. > > But the GPU is idle as it has completed the work, just didn't notify its > clients. So we can detect the incomplete wait during hang check and > probe the target ring to see if has indeed emitted the breadcrumb seqno > following the work and then and only then kick the waiter. > > Based on a suggestion by Ben Widawsky. > > References: https://bugs.freedesktop.org/show_bug.cgi?id=54226 > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch> > Cc: Ben Widawsky <ben at bwidawsk.net> > --- > drivers/gpu/drm/i915/i915_irq.c | 34 ++++++++++++++++++++++++++++++++++ > 1 file changed, 34 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index 39e4177..3cc52b2 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -1693,6 +1693,32 @@ static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err) > return false; > } > > +static bool semaphore_passed(struct intel_ring_buffer *waiter) > +{ > + struct drm_i915_private *dev_priv = waiter->dev->dev_private; > + int acthd = intel_ring_get_active_head(waiter) & HEAD_ADDR; > + struct intel_ring_buffer *signaller; > + u32 cmd; > + > + /* ACTHD is likely pointing to the dword after the actual command, > + * so scan backwards until we find the MBOX. > + */ > +#define WAIT (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER) > + do { > + cmd = ioread32(waiter->virtual_start+acthd); > + if ((cmd & ~(0x3 << 16)) == WAIT) > + break; > + acthd -= 4; > + if (acthd < 0) > + return false; > + } while (1); > +#undef WAIT > + > + signaller = &dev_priv->ring[(waiter->id + (((cmd >> 16) & 3) - 1)) % 3]; I hate you for this line ^^ > + return i915_seqno_passed(signaller->get_seqno(signaller, false), > + ioread32(waiter->virtual_start+acthd+4)+1); > +} > + For the sake of just testing if this makes the problem go away, this could have been a lot simpler. Just check the RCS mbox (since in our discussion, you said all the bugs had been RCS waiting). i915_seqno_passed(bcs->get_seqno, GEN6_RBSYNC) > static bool kick_ring(struct intel_ring_buffer *ring) > { > struct drm_device *dev = ring->dev; > @@ -1704,6 +1730,14 @@ static bool kick_ring(struct intel_ring_buffer *ring) > I915_WRITE_CTL(ring, tmp); > return true; > } > + if (INTEL_INFO(dev)->gen >= 6 && > + tmp & RING_WAIT_SEMAPHORE && > + semaphore_passed(ring)) { > + DRM_ERROR("Kicking stuck semaphore on %s\n", > + ring->name); > + I915_WRITE_CTL(ring, tmp); > + return true; > + } > return false; > } I'd go with just gen == 6, for now. Also, since we know the waiter from your horrifically complex calculation above, maybe we can get that info in the DRM_ERROR as well? On the idea: Acked-by: Ben Widawsky <ben at bwidawsk.net> I'll happily take time to turn it into an r-b if it fixes a bug. -- Ben Widawsky, Intel Open Source Technology Center