On Sun, Sep 08, 2013 at 02:52:10PM +0200, Daniel Vetter wrote: > My g33 here seems to be shockingly good at hitting them all. This time > around kms_flip/flip-vs-panning-vs-hang blows up: > > intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and > if a gpu hang is pending aborts the wait for outstanding flips so that > the setcrtc call will succeed and release the crtc mutex. And the gpu > hang handler needs that lock in intel_display_handle_reset to be able > to complete outstanding flips. > > The problem is that we can race in two ways: > - Waiters on the dev_priv->pending_flip_queue aren't woken up after > we've the reset as pending, but before we actually start the reset > work. This means that the waiter doesn't notice the pending reset > and hence will keep on hogging the locks. > > Like with dev->struct_mutex and the ring->irq_queue wait queues we > there need to wake up everyone that potentially holds a lock which > the reset handler needs. > > - intel_display_handle_reset was called _after_ we've already > signalled the completion of the reset work. Which means a waiter > could sneak in, grab the lock and never release it (since the > pageflips won't ever get released). > > Similar to resetting the gem state all the reset work must complete > before we update the reset counter. Contrary to the gem reset we > don't need to have a second explicit wake up call since that will > have happened already when completing the pageflips. We also don't > have any issues that the completion happens while the reset state is > still pending - wait_for_pending_flips is only there to ensure we > display the right frame. After a gpu hang&reset events such > guarantees are out the window anyway. This is in contrast to the gem > code where too-early wake-up would result in unnecessary restarting > of ioctls. > > Also, since we've gotten these various deadlocks and ordering > constraints wrong so often throw copious amounts of comments at the > code. > > This deadlock regression has been introduced in the commit which added > the pageflip reset logic to the gpu hang work: > > commit 96a02917a0131e52efefde49c2784c0421d6c439 > Author: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx> > Date: Mon Feb 18 19:08:49 2013 +0200 > > drm/i915: Finish page flips and update primary planes after a GPU reset > > v2: > - Add comments to explain how the wake_up serves as memory barriers > for the atomic_t reset counter. > - Improve the comments a bit as suggested by Chris Wilson. > - Extract the wake_up calls before/after the reset into a little > i915_error_wake_up and unconditionally wake up the > pending_flip_queue waiters, again as suggested by Chris Wilson. > > Cc: stable@xxxxxxxxxxxxxxx > Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx> > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxx> > --- > drivers/gpu/drm/i915/i915_irq.c | 55 ++++++++++++++++++++++++++++++----------- > 1 file changed, 41 insertions(+), 14 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index 83cce0c..2331739 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -1469,6 +1469,21 @@ static irqreturn_t ironlake_irq_handler(int irq, void *arg) > return ret; > } > > +static void i915_error_wake_up(struct drm_i915_private *dev_priv, > + bool reset_completed) > +{ > + struct intel_ring_buffer *ring; > + int i; Might as well make that memory barrier explicit: smb_mb__after_atomic_inc(); The cost of an additional mb is insignificant here. Something like: /* Notify all waiters for GPU completion events that reset state has * been changed, and that they need to restart their wait after * checking for potential errors. */ I don't mind repeating ourselves as this logic impacts faraway logic in i915_gem and intel_display etc. /* Wakeup __wait_seqno() [dev->struct_mutex] */ > + for_each_ring(ring, dev_priv, i) > + wake_up_all(&ring->irq_queue); > + /* Wakeup intel_crtc_wait_for_pending_flips() [dev->mode_config.lock] */ > + wake_up_all(&dev_priv->pending_flip_queue); > + /* Wakeup our error checking mutex after we have finished reseting the GPU, i915_gem_wait_for_error() [no locks held by the waiters] */ > + if (reset_completed) > + wake_up_all(&dev_priv->gpu_error.reset_queue); > +} With some additional comments to drive home i915_error_wake_up(), Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> -Chris -- Chris Wilson, Intel Open Source Technology Centre -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html