On Sun, Sep 08, 2013 at 09:57:13PM +0200, Daniel Vetter wrote:
> My g33 here seems to be shockingly good at hitting them all. This time
> around kms_flip/flip-vs-panning-vs-hang blows up:
> 
> intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and,
> if a gpu hang is pending, aborts the wait for outstanding flips so that
> the setcrtc call will succeed and release the crtc mutex. And the gpu
> hang handler needs that lock in intel_display_handle_reset to be able
> to complete outstanding flips.
> 
> The problem is that we can race in two ways:
> - Waiters on the dev_priv->pending_flip_queue aren't woken up after
>   we've marked the reset as pending, but before we actually start the
>   reset work. This means that the waiter doesn't notice the pending
>   reset and hence will keep on hogging the locks.
> 
>   Like with dev->struct_mutex and the ring->irq_queue wait queues, we
>   therefore need to wake up everyone that potentially holds a lock
>   which the reset handler needs.
> 
> - intel_display_handle_reset was called _after_ we've already
>   signalled the completion of the reset work. Which means a waiter
>   could sneak in, grab the lock and never release it (since the
>   pageflips won't ever get released).
> 
>   Similar to resetting the gem state, all the reset work must complete
>   before we update the reset counter. Contrary to the gem reset we
>   don't need a second explicit wake-up call since that will have
>   happened already when completing the pageflips. We also don't have
>   any issues with the completion happening while the reset state is
>   still pending - wait_for_pending_flips is only there to ensure we
>   display the right frame. After gpu hang & reset events such
>   guarantees are out the window anyway. This is in contrast to the
>   gem code where a too-early wake-up would result in unnecessary
>   restarting of ioctls.
> 
> Also, since we've gotten these various deadlocks and ordering
> constraints wrong so often, throw copious amounts of comments at the
> code.
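The ordering protocol described above can be sketched outside the kernel. The following is an illustrative stand-alone model, not the i915 code: the constant, helper names, and comments mirror the kernel's `i915_handle_error`/`i915_error_work_func` but are made up here, and userspace C11 atomics stand in for the kernel's `atomic_t`. The low bit of the reset counter marks a reset as pending: the hang handler makes the counter odd before waking lock holders, and the reset work makes it even again only after all reset work is done.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model only -- not the kernel code. The low bit of the
 * reset counter flags a reset in progress: odd = pending, even = done. */
#define RESET_IN_PROGRESS_FLAG 1

static atomic_uint reset_counter;

static bool reset_in_progress(void)
{
	return atomic_load(&reset_counter) & RESET_IN_PROGRESS_FLAG;
}

/* Hang handler (cf. i915_handle_error): first make the counter odd so
 * that woken waiters see the pending reset, then wake everyone who
 * might hold a lock the reset work needs. */
static void handle_error(void)
{
	atomic_fetch_add(&reset_counter, RESET_IN_PROGRESS_FLAG);
	/* i915_error_wake_up(dev_priv, false) would happen here */
}

/* Reset work (cf. i915_error_work_func): the counter becomes even
 * again only after ALL reset work -- including the display-side
 * pageflip completion -- has finished, so no waiter can observe
 * "reset done" and grab a lock while the work still needs it. */
static void error_work(void)
{
	/* i915_reset(dev) + intel_display_handle_reset(dev) go here,
	 * strictly before the counter update below */
	atomic_fetch_add(&reset_counter, RESET_IN_PROGRESS_FLAG);
	/* i915_error_wake_up(dev_priv, true) would happen here */
}
```

A full hang/reset cycle then reads: `handle_error()` makes `reset_in_progress()` true so waiters back off and drop their locks; `error_work()` makes it false again only once everything is restored, and the counter has advanced by two, so a waiter that sampled it before the hang can also detect that a reset happened in between.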
> 
> This deadlock regression has been introduced in the commit which added
> the pageflip reset logic to the gpu hang work:
> 
> commit 96a02917a0131e52efefde49c2784c0421d6c439
> Author: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
> Date:   Mon Feb 18 19:08:49 2013 +0200
> 
>     drm/i915: Finish page flips and update primary planes after a GPU reset
> 
> v2:
> - Add comments to explain how the wake_up serves as memory barriers
>   for the atomic_t reset counter.
> - Improve the comments a bit as suggested by Chris Wilson.
> - Extract the wake_up calls before/after the reset into a little
>   i915_error_wake_up and unconditionally wake up the
>   pending_flip_queue waiters, again as suggested by Chris Wilson.
> 
> v3: Throw copious amounts of comments at i915_error_wake_up as
> suggested by Chris Wilson.
> 
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxx>

Ok, smashed onto -fixes. Thanks a lot for your critical review.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_irq.c | 68 ++++++++++++++++++++++++++++++++---------
>  1 file changed, 54 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 83cce0c..4b91228 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1469,6 +1469,34 @@ static irqreturn_t ironlake_irq_handler(int irq, void *arg)
>  	return ret;
>  }
>  
> +static void i915_error_wake_up(struct drm_i915_private *dev_priv,
> +			       bool reset_completed)
> +{
> +	struct intel_ring_buffer *ring;
> +	int i;
> +
> +	/*
> +	 * Notify all waiters for GPU completion events that reset state has
> +	 * been changed, and that they need to restart their wait after
> +	 * checking for potential errors (and bail out to drop locks if there is
> +	 * a gpu reset pending so that i915_error_work_func can acquire them).
> +	 */
> +
> +	/* Wake up __wait_seqno, potentially holding dev->struct_mutex. */
> +	for_each_ring(ring, dev_priv, i)
> +		wake_up_all(&ring->irq_queue);
> +
> +	/* Wake up intel_crtc_wait_for_pending_flips, holding crtc->mutex. */
> +	wake_up_all(&dev_priv->pending_flip_queue);
> +
> +	/*
> +	 * Signal tasks blocked in i915_gem_wait_for_error that the pending
> +	 * reset state is cleared.
> +	 */
> +	if (reset_completed)
> +		wake_up_all(&dev_priv->gpu_error.reset_queue);
> +}
> +
>  /**
>   * i915_error_work_func - do process context error handling work
>   * @work: work struct
> @@ -1483,11 +1511,10 @@ static void i915_error_work_func(struct work_struct *work)
>  	drm_i915_private_t *dev_priv = container_of(error, drm_i915_private_t,
>  						    gpu_error);
>  	struct drm_device *dev = dev_priv->dev;
> -	struct intel_ring_buffer *ring;
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> -	int i, ret;
> +	int ret;
>  
>  	kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, error_event);
>  
> @@ -1506,8 +1533,16 @@ static void i915_error_work_func(struct work_struct *work)
>  		kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE,
>  				   reset_event);
>  
> +		/*
> +		 * All state reset _must_ be completed before we update the
> +		 * reset counter, for otherwise waiters might miss the reset
> +		 * pending state and not properly drop locks, resulting in
> +		 * deadlocks with the reset work.
> +		 */
>  		ret = i915_reset(dev);
>  
> +		intel_display_handle_reset(dev);
> +
>  		if (ret == 0) {
>  			/*
>  			 * After all the gem state is reset, increment the reset
> @@ -1528,12 +1563,11 @@ static void i915_error_work_func(struct work_struct *work)
>  			atomic_set(&error->reset_counter, I915_WEDGED);
>  		}
>  
> -		for_each_ring(ring, dev_priv, i)
> -			wake_up_all(&ring->irq_queue);
> -
> -		intel_display_handle_reset(dev);
> -
> -		wake_up_all(&dev_priv->gpu_error.reset_queue);
> +		/*
> +		 * Note: The wake_up also serves as a memory barrier so that
> +		 * waiters see the updated value of the reset counter atomic_t.
> +		 */
> +		i915_error_wake_up(dev_priv, true);
>  	}
>  }
>  
> @@ -1642,8 +1676,6 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
>  void i915_handle_error(struct drm_device *dev, bool wedged)
>  {
>  	struct drm_i915_private *dev_priv = dev->dev_private;
> -	struct intel_ring_buffer *ring;
> -	int i;
>  
>  	i915_capture_error_state(dev);
>  	i915_report_and_clear_eir(dev);
> 
> @@ -1653,11 +1685,19 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
>  				&dev_priv->gpu_error.reset_counter);
>  
>  		/*
> -		 * Wakeup waiting processes so that the reset work item
> -		 * doesn't deadlock trying to grab various locks.
> +		 * Wakeup waiting processes so that the reset work function
> +		 * i915_error_work_func doesn't deadlock trying to grab various
> +		 * locks. By bumping the reset counter first, the woken
> +		 * processes will see a reset in progress and back off,
> +		 * releasing their locks and then wait for the reset completion.
> +		 * We must do this for _all_ gpu waiters that might hold locks
> +		 * that the reset work needs to acquire.
> +		 *
> +		 * Note: The wake_up serves as the required memory barrier to
> +		 * ensure that the waiters see the updated value of the reset
> +		 * counter atomic_t.
>  		 */
> -		for_each_ring(ring, dev_priv, i)
> -			wake_up_all(&ring->irq_queue);
> +		i915_error_wake_up(dev_priv, false);
>  	}
>  }
>  
>  /*
> -- 
> 1.8.4.rc3
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
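Stepping back from the diff, the waiter-side half of the contract can also be modeled. This is an illustrative sketch, not kernel code: the struct and function names below are made up, loosely mirroring intel_crtc_wait_for_pending_flips. The point is that the wait condition must also become true when a reset is pending, so a waiter woken by i915_error_wake_up stops sleeping, notices the reset, and bails out so that its caller releases the locks the reset work needs.

```c
#include <stdbool.h>

/* Illustrative model only -- not the kernel code; names are made up. */
struct flip_wait_model {
	bool flip_pending;   /* outstanding pageflip the waiter sleeps on */
	bool reset_pending;  /* set by the hang handler before waking */
};

/* Condition used with the kernel's wait_event(): true when there is
 * nothing left to wait for OR a reset is pending. Because a pending
 * reset satisfies the condition, the wake-up from the hang handler is
 * guaranteed to terminate the sleep instead of being ignored. */
static bool wait_condition(const struct flip_wait_model *m)
{
	return !m->flip_pending || m->reset_pending;
}

/* Returns true if the flip really completed, false if the wait was
 * aborted because of a pending reset -- in which case the caller must
 * drop its locks (crtc->mutex in the kernel) and let the reset work
 * complete the flips instead. Modeled as a single check of the
 * post-wake-up state; the kernel would sleep until wait_condition()
 * holds. */
static bool wait_for_pending_flips(const struct flip_wait_model *m)
{
	return !m->reset_pending;
}
```

With this shape, the only state in which the waiter keeps sleeping is "flip pending, no reset pending", which is exactly the state the hang handler destroys before the reset work tries to take the waiter's locks.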