On Sun, Sep 08, 2013 at 01:30:17PM +0200, Daniel Vetter wrote:
> My g33 here seems to be shockingly good at hitting them all. This time
> around kms_flip/flip-vs-panning-vs-hang blows up:
>
> intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and
> if a gpu hang is pending aborts the wait for outstanding flips so that
> the setcrtc call will succeed and release the crtc mutex. And the gpu
> hang handler needs that lock in intel_display_handle_reset to be able
> to complete outstanding flips.
>
> The problem is that we can race in two ways:
> - Waiters on the dev_priv->pending_flip_queue aren't woken up after
>   we've marked the reset as pending, but before we actually start the
>   reset work. This means that the waiter doesn't notice the pending
>   reset and hence will keep on hogging the locks.
>
>   Like with dev->struct_mutex and the ring->irq_queue wait queues we
>   therefore need to wake up everyone that potentially holds a lock
>   which the reset handler needs.
>
> - intel_display_handle_reset was called _after_ we've already
>   signalled the completion of the reset work. Which means a waiter
>   could sneak in, grab the lock and never release it (since the
>   pageflips won't ever get released).
>
>   Similar to resetting the gem state, all the reset work must complete
>   before we update the reset counter. Contrary to the gem reset we
>   don't need a second explicit wake-up call since that will have
>   happened already when completing the pageflips. We also don't have
>   any issue with the completion happening while the reset state is
>   still pending - wait_for_pending_flips is only there to ensure we
>   display the right frame. After gpu hang & reset events such
>   guarantees are out the window anyway. This is in contrast to the gem
>   code where a too-early wake-up would result in unnecessary
>   restarting of ioctls.

I would favour the redundant wake_up() in order to move the common
pre/post wake-ups into a common function: i915_error_wake_up()?
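For concreteness, a minimal sketch of the helper being proposed here. The name i915_error_wake_up() and its exact shape are my suggestion, not code from the patch; the wait queues and the for_each_ring() iterator are as they appear in the quoted diff:

```c
/*
 * Hypothetical sketch only: wake up everyone who might be blocked
 * waiting on the GPU while holding a lock that the reset work
 * (i915_error_work_func) needs to acquire.
 */
static void i915_error_wake_up(struct drm_i915_private *dev_priv)
{
	struct intel_ring_buffer *ring;
	int i;

	/* Seqno waiters, potentially holding dev->struct_mutex. */
	for_each_ring(ring, dev_priv, i)
		wake_up_all(&ring->irq_queue);

	/* Pageflip waiters, potentially holding crtc mutexes. */
	wake_up_all(&dev_priv->pending_flip_queue);
}
```

i915_handle_error() would call this after marking the reset as pending, and i915_error_work_func() could call it again after completing the reset work; the second call is redundant for the flip waiters but harmless, since waking an empty queue is a no-op.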
> Also, since we've gotten these various deadlocks and ordering
> constraints wrong so often, throw copious amounts of comments at the
> code.
>
> This deadlock regression has been introduced in the commit which added
> the pageflip reset logic to the gpu hang work:
>
> commit 96a02917a0131e52efefde49c2784c0421d6c439
> Author: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
> Date:   Mon Feb 18 19:08:49 2013 +0200
>
>     drm/i915: Finish page flips and update primary planes after a GPU reset
>
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 83cce0c..4fc54cb 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1506,8 +1506,16 @@ static void i915_error_work_func(struct work_struct *work)
>  		kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE,
>  				   reset_event);
>
> +		/*
> +		 * All state reset _must_ have completed before we update the

"_must_ be completed" - make it active, make it a command!

> +		 * reset counter, for otherwise waiters might miss the reset
> +		 * pending state and not properly drop locks, resuling in

s/resuling/resulting/

> +		 * deadlocks with the reset work.
> +		 */
>  		ret = i915_reset(dev);
>
> +		intel_display_handle_reset(dev);
> +
>  		if (ret == 0) {
>  			/*
>  			 * After all the gem state is reset, increment the reset
> @@ -1531,7 +1539,11 @@ static void i915_error_work_func(struct work_struct *work)
>  			for_each_ring(ring, dev_priv, i)
>  				wake_up_all(&ring->irq_queue);
>
> -			intel_display_handle_reset(dev);
> +			/*
> +			 * intel_display_handle_reset already wakes up all waiters on
> +			 * the dev_priv->pending_flip_queue by (eventually) calling down
> +			 * into do_finish_flip. So no need to wake them up again.
> +			 */

I'm favouring dropping this comment, since calling wake_up() after the
flips have already completed should be a no-op:

  i915_error_wake_up();

>
>  			wake_up_all(&dev_priv->gpu_error.reset_queue);
>  		}
> @@ -1654,10 +1666,14 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
>
>  	/*
>  	 * Wakeup waiting processes so that the reset work item

Care to change this to explicitly reference i915_error_work_func()
instead of "reset work item"? Something like:

  Wakeup processes waiting for completion events from the GPU so
  that the reset work item, i915_error_work_func(), doesn't
  deadlock trying to grab the various locks they hold. By bumping
  the reset counter first, the woken processes will see a reset in
  progress and back off, releasing their locks, and then wait for
  the reset completion. We must do this for _all_ GPU waiters that
  might hold locks that i915_error_work_func() and the reset code
  need to acquire.

works for me.

> -	 * doesn't deadlock trying to grab various locks.
> +	 * doesn't deadlock trying to grab various locks. We must do
> +	 * this for _all_ gpu waiters that might hold locks that the
> +	 * reset work needs to acquire.
>  	 */
>  	for_each_ring(ring, dev_priv, i)
>  		wake_up_all(&ring->irq_queue);
> +
> +	wake_up_all(&dev_priv->pending_flip_queue);

i915_error_wake_up(); or i915_error_wake_up_mutexes()? (Though the
latter name is slightly misleading, it is the intent.)

Not 100% convinced that making use of a common function is right
semantically...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre