My g33 here seems to be shockingly good at hitting them all. This time
around kms_flip/flip-vs-panning-vs-hang blows up:

intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and, if
a gpu hang is pending, aborts the wait for outstanding flips so that the
setcrtc call will succeed and release the crtc mutex. And the gpu hang
handler needs that lock in intel_display_handle_reset to be able to
complete outstanding flips.

The problem is that we can race in two ways:

- Waiters on the dev_priv->pending_flip_queue aren't woken up after
  we've marked the reset as pending, but before we actually start the
  reset work. This means that the waiter doesn't notice the pending
  reset and hence will keep on hogging the locks. Like with
  dev->struct_mutex and the ring->irq_queue wait queues we therefore
  need to wake up everyone that potentially holds a lock which the
  reset handler needs.

- intel_display_handle_reset was called _after_ we've already signalled
  the completion of the reset work, which means a waiter could sneak in,
  grab the lock and never release it (since the pageflips won't ever get
  completed). Similar to resetting the gem state, all the reset work
  must complete before we update the reset counter.

  Contrary to the gem reset we don't need a second explicit wake-up
  call, since that will have happened already when completing the
  pageflips. We also don't have any issue with the completion happening
  while the reset state is still pending - wait_for_pending_flips is
  only there to ensure we display the right frame. After a gpu hang &
  reset such guarantees are out the window anyway. This is in contrast
  to the gem code, where a too-early wake-up would result in
  unnecessary restarting of ioctls.

(Both races are illustrated with simplified sketches after the patch
below.)

Also, since we've gotten these various deadlocks and ordering
constraints wrong so often, throw copious amounts of comments at the
code.

This deadlock regression has been introduced in the commit which added
the pageflip reset logic to the gpu hang work:

commit 96a02917a0131e52efefde49c2784c0421d6c439
Author: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
Date:   Mon Feb 18 19:08:49 2013 +0200

    drm/i915: Finish page flips and update primary planes after a GPU reset

Cc: stable@xxxxxxxxxxxxxxx
Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
---
 drivers/gpu/drm/i915/i915_irq.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 83cce0c..4fc54cb 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1506,8 +1506,16 @@ static void i915_error_work_func(struct work_struct *work)
 		kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE,
 				   reset_event);
 
+		/*
+		 * All state reset _must_ have completed before we update the
+		 * reset counter, for otherwise waiters might miss the reset
+		 * pending state and not properly drop locks, resulting in
+		 * deadlocks with the reset work.
+		 */
 		ret = i915_reset(dev);
 
+		intel_display_handle_reset(dev);
+
 		if (ret == 0) {
 			/*
 			 * After all the gem state is reset, increment the reset
@@ -1531,7 +1539,11 @@ static void i915_error_work_func(struct work_struct *work)
 			for_each_ring(ring, dev_priv, i)
 				wake_up_all(&ring->irq_queue);
 
-			intel_display_handle_reset(dev);
+			/*
+			 * intel_display_handle_reset already wakes up all waiters on
+			 * the dev_priv->pending_flip_queue by (eventually) calling down
+			 * into do_finish_flip. So no need to wake them up again.
+			 */
 			wake_up_all(&dev_priv->gpu_error.reset_queue);
 		}
 
@@ -1654,10 +1666,14 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
 
 		/*
 		 * Wakeup waiting processes so that the reset work item
-		 * doesn't deadlock trying to grab various locks.
+		 * doesn't deadlock trying to grab various locks. We must do
+		 * this for _all_ gpu waiters that might hold locks that the
+		 * reset work needs to acquire.
 		 */
 		for_each_ring(ring, dev_priv, i)
 			wake_up_all(&ring->irq_queue);
+
+		wake_up_all(&dev_priv->pending_flip_queue);
 	}
 
 	/*
-- 
1.8.4.rc3
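
For readers who want the deadlock shape without digging through the i915
code, here is a minimal, self-contained sketch of the first race. All
names (fake_gpu, flip_pending, set_mode, handle_gpu_hang, ...) are made
up for illustration and only loosely mirror the real driver structures;
the point is the wait-condition/wake-up pairing, not the actual API.

/* Simplified sketch of the lost-wakeup race - not the real i915 code. */
#include <linux/wait.h>
#include <linux/mutex.h>
#include <linux/atomic.h>

struct fake_gpu {
	wait_queue_head_t pending_flip_queue;	/* ~ dev_priv->pending_flip_queue */
	wait_queue_head_t reset_queue;		/* ~ dev_priv->gpu_error.reset_queue */
	struct mutex crtc_mutex;		/* stands in for the crtc lock */
	atomic_t reset_pending;
	bool flip_pending;
};

/*
 * Waiter side (think intel_crtc_wait_for_pending_flips): the condition
 * must also become true when a reset is pending, so the waiter bails
 * out and releases crtc_mutex instead of sleeping on a flip that the
 * hung gpu will never complete.
 */
static void wait_for_pending_flips(struct fake_gpu *gpu)
{
	wait_event(gpu->pending_flip_queue,
		   !gpu->flip_pending || atomic_read(&gpu->reset_pending));
}

/* A typical caller holds the crtc lock across the wait - exactly the
 * lock the reset work needs later on. */
static void set_mode(struct fake_gpu *gpu)
{
	mutex_lock(&gpu->crtc_mutex);
	wait_for_pending_flips(gpu);
	/* ... program the new configuration ... */
	mutex_unlock(&gpu->crtc_mutex);
}

/*
 * Hang-detection side (think i915_handle_error): after marking the
 * reset as pending we must wake _every_ wait queue whose waiters might
 * hold a lock the reset work needs - including pending_flip_queue.
 * Without the wake-up the waiter never re-evaluates its condition and
 * therefore never sees reset_pending.
 */
static void handle_gpu_hang(struct fake_gpu *gpu)
{
	atomic_set(&gpu->reset_pending, 1);
	wake_up_all(&gpu->pending_flip_queue);
	/* ... then queue the reset work ... */
}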
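
The second race is purely an ordering problem inside the reset work
itself. Continuing the sketch above (do_gpu_reset() and
finish_all_pending_flips() are hypothetical placeholders, not i915
functions), the display fixup has to run before the "reset done" signal:

/* Placeholders for the actual reset and display fixup - assumptions only. */
static void do_gpu_reset(struct fake_gpu *gpu);
static void finish_all_pending_flips(struct fake_gpu *gpu);

/*
 * Reset work side (think i915_error_work_func): all reset and display
 * fixup work must be finished before we clear the pending state and
 * wake the reset waiters.
 */
static void reset_work(struct fake_gpu *gpu)
{
	do_gpu_reset(gpu);

	/*
	 * Takes crtc_mutex (which the waiter above only dropped because
	 * it saw reset_pending), completes the stuck flips and thereby
	 * wakes pending_flip_queue - so no second wake-up is needed for
	 * that queue afterwards.
	 */
	finish_all_pending_flips(gpu);

	/*
	 * Only now signal completion. If we did this before the display
	 * fixup, a fresh waiter could observe "no reset pending", grab
	 * crtc_mutex and go back to sleep on a flip that will never
	 * complete - and the fixup would then deadlock on crtc_mutex.
	 */
	atomic_set(&gpu->reset_pending, 0);
	wake_up_all(&gpu->reset_queue);
}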