Re: [PATCH] drm/i916: fix wait_for_pending_flips vs gpu hang deadlock

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Sun, 8 Sep 2013 14:20:08 +0100

On Sun, Sep 08, 2013 at 02:52:10PM +0200, Daniel Vetter wrote:
> My g33 here seems to be shockingly good at hitting them all. This time
> around kms_flip/flip-vs-panning-vs-hang blows up:
> 
> intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and
> if a gpu hang is pending aborts the wait for outstanding flips so that
> the setcrtc call will succeed and release the crtc mutex. And the gpu
> hang handler needs that lock in intel_display_handle_reset to be able
> to complete outstanding flips.
> 
> The problem is that we can race in two ways:
> - Waiters on the dev_priv->pending_flip_queue aren't woken up after
>   we've the reset as pending, but before we actually start the reset
>   work. This means that the waiter doesn't notice the pending reset
>   and hence will keep on hogging the locks.
> 
>   Like with dev->struct_mutex and the ring->irq_queue wait queues we
>   there need to wake up everyone that potentially holds a lock which
>   the reset handler needs.
> 
> - intel_display_handle_reset was called _after_ we've already
>   signalled the completion of the reset work. Which means a waiter
>   could sneak in, grab the lock and never release it (since the
>   pageflips won't ever get released).
> 
>   Similar to resetting the gem state all the reset work must complete
>   before we update the reset counter. Contrary to the gem reset we
>   don't need to have a second explicit wake up call since that will
>   have happened already when completing the pageflips. We also don't
>   have any issues that the completion happens while the reset state is
>   still pending - wait_for_pending_flips is only there to ensure we
>   display the right frame. After a gpu hang&reset events such
>   guarantees are out the window anyway. This is in contrast to the gem
>   code where too-early wake-up would result in unnecessary restarting
>   of ioctls.
> 
> Also, since we've gotten these various deadlocks and ordering
> constraints wrong so often throw copious amounts of comments at the
> code.
> 
> This deadlock regression has been introduced in the commit which added
> the pageflip reset logic to the gpu hang work:
> 
> commit 96a02917a0131e52efefde49c2784c0421d6c439
> Author: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
> Date:   Mon Feb 18 19:08:49 2013 +0200
> 
>     drm/i915: Finish page flips and update primary planes after a GPU reset
> 
> v2:
> - Add comments to explain how the wake_up serves as memory barriers
>   for the atomic_t reset counter.
> - Improve the comments a bit as suggested by Chris Wilson.
> - Extract the wake_up calls before/after the reset into a little
>   i915_error_wake_up and unconditionally wake up the
>   pending_flip_queue waiters, again as suggested by Chris Wilson.
> 
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 55 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 41 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 83cce0c..2331739 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1469,6 +1469,21 @@ static irqreturn_t ironlake_irq_handler(int irq, void *arg)
>  	return ret;
>  }
>  
> +static void i915_error_wake_up(struct drm_i915_private *dev_priv,
> +			       bool reset_completed)
> +{
> +	struct intel_ring_buffer *ring;
> +	int i;

Might as well make that memory barrier explicit:
smb_mb__after_atomic_inc();

The cost of an additional mb is insignificant here.

Something like:

/* Notify all waiters for GPU completion events that reset state has
 * been changed, and that they need to restart their wait after
 * checking for potential errors.
 */

I don't mind repeating ourselves as this logic impacts faraway logic in
i915_gem and intel_display etc.

/* Wakeup __wait_seqno() [dev->struct_mutex] */
> +	for_each_ring(ring, dev_priv, i)
> +		wake_up_all(&ring->irq_queue);
> +

/* Wakeup intel_crtc_wait_for_pending_flips() [dev->mode_config.lock] */
> +	wake_up_all(&dev_priv->pending_flip_queue);
> +

/* Wakeup our error checking mutex after we have finished reseting the GPU,
   i915_gem_wait_for_error() [no locks held by the waiters] */

> +	if (reset_completed)
> +		wake_up_all(&dev_priv->gpu_error.reset_queue);
> +}

With some additional comments to drive home i915_error_wake_up(),
Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html