Re: [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Fri, 12 May 2017 22:09:23 +0100



On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote:
> On 5/8/2017 11:31 AM, Michel Thierry wrote:
> >On 4/29/2017 7:19 AM, Chris Wilson wrote:
> >>On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
> >>>From: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx>
> >>>
> ...
> >>>+    }
> >>>+
> >>>     intel_prepare_reset(dev_priv);
> >>>
> >>>     set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
> >>>@@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
> >>>drm_i915_private *dev_priv)
> >>>         kobject_uevent_env(kobj,
> >>>                    KOBJ_CHANGE, reset_done_event);
> >>>
> >>>+finish:
> >>>     /*
> >>>      * Note: The wake_up also serves as a memory barrier so that
> >>>      * waiters see the updated value of the dev_priv->gpu_error.
> >>>@@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
> >>>*dev_priv,
> >>>                  &dev_priv->gpu_error.flags))
> >>>         goto out;
> >>>
> >>>-    i915_reset_and_wakeup(dev_priv);
> >>>+    i915_reset_and_wakeup(dev_priv, engine_mask);
> >>
> >>? You don't need to wakeup the struct_mutex so we don't need this after
> >>per-engine resets. Time to split up i915_reset_and_wakeup(), because we
> >>certainly shouldn't be calling intel_finish_reset() without first calling
> >>intel_prepare_reset(). Which is right here in my tree...
> >>
> >
> >Looking at your tree, it wouldn't call finish_reset there either, only
> >these two are called after a successful reset:
> >
> >finish:
> >    clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
> >    wake_up_all(&dev_priv->gpu_error.reset_queue);
> >
> >But you're right, we only need to clear the error flag, no need to call
> >wake_up_all.
> >
> >Should I move the per-engine reset to i915_handle_error, and then leave
> >i915_reset_and_wakeup just for full resets?
> >That would also make the promotion from per-engine to global look a bit
> >'clearer'.
> >
> 
> I just noticed an issue if I don't call wake_up_all. There can be
> someone else waiting for the reset to complete
> (i915_mutex_lock_interruptible -> i915_gem_wait_for_error).
> 
> I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the
> struct mutex (which we won't need in reset-engine) and prevent two
> concurrent reset attempts (which we still want). Time to add a new
> flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?)

Yes, that would be a good idea to avoid dual purposing the bits. Now
that we do direct resets along the wait path, we can completely drop the
i915_mutex_interruptible(). (No one else should be holding the mutex
indefinitely.) I think that's a better approach -- I think we've already
moved all the EIO magic aware to the ABI points where we deemed it
necessary.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx