Re: [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery

Michel Thierry <michel.thierry@xxxxxxxxx> · Fri, 12 May 2017 13:55:11 -0700




On 5/8/2017 11:31 AM, Michel Thierry wrote:
On 4/29/2017 7:19 AM, Chris Wilson wrote:
On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
From: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx>

...
+    }
+
     intel_prepare_reset(dev_priv);

     set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
@@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
drm_i915_private *dev_priv)
         kobject_uevent_env(kobj,
                    KOBJ_CHANGE, reset_done_event);

+finish:
     /*
      * Note: The wake_up also serves as a memory barrier so that
      * waiters see the updated value of the dev_priv->gpu_error.
@@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
*dev_priv,
                  &dev_priv->gpu_error.flags))
         goto out;

-    i915_reset_and_wakeup(dev_priv);
+    i915_reset_and_wakeup(dev_priv, engine_mask);

? You don't need to wakeup the struct_mutex so we don't need this after
per-engine resets. Time to split up i915_reset_and_wakeup(), because we
certainly shouldn't be calling intel_finish_reset() without first calling
intel_prepare_reset(). Which is right here in my tree...


Looking at your tree, it wouldn't call finish_reset there either, only
these two are called after a successful reset:

finish:
    clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
    wake_up_all(&dev_priv->gpu_error.reset_queue);

But you're right, we only need to clear the error flag, no need to call
wake_up_all.

Should I move the per-engine reset to i915_handle_error, and then leave
i915_reset_and_wakeup just for full resets?
That would also make the promotion from per-engine to global look a bit
'clearer'.


I just noticed an issue if I don't call wake_up_all. There can be 
someone else waiting for the reset to complete 
(i915_mutex_lock_interruptible -> i915_gem_wait_for_error).

I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the 
struct mutex (which we won't need in reset-engine) and prevent two 
concurrent reset attempts (which we still want). Time to add a new flag 
for the later? (I915_RESET_ENGINE_IN_PROGRESS?)

Here's an example without calling wake_up_all (10s timeout):
[  126.816054] [drm:i915_reset_engine [i915]] resetting rcs0
...
[  137.499910] [IGT] gem_ringfill: exiting, ret=0

Compared to the one that does,
[   69.799519] [drm:i915_reset_engine [i915]] resetting rcs0
...
[   69.801335] [IGT] gem_tdr: exiting, ret=0

Thanks,

-Michel





_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx