Re: S4 resume breakage with i915 driver

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Thu, 25 Aug 2016 16:32:41 +0100

On Thu, Aug 25, 2016 at 03:11:35PM +0200, Takashi Iwai wrote:
> Hi,
> 
> while debugging our 4.4.x based SLE12-SP2 kernel, we noticed that S4
> resume is broken on many machines with i915 gfx, even on the upstream
> 4.7 kernel.  Originally it was reported by Intel about SKL machines,
> but later on, we found that it hits on many other older chips (at
> least HSW), too.
> 
> This was hard to identify because there have been other S4 resume bugs
> until recently.  But even after these fixes, when the system is tested
> on i915 gfx, the system gets memory corruption or kernel Oops sooner
> or later after a few (usually < 10) S4 cycles.
> 
> As the bug happened between 4.2 and 4.3, I bisected and it pointed to
> the commit:
> 
>   4c436d55b279bbc6b02aac02e7dc683fc09f884e
>     drm/i915: Enable Resource Streamer state save/restore on MI_SET_CONTEXT
> 
> Indeed, reverting this on top of our 4.4.x kernel seems to make S4
> working stably (at least on a test machine).
> 
> Does this make any sense to you guys?

It could explain a HSW issue, but not SKL and not BDW by default (using
execlists).

All machine big core, no Braswell or other !llc device?
 
> Since the commit message doesn't give a good explanation about this
> change, I wonder what's the purpose of this commit.  Was it merely
> optimization?

It's a step towards enabling the Resource Streamer hardware block in the
GPU, kind of a glorified prefetch. Userspace also has to notify the GPU
to enable it for a batch as well.

The only doubt in mind my about that patch was whether the hw ignored
the RS_RESTORE_STATE if MI_RESTORE_INHIBIT as expected:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 3c97f0e7a003..c618bb86aeb9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -574,10 +574,15 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
 
        /* These flags are for resource streamer on HSW+ */
        if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
-               flags |= (HSW_MI_RS_SAVE_STATE_EN | HSW_MI_RS_RESTORE_STATE_EN);
+               flags |= HSW_MI_RS_SAVE_STATE_EN;
        else if (INTEL_GEN(dev_priv) < 8)
-               flags |= (MI_SAVE_EXT_STATE_EN | MI_RESTORE_EXT_STATE_EN);
-
+               flags |= MI_SAVE_EXT_STATE_EN;
+       if ((flags & MI_RESTORE_INHIBIT) == 0) {
+               if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
+                       flags |= HSW_MI_RS_RESTORE_STATE_EN;
+               else if (INTEL_GEN(dev_priv) < 8)
+                       flags |= MI_RESTORE_EXT_STATE_EN;
+       }
 

But this code is only executed for Haswell, and we only inhibit restore
from uninitialised contexts. The biggest fear is that we are restoring
garbage from userspace and making the GPU do strange things.

That still doesn't explain stray writes into system memory, that is more
likely our GTT being broken. Could you confirm that bisect has any
impact on the other machines, and of course double check the result?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx