Re: S4 resume breakage with i915 driver

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 25 Aug 2016 17:32:41 +0200,
Chris Wilson wrote:
> 
> On Thu, Aug 25, 2016 at 03:11:35PM +0200, Takashi Iwai wrote:
> > Hi,
> > 
> > while debugging our 4.4.x based SLE12-SP2 kernel, we noticed that S4
> > resume is broken on many machines with i915 gfx, even on the upstream
> > 4.7 kernel.  Originally it was reported by Intel about SKL machines,
> > but later on, we found that it hits on many other older chips (at
> > least HSW), too.
> > 
> > This was hard to identify because there have been other S4 resume bugs
> > until recently.  But even after these fixes, when the system is tested
> > on i915 gfx, the system gets memory corruption or kernel Oops sooner
> > or later after a few (usually < 10) S4 cycles.
> > 
> > As the bug happened between 4.2 and 4.3, I bisected and it pointed to
> > the commit:
> > 
> >   4c436d55b279bbc6b02aac02e7dc683fc09f884e
> >     drm/i915: Enable Resource Streamer state save/restore on MI_SET_CONTEXT
> > 
> > Indeed, reverting this on top of our 4.4.x kernel seems to make S4
> > working stably (at least on a test machine).
> > 
> > Does this make any sense to you guys?
> 
> It could explain a HSW issue, but not SKL and not BDW by default (using
> execlists).

Hrm, IIRC, SKL didn't work without the commit.  But it doesn't matter
if this doesn't has a bad influence on SKL, either.

> All machine big core, no Braswell or other !llc device?

Yeah, tested only on SKL, HSW and IVY.  No small boxes.

> > Since the commit message doesn't give a good explanation about this
> > change, I wonder what's the purpose of this commit.  Was it merely
> > optimization?
> 
> It's a step towards enabling the Resource Streamer hardware block in the
> GPU, kind of a glorified prefetch. Userspace also has to notify the GPU
> to enable it for a batch as well.
> 
> The only doubt in mind my about that patch was whether the hw ignored
> the RS_RESTORE_STATE if MI_RESTORE_INHIBIT as expected:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 3c97f0e7a003..c618bb86aeb9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -574,10 +574,15 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>  
>         /* These flags are for resource streamer on HSW+ */
>         if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
> -               flags |= (HSW_MI_RS_SAVE_STATE_EN | HSW_MI_RS_RESTORE_STATE_EN);
> +               flags |= HSW_MI_RS_SAVE_STATE_EN;
>         else if (INTEL_GEN(dev_priv) < 8)
> -               flags |= (MI_SAVE_EXT_STATE_EN | MI_RESTORE_EXT_STATE_EN);
> -
> +               flags |= MI_SAVE_EXT_STATE_EN;
> +       if ((flags & MI_RESTORE_INHIBIT) == 0) {
> +               if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
> +                       flags |= HSW_MI_RS_RESTORE_STATE_EN;
> +               else if (INTEL_GEN(dev_priv) < 8)
> +                       flags |= MI_RESTORE_EXT_STATE_EN;
> +       }
>  
> 
> But this code is only executed for Haswell, and we only inhibit restore
> from uninitialised contexts. The biggest fear is that we are restoring
> garbage from userspace and making the GPU do strange things.
> 
> That still doesn't explain stray writes into system memory, that is more
> likely our GTT being broken.

Yeah, that's my expectation.  But the bisection didn't reach it.
So maybe this casually surfaced the hidden GTT or cache issue.

> Could you confirm that bisect has any
> impact on the other machines, and of course double check the result?

You're asking bisection on all machines from the scratch for such a
bug taking so long time to reproduce, and especially for i915 code
path, that is known to be deadly difficult due to various merge
commits?  I sincerely decline the offer :)

Yes, the result was double-checked.  This has a positive effect on all
our tested machines.  Maybe But it's hard to tell exactly whether this is
the 100% culprit.  As said, there have been multiple S4 bugs, so far.
IVY worked without this patch (after x86 fixes), but obviously this
had no negative effect, either.


thanks,

Takashi
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux