Re: regression since v4.9-rcX: "Resetting chip after gpu hang" times out(?) and is repeated every 20th second

Mark Janes <mark.a.janes@xxxxxxxxx> · Tue, 17 Jan 2017 08:37:07 -0800

Bjørn Mork <bjorn@xxxxxxx> writes:

> Hello,
>
> I've been having occasional GPU HANGs on my Skylake laptop ever since I
> got it, originally reported here:
> https://bugs.freedesktop.org/show_bug.cgi?id=96894

Several similar bugs have been resolved recently.  I apologize for
missing this one.

I'll update this bug with a request for more information.

> But this is not why I am writing to this list.  The HANGs used to be
> resolved nicely by the driver up to and including v4.8.  The GPU was
> reset and that was that.  A noticeable hang for a few seconds, and the
> usual log messages, but that was it.  I could easily live with it.
>
> v4.9-rcX changed that, making the HANGs a real showstopper: The
> GPU reset started failing. From the log messages, it looks like the reset
> times out and is repeated every 20 seconds or so, "forever". Something will
> give up and kill the X server in the end, resolving the hang with an X
> server restart.
>
> [19308.656674] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1171], reason: Hang on render ring, action: reset
> [19308.656769] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
> [19308.656770] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
> [19308.656771] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
> [19308.656772] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
> [19308.656773] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> [19308.657131] drm/i915: Resetting chip after gpu hang
> [19308.657752] [drm] RC6 on
> [19308.677139] [drm] GuC firmware load skipped
> [19328.645312] drm/i915: Resetting chip after gpu hang
> [19328.649380] [drm] RC6 on
> [19328.668497] [drm] GuC firmware load skipped
> [19348.612672] drm/i915: Resetting chip after gpu hang
> [19348.613017] [drm] RC6 on
> [19348.630830] [drm] GuC firmware load skipped
> [19364.612475] drm/i915: Resetting chip after gpu hang
> [19364.614544] [drm] RC6 on
> [19364.629781] [drm] GuC firmware load skipped
> [19382.660101] drm/i915: Resetting chip after gpu hang
> [19382.660955] [drm] RC6 on
> [19382.680661] [drm] GuC firmware load skipped
> [19402.628876] drm/i915: Resetting chip after gpu hang
> [19402.629229] [drm] RC6 on
> [19402.643134] [drm] GuC firmware load skipped
> [19422.660054] drm/i915: Resetting chip after gpu hang
> [19422.660419] [drm] RC6 on
> [19422.675415] [drm] GuC firmware load skipped
> [19440.644097] drm/i915: Resetting chip after gpu hang
> [19440.644558] [drm] RC6 on
> [19440.663878] [drm] GuC firmware load skipped
> [19458.627752] drm/i915: Resetting chip after gpu hang
> [19458.634024] [drm] RC6 on
> [19458.650700] [drm] GuC firmware load skipped
> [19478.659877] drm/i915: Resetting chip after gpu hang
> [19478.665303] [drm] RC6 on
> [19478.684634] [drm] GuC firmware load skipped
> [19498.627632] drm/i915: Resetting chip after gpu hang
> [19498.634862] [drm] RC6 on
> [19498.653638] [drm] GuC firmware load skipped
> [19510.659670] drm/i915: Resetting chip after gpu hang
> [19510.665894] [drm] RC6 on
> [19510.680479] [drm] GuC firmware load skipped
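
To check the "every 20th second" reading of the log above, the gaps between
consecutive "Resetting chip after gpu hang" lines can be computed from the
timestamps. The following is only a minimal sketch: the function name
`reset_intervals` and the shortened `LOG` sample are illustrative assumptions,
not part of any i915 tooling.

```python
import re

# Matches a dmesg-style timestamp followed by the reset message seen in the log.
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+drm/i915: Resetting chip after gpu hang"
)


def reset_intervals(log_text):
    """Return the gaps, in seconds, between consecutive reset messages."""
    stamps = [float(m.group(1)) for m in RESET_RE.finditer(log_text)]
    return [round(later - earlier, 3)
            for earlier, later in zip(stamps, stamps[1:])]


# Shortened sample copied from the quoted log (hypothetical variable name).
LOG = """\
[19308.657131] drm/i915: Resetting chip after gpu hang
[19308.657752] [drm] RC6 on
[19328.645312] drm/i915: Resetting chip after gpu hang
[19348.612672] drm/i915: Resetting chip after gpu hang
[19364.612475] drm/i915: Resetting chip after gpu hang
"""

if __name__ == "__main__":
    print(reset_intervals(LOG))
```

Fed the full excerpt above, the gaps come out between roughly 12 and 20
seconds, so the resets recur at a similar but not perfectly fixed period.
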
>
>
> Having a multi-minute hang followed by losing every running X client is
> obviously a lot worse than a simple GPU reset. This makes the i915
> driver after v4.8 unusable to me...
>
> The earliest v4.9-rc I tested was v4.9-rc5, so that's the earliest
> version I know has this issue.  The issue is still present in v4.10-rc4.
>
> I would love to be able to be more precise about when this bug was
> introduced, but the triggering HANG issues are just rare enough to make
> anything like git bisect impossible.  The current frequency is only once
> or twice a week.  More than enough to make me lose my hair, but far from
> often enough for any systematic testing of versions or patches.
>
> Trying to force a HANG by writing to /sys/kernel/debug/dri/0/i915_wedged
> did not have the same effect. This only caused a single reset message
> and everything was immediately OK.  Possibly because I don't know what
> mask to write to i915_wedged.  Is there any way to figure that
> out based on the /sys/class/drm/card0/error from the real hang?  Or any
> other way to guess it?
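
As a starting point for comparing a real hang with a forced one, the fields of
the "GPU HANG" log line (ecode, process, pid, reason, action) can be pulled
out mechanically. This is a sketch only: `parse_hang` is a hypothetical helper,
the pattern is fitted to the single line quoted in this thread, and it does not
by itself say which value i915_wedged expects.

```python
import re

# Pattern fitted to the one "GPU HANG" line quoted in this thread; other
# kernel versions may format the message differently.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\S+), in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_hang(line):
    """Return the fields of a GPU HANG log line as a dict, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    return info


if __name__ == "__main__":
    sample = ("[19308.656674] [drm] GPU HANG: ecode 9:0:0x84dfbffc, "
              "in Xorg [1171], reason: Hang on render ring, action: reset")
    print(parse_hang(sample))
```

The `reason` field names the ring the driver blamed ("Hang on render ring" in
the report above), which at least narrows down which engine a forced hang
would need to target.
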
>
>
> Please let me know if there is anything I can do to debug this problem
> further, or if there are known workarounds.
>
>
>
> Bjørn
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx