regression since v4.9-rcX: "Resetting chip after gpu hang" times out(?) and is repeated every 20th second

Bjørn Mork <bjorn@xxxxxxx> · Mon, 16 Jan 2017 15:49:10 +0100

Hello,

I've been having occasional GPU HANGs on my Skylake laptop ever since I
got it, originally reported here:
https://bugs.freedesktop.org/show_bug.cgi?id=96894

But that is not why I am writing to this list.  Up to and including
v4.8, the driver resolved these HANGs nicely.  The GPU was reset and
that was that: a noticeable hang for a few seconds and the usual log
messages, nothing more.  I could easily live with it.

v4.9-rcX changed that, turning the HANGs into a real show stopper: the
GPU reset started failing. From the log messages, it looks like the
reset times out and is retried roughly every 20 seconds "forever".
Eventually something gives up and kills the X server, and the hang is
only resolved by the X server restart.

[19308.656674] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in Xorg [1171], reason: Hang on render ring, action: reset
[19308.656769] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[19308.656770] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[19308.656771] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[19308.656772] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[19308.656773] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[19308.657131] drm/i915: Resetting chip after gpu hang
[19308.657752] [drm] RC6 on
[19308.677139] [drm] GuC firmware load skipped
[19328.645312] drm/i915: Resetting chip after gpu hang
[19328.649380] [drm] RC6 on
[19328.668497] [drm] GuC firmware load skipped
[19348.612672] drm/i915: Resetting chip after gpu hang
[19348.613017] [drm] RC6 on
[19348.630830] [drm] GuC firmware load skipped
[19364.612475] drm/i915: Resetting chip after gpu hang
[19364.614544] [drm] RC6 on
[19364.629781] [drm] GuC firmware load skipped
[19382.660101] drm/i915: Resetting chip after gpu hang
[19382.660955] [drm] RC6 on
[19382.680661] [drm] GuC firmware load skipped
[19402.628876] drm/i915: Resetting chip after gpu hang
[19402.629229] [drm] RC6 on
[19402.643134] [drm] GuC firmware load skipped
[19422.660054] drm/i915: Resetting chip after gpu hang
[19422.660419] [drm] RC6 on
[19422.675415] [drm] GuC firmware load skipped
[19440.644097] drm/i915: Resetting chip after gpu hang
[19440.644558] [drm] RC6 on
[19440.663878] [drm] GuC firmware load skipped
[19458.627752] drm/i915: Resetting chip after gpu hang
[19458.634024] [drm] RC6 on
[19458.650700] [drm] GuC firmware load skipped
[19478.659877] drm/i915: Resetting chip after gpu hang
[19478.665303] [drm] RC6 on
[19478.684634] [drm] GuC firmware load skipped
[19498.627632] drm/i915: Resetting chip after gpu hang
[19498.634862] [drm] RC6 on
[19498.653638] [drm] GuC firmware load skipped
[19510.659670] drm/i915: Resetting chip after gpu hang
[19510.665894] [drm] RC6 on
[19510.680479] [drm] GuC firmware load skipped
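
To check the interval between reset attempts, the timestamps can be
pulled out of a log like the one above. This is a minimal Python sketch.
The names reset_intervals and RESET_RE are made up for illustration, and
it assumes the usual "[seconds.micros] message" dmesg line format.

```python
import re

# Matches the reset message from the log above and captures the dmesg
# timestamp. Assumes the standard "[seconds.micros] text" prefix.
RESET_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+drm/i915: Resetting chip after gpu hang"
)


def reset_intervals(log_text):
    """Return the gaps in seconds between consecutive reset messages."""
    stamps = []
    for line in log_text.splitlines():
        match = RESET_RE.match(line.strip())
        if match:
            stamps.append(float(match.group(1)))
    # Pair each timestamp with its successor and take the difference.
    return [round(later - earlier, 3)
            for earlier, later in zip(stamps, stamps[1:])]


# First three reset attempts, copied from the log above.
sample = """\
[19308.657131] drm/i915: Resetting chip after gpu hang
[19308.657752] [drm] RC6 on
[19308.677139] [drm] GuC firmware load skipped
[19328.645312] drm/i915: Resetting chip after gpu hang
[19328.649380] [drm] RC6 on
[19328.668497] [drm] GuC firmware load skipped
[19348.612672] drm/i915: Resetting chip after gpu hang
"""

if __name__ == "__main__":
    print(reset_intervals(sample))
```

Feeding it the full dmesg output instead of the sample shows whether
the gap really is constant or just close to 20 seconds.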

A multi-minute hang followed by the loss of every running X client is
obviously a lot worse than a simple GPU reset. This makes the i915
driver after v4.8 unusable for me...

The earliest v4.9-rc I tested was v4.9-rc5, so that's the earliest
version I know has this issue.  The issue is still present in v4.10-rc4.

I would love to be more precise about when this bug was introduced, but
the triggering HANGs are just rare enough to make anything like git
bisect impossible.  They currently happen only once or twice a week:
more than enough to make me lose my hair, but far too seldom for any
systematic testing of versions or patches.

Trying to force a HANG by writing to /sys/kernel/debug/dri/0/i915_wedged
did not have the same effect. It only caused a single reset message,
and everything was immediately OK.  Possibly that is because I don't
know what mask to write to i915_wedged.  Is there any way to figure that
out based on the /sys/class/drm/card0/error from the real hang?  Or any
other way to guess it?
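
For reference, the forced-hang attempt amounts to something like the
following minimal Python sketch. The helper name write_wedged is made
up for illustration. That the file accepts a decimal integer, and that
the value is treated as a mask, are assumptions here, not something
confirmed from the driver source. It needs root and a mounted debugfs.

```python
# Path taken from the report above; card index 0 is assumed.
WEDGED_PATH = "/sys/kernel/debug/dri/0/i915_wedged"


def write_wedged(mask, path=WEDGED_PATH):
    """Write an integer value to i915_wedged and return the text written.

    ASSUMPTION: the debugfs file parses a plain decimal integer. What the
    individual bits mean is exactly the open question in this report.
    """
    if mask < 0:
        raise ValueError("mask must be non-negative")
    text = f"{mask}\n"
    with open(path, "w") as wedged:
        wedged.write(text)
    return text


# Example (as root):
#     write_wedged(1)
```

The path is a parameter so the helper can be pointed at a scratch file
when trying it out without touching the real device.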

Please let me know if there is anything I can do to debug this problem
further, or if there are known workarounds.

Bjørn
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx