Hi all,

So I've noticed again that the hangman test was failing on some machines
here, and tracked it down to the new lockless wait code.

Closer inspection showed that we've relied on the single
dev->struct_mutex to order things correctly between waiters and the
reset code. But with that lock grabbing gone, the entire reset could
happen before the waiter wakes up, and hence the waiter never sees a
non-zero wedged value. Which means it'll go right back to sleep, waiting
for a seqno which just got cleared out by the reset code.

Looking at the code I've declared the entire thing too ad-hoc and
revamped it, adding comments explaining what's going on all over the
place and auditing for tiny races everywhere. Hopefully I've caught them
all; at least the machines that previously hung after reset are now
happily going through a few hundred reset cycles!

Comments, flames and especially review highly welcome. For fun (hey, let
me have it!) I've thrown in some "let's move stuff around a bit" patches
at the beginning ;-)

Cheers, Daniel

Daniel Vetter (5):
  drm/i915: move dev_priv->mm out of line
  drm/i915: extract hangcheck/reset/error_state state into substruct
  drm/i915: move wedged to the other gpu error handling stuff
  drm/i915: clear up wedged transitions
  drm/i915: create a race-free reset detection

 drivers/gpu/drm/i915/i915_debugfs.c     |  12 +-
 drivers/gpu/drm/i915/i915_dma.c         |   9 +-
 drivers/gpu/drm/i915/i915_drv.c         |   8 +-
 drivers/gpu/drm/i915/i915_drv.h         | 274 ++++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_gem.c         | 110 +++++++------
 drivers/gpu/drm/i915/i915_irq.c         |  89 +++++++----
 drivers/gpu/drm/i915/intel_display.c    |   4 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c |   8 +-
 8 files changed, 297 insertions(+), 217 deletions(-)

-- 
1.7.11.4
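
For anyone trying to picture the race described above, here is a minimal userspace sketch. It is not code from the patches: struct gpu_error, reset_counter and the helper names are all hypothetical, and the "waiter" is a single-threaded replay of the bad interleaving (the full reset runs while the waiter is asleep). It contrasts a waiter that only checks wedged after waking with one that also samples a per-reset counter before sleeping, which is one possible way to notice a reset that has already completed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's error state, not the real layout. */
struct gpu_error {
    atomic_uint wedged;        /* non-zero only while a reset is in flight */
    atomic_uint reset_counter; /* assumption: bumped once per completed reset */
};

/* Replays a complete reset that happens while the waiter is still asleep. */
static void simulate_full_reset(struct gpu_error *e)
{
    atomic_store(&e->wedged, 1);            /* hang detected */
    /* ... seqnos are cleared and the hardware is reinitialised here ... */
    atomic_fetch_add(&e->reset_counter, 1); /* record that a reset happened */
    atomic_store(&e->wedged, 0);            /* reset finished */
}

/* Racy waiter: only looks at wedged after it wakes up. */
static bool waiter_sees_reset_racy(struct gpu_error *e)
{
    simulate_full_reset(e); /* entire reset completes before the wakeup */
    return atomic_load(&e->wedged) != 0;
}

/* Counter-based waiter: samples the counter before going to sleep. */
static bool waiter_sees_reset_counter(struct gpu_error *e)
{
    unsigned int before = atomic_load(&e->reset_counter);

    simulate_full_reset(e); /* same interleaving as above */
    return atomic_load(&e->wedged) != 0 ||
           atomic_load(&e->reset_counter) != before;
}
```

With this interleaving the racy waiter reports no reset and would go back to sleep on a stale seqno, while the counter-based waiter notices that a reset completed behind its back.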