In preparation for the upcoming enablement of TDR per-engine hang
recovery, the stability of the error state capture code needs to be
addressed. The main reason is that testing TDR requires a long-duration
test that runs for several hours, during which a large number of hangs
are handled together with the associated error state captures. In its
current state the i915 driver experiences various forms of kernel panics
and other kinds of fatal errors within the first hour(s) of hang testing.

The patches in this series have been tested with a long-duration hang
test running for 12+ hours and should suffice as an initial improvement.
The underlying issue of trying to capture the driver state without
synchronization remains to be fixed.

One way of at least further alleviating this problem, suggested by John
Harrison, is to attempt a mutex_trylock() of the struct_mutex for a while
(give it a second or so) before going into the error state capture from
i915_handle_error(). Then, if nobody is holding the struct_mutex, the
error state capture is considerably safer from sudden state changes. If
some thread has hung while holding the struct_mutex, one could at least
hope that there would be no sudden state changes during error state
capture due to the hung state (unless some thread has been caught in a
livelock or is perhaps not stuck at all but is simply running for a very
long time; still, some improvement might be expected here).

One fix that has been omitted from this patch series concerns the broken
ring space calculation following a full GPU reset. Two independent
patches to solve this are:

  "[PATCH] drm/i915: Update ring space correctly on lrc context reset"
  by Mika Kuoppala

  "[51/70] drm/i915: Record the position of the start of the request"
  by Chris Wilson

Since the solution is currently in review I'll simply mention it here as
a prerequisite for long-duration operations stability testing.
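For reference, the struct_mutex trylock idea mentioned earlier could look
roughly like the sketch below. This is a userspace analogue using
pthreads, not driver code: trylock_with_timeout(), capture_error_state()
and capture_count are hypothetical names made up for illustration, and
the 1 ms poll interval is an arbitrary assumption. In the driver the same
pattern would presumably use mutex_trylock() with a short sleep between
attempts.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: poll the lock with trylock for up to timeout_ms.
 * Returns true if the lock was taken (the caller must unlock it),
 * false if the timeout expired while somebody else still held it.
 */
static bool trylock_with_timeout(pthread_mutex_t *lock,
				 unsigned int timeout_ms)
{
	/* Assumed poll interval of 1 ms between attempts. */
	const struct timespec poll_interval = { .tv_sec = 0,
						.tv_nsec = 1000000 };
	unsigned int waited_ms = 0;

	assert(lock != NULL);

	for (;;) {
		if (pthread_mutex_trylock(lock) == 0)
			return true;
		if (waited_ms >= timeout_ms)
			return false;
		nanosleep(&poll_interval, NULL);
		waited_ms++;
	}
}

/* Hypothetical stand-in for the capture; only counts invocations. */
static int capture_count;

/*
 * Capture with best-effort locking: the capture always runs, but it
 * runs under the lock whenever the lock could be taken in time.
 * Returns true if the capture was performed with the lock held.
 */
static bool capture_error_state(pthread_mutex_t *state_lock,
				unsigned int timeout_ms)
{
	bool locked = trylock_with_timeout(state_lock, timeout_ms);

	capture_count++;	/* the actual state capture would go here */

	if (locked)
		pthread_mutex_unlock(state_lock);
	return locked;
}
```

The capture is never skipped in this sketch: if the lock holder is itself
hung, waiting indefinitely would only turn the error handler into a
second hang, so the timeout path falls back to the unsynchronized capture
that exists today.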
Without a fix for this ring space problem, the ring space is terminally
depleted within the first iterations of the hang test, simply because the
ring space is miscalculated following every GPU hang recovery and
traversal of the GEM init hw path, gradually leading to a terminally hung
state.

Tomas Elf (8):
  drm/i915: Early exit from semaphore_waits_for for execlist mode.
  drm/i915: Migrate to safe iterators in error state capture
  drm/i915: Cope with request list state change during error state capture
  drm/i915: NULL checking when capturing buffer objects during error state capture
  drm/i915: vma NULL pointer check
  drm/i915: Use safe list iterators
  drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  drm/i915: NULL check of unpin_work

 drivers/gpu/drm/i915/i915_gem.c       | 18 ++++++++---
 drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++++++------------
 drivers/gpu/drm/i915/i915_irq.c       | 20 ++++++++++++
 drivers/gpu/drm/i915/intel_display.c  |  5 +++
 4 files changed, 80 insertions(+), 24 deletions(-)

-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx