On 09/10/2015 11:38, Chris Wilson wrote:
On Fri, Oct 09, 2015 at 11:27:48AM +0100, Tomas Elf wrote:
On 09/10/2015 08:41, Chris Wilson wrote:
On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote:
Error state capture is dependent on i915_gem_active_request() and
i915_gem_obj_is_pinned(). Since there is no synchronization between error state
capture and the driver state we at least need to use safe list iterators in
these functions to alleviate the problem of request/vma lists changing during
error state capture.
Does not make it safe.
-Chris
I know it doesn't make it safe, I didn't say this patch makes it safe.
Maybe I was unclear but I chose the word "alleviate" rather than
"solve" to indicate that there are still problems but this change
reduces the problem scope and makes crashes less likely to occur. I
also used the formulation "at least" to indicate that we're not
solving everything but we can do certain things to improve things
somewhat.
The problems I've been seeing has been that the list state changes
during iteration and that the error capture tries to read elements
that are no longer part of the list - not that elements that the
error capture code is dereferencing are deallocated by the driver or
whatever. Using a safe iterator helps with that very particular
problem. Or maybe I guess I've just been incredibly lucky for the
last 2 months when I've been running this code as I've been able to
get 12+ hours of stability during my tests instead of less than one
hour in between crashes that was the case before I introduced these
changes.
You have been incredibily lucky (probably due to how the requests are
being cached now), but that the requests can be modified whilst error
capture runs and oops is well known. Just pretending the list iterator
is safe does nothing.
-Chris
Does nothing except consistently extend meantime between failures from
less than one hour to more than 12 hours during my TDR tests. That's a
lot of luck right there. On a consistent and predictable basis. Also,
I'm not trying to pretend, I thought I communicated clearly that this is
not a solution but rather an improvement that makes certain types of
tests actually possible to run, tests that are needed for the TDR
validation.
But if you have another solution then let's go with that.
Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx