Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture

On 13/10/2015 12:39, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote:
>> On 09/10/2015 08:48, Chris Wilson wrote:
>>> On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
>>>> Since we're not synchronizing the ring request list during error state
>>>> capture, the request list state might change between the time the
>>>> corresponding error request list was allocated and dimensioned and the
>>>> time when the ring request list is actually captured into the error
>>>> state. If this happens, emit a WARNING, exit early, and note that the
>>>> captured error state might not be fully reliable.
>>>>
>>>> Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
>>>>   1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>> index 32c1799..cc75ca4 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
>>>>  		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>>>>  			struct drm_i915_error_request *erq;
>>>>
>>>> +			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {

>>> Request cannot be NULL, count can legitimately be more; the WARN_ON is
>>> inappropriate. Again, I sent several patches over the past couple of
>>> years to fix this.
>>> -Chris


>> Ok, so having the driver request list grow between the point where we
>> allocate and dimension the error state request list to match the driver
>> request list (since that's what count exceeding the list size implies)
>> and the point where we actually capture it is legitimate? Traversing
>> into unallocated memory seems pretty dodgy to me, but if you say so.

> We still need to handle it ofc, but just not WARN on this condition since
> it can happen.
> -Daniel
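
[For reference, a minimal sketch of the quiet early exit Daniel describes.
This is hedged: the loop shape follows the i915_gem_record_rings() hunk
quoted above, and the erq field assignments are assumptions based on the
surrounding code of that era, not part of this patch.]

	/* Sketch only: capture at most the pre-allocated number of
	 * requests and stop quietly if the driver request list has
	 * grown in the meantime, instead of WARNing on a condition
	 * that can legitimately occur.
	 */
	list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
		struct drm_i915_error_request *erq;

		if (count >= error->ring[i].num_requests)
			break;	/* list grew under us; keep what we have */

		erq = &error->ring[i].requests[count++];
		erq->seqno = request->seqno;	/* assumed field names */
		erq->jiffies = request->emitted_jiffies;
		erq->tail = request->postfix;
	}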


With the RCU discussion ongoing, I guess we should just drop this patch? I
agree that what I've been seeing looks like a side effect of concurrent
memory deallocation, and whatever solution you reach should make this patch
pointless.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx



