Re: [PATCH] drm/i915: Stop asserting on set-wedged vs nop_submit_request ordering

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Fri, 13 Oct 2017 15:40:14 +0100

Quoting Mika Kuoppala (2017-10-13 15:23:41)
> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
> 
> > Since the removal of the stop_machine(), it is allowed and expected for
> > the nop_submit_request() and nop_complete_submit_request() to run in
> > parallel to the i915_gem_set_wedged() processing. As such we can no
> > longer assert that i915_gem_set_wedged() has completed inside the
> > stop_machine prior to the individual nop_submit_request execution.
> >
> > Fixes: af7a8ffad9c5 ("drm/i915: Use rcu instead of stop_machine in set_wedged")
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Daniel Vetter <daniel.vetter@xxxxxxxxx>
> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
> 
> from irc:
> 17:12 < danvet> r-b: me
> 
> also,
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

Just a note to say we could move the set_bit(WEDGED) first (followed by
an smp_mb__after_atomic()) so that the concurrent nop_submit_request
would see the right bit. However, moving that bit requires a bit more
thought with respect to all the users and what it means for
i915_gem_set_wedged(). It is simpler just to remove the incorrect BUG_ON
for now and address this again in the future.
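
To illustrate the ordering idea, here is a minimal userspace sketch in C11
atomics, not the i915 code. Every name in it (sketch_device, SKETCH_WEDGED,
sketch_set_wedged, and so on) is hypothetical, and the seq_cst fence only
stands in for smp_mb__after_atomic(). The assumption is that the flag is
published before the submit callback is swapped, so a caller that reaches
the nop callback also observes the flag.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_WEDGED 0x1u /* hypothetical stand-in for the WEDGED bit */
#define SKETCH_EIO (-5)    /* value of -EIO on Linux */

struct sketch_device;
struct sketch_request { int error; }; /* hypothetical, not the i915 request */

typedef void (*submit_fn)(struct sketch_device *, struct sketch_request *);

struct sketch_device {
	atomic_uint flags;
	_Atomic(submit_fn) submit; /* swapped to the nop variant on wedging */
};

void real_submit(struct sketch_device *dev, struct sketch_request *rq)
{
	(void)dev;
	rq->error = 0;
}

void nop_submit(struct sketch_device *dev, struct sketch_request *rq)
{
	/* Safe only because the flag is set before this callback is installed. */
	assert(atomic_load(&dev->flags) & SKETCH_WEDGED);
	rq->error = SKETCH_EIO;
}

void sketch_set_wedged(struct sketch_device *dev)
{
	/* 1. Publish the wedged bit first. */
	atomic_fetch_or_explicit(&dev->flags, SKETCH_WEDGED,
				 memory_order_relaxed);
	/* 2. Full barrier, the userspace analogue of smp_mb__after_atomic(). */
	atomic_thread_fence(memory_order_seq_cst);
	/* 3. Only now redirect submission to the nop callback. */
	atomic_store_explicit(&dev->submit, nop_submit, memory_order_release);
}

void sketch_submit(struct sketch_device *dev, struct sketch_request *rq)
{
	/* The acquire load pairs with the release store in sketch_set_wedged(). */
	submit_fn fn = atomic_load_explicit(&dev->submit, memory_order_acquire);

	fn(dev, rq);
}
```

This says nothing about the other readers of the bit, which is the part
that needs more thought in the real driver.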

The other challenge is hitting this race in testing. We have to
coordinate the requests becoming ready in parallel to the failed
reset. Something like queueing 100,000 requests and signaling them at
intervals of a couple of microseconds would do the trick. That in itself
is not too difficult; the remaining challenge will be in coordinating
that with the reset -- so using hangcheck is out the window and we must
trigger the EIO directly. Sounds easy!
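
As a rough picture of the shape such a test might take, here is a userspace
simulation using pthreads. It does not touch the driver: race_state,
signal_requests, trigger_wedge and run_race are all hypothetical names, the
"wedge" is just an atomic flag set directly by a second thread, and the
request count and interval are parameters. It is kept small because
nanosleep() usually sleeps much longer than a couple of microseconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

struct race_state {
	atomic_int wedged; /* set directly, no hangcheck involved */
	int n_requests;    /* how many requests to signal */
	long interval_ns;  /* requested gap between signals */
	int n_ok;          /* requests seen before the wedge */
	int n_eio;         /* requests that would complete with -EIO */
	int ok_after_eio;  /* must stay 0: wedging is one-way */
};

static void sleep_ns(long ns)
{
	struct timespec ts = {
		.tv_sec = ns / 1000000000L,
		.tv_nsec = ns % 1000000000L,
	};

	nanosleep(&ts, NULL);
}

/* Signal the queued requests one by one at a fixed interval. */
static void *signal_requests(void *arg)
{
	struct race_state *s = arg;

	for (int i = 0; i < s->n_requests; i++) {
		if (atomic_load(&s->wedged)) {
			s->n_eio++;
		} else {
			if (s->n_eio)
				s->ok_after_eio++;
			s->n_ok++;
		}
		sleep_ns(s->interval_ns);
	}
	return NULL;
}

/* Wedge the simulated device while requests are still being signaled. */
static void *trigger_wedge(void *arg)
{
	struct race_state *s = arg;

	sleep_ns(s->interval_ns * (s->n_requests / 2));
	atomic_store(&s->wedged, 1);
	return NULL;
}

/* Run both threads to completion; returns 0 on success, -1 on error. */
int run_race(struct race_state *s)
{
	pthread_t submitter, wedger;

	if (pthread_create(&submitter, NULL, signal_requests, s))
		return -1;
	if (pthread_create(&wedger, NULL, trigger_wedge, s)) {
		pthread_join(submitter, NULL);
		return -1;
	}
	pthread_join(submitter, NULL);
	pthread_join(wedger, NULL);
	return 0;
}
```

Where the wedge lands relative to the requests varies from run to run,
which is the point. The only invariants are that every request is accounted
for and that none succeeds after the first -EIO.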

Thanks for the review,
-Chris