Quoting Mika Kuoppala (2017-10-13 15:23:41)
> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
>
> > Since the removal of the stop_machine(), it is allowed and expected for
> > the nop_submit_request() and nop_complete_submit_request() to run in
> > parallel to the i915_gem_set_wedged() processing. As such we can no
> > longer assert that i915_gem_set_wedged() has completed inside the
> > stop_machine prior to the individual nop_submit_request execution.
> >
> > Fixes: af7a8ffad9c5 ("drm/i915: Use rcu instead of stop_machine in set_wedged")
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Daniel Vetter <daniel.vetter@xxxxxxxxx>
> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
>
> from irc:
> 17:12 < danvet> r-b: me
>
> also,
>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

Just a note to say we could move the set_bit(WEDGED) first (followed by
an smp_mb__after_atomic()) so that the concurrent nop_submit_request
would see the right bit. However, moving that bit requires a bit more
thought wrt all the users and what it means for i915_gem_set_wedged().
Simpler just to remove the incorrect BUG_ON for now and address it again
in the future.

The other challenge is hitting this race in testing. We have to
coordinate the requests becoming ready in parallel to the failed reset.
Something like queueing 100,000 requests and signaling them at intervals
of a couple of microseconds would do the trick. That in itself is not
too difficult; the remaining challenge will be in coordinating that with
the reset -- so using hangcheck is out the window and we must trigger
the EIO directly. Sounds easy!

Thanks for the review,
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx