On Mon, Aug 09, 2021 at 03:35:26PM +0200, Daniel Vetter wrote:
> On Sun, Aug 08, 2021 at 11:07:55AM -0700, Matthew Brost wrote:
> > Resets are notoriously hard to get fully working and notoriously racey,
> > especially with selftests / IGTs that do all sorts of wild things that
> > would be near impossible to hit during normal use cases. Even though
> > likely impossible to hit, anything selftests / IGTs uncover needs to be
> > fixed. This patch addresses 7 such issues.
> > 
> > 1. A small race that could result in incorrect accounting of the number
> > of outstanding G2H. Basically, prior to this patch we did not increment
> > the number of outstanding G2H if we encountered a GT reset while sending
> > a H2G. This was incorrect as the context state had already been updated
> > to anticipate a G2H response, thus the counter should be incremented.
> > 
> > 2. When unwinding requests on a reset context, if other requests in the
> > context are in the priority list the requests could be resubmitted out
> > of seqno order. Traverse the list of active requests in reverse and
> > append to the head of the priority list to fix this.
> > 
> > 3. Don't drop ce->guc_active.lock when unwinding a context after reset.
> > At one point we had to drop this because of a lock inversion but that is
> > no longer the case. It is much safer to hold the lock, so let's do that.
> > 
> > 4. Prior to this patch the blocked context counter was cleared on
> > init_sched_state (used during registering a context & resets), which is
> > incorrect. This state needs to be persistent or the counter can read an
> > incorrect value.
> > 
> > 5. Flush the work queue for GuC generated G2H messages during a GT reset.
> > 
> > 6. Do not clear enable during a context reset if a schedule enable is in
> > flight.
> > 
> > 7. When unblocking a context, do not enable scheduling if the context is
> > banned.
> 
> I think each of the above should be a separate patch. I think it would
> also be good if each fix references the commits that introduced/changed
> something.
> 

Sure, I was just trying to cheat and make our lives easier with fewer
patches to backport into DII.

> Most of this stuff is extremely hard to get right, and unfortunately our
> current code is way too fond of lockless trickery (which really isn't a
> great idea in the reset code). We need to apply as much care as possible
> here.
> 

Yep, resets are hard. They are hard because ten or so other async things
(e.g. a new submission, registering a context, banning a context,
canceling a request, processing a G2H, trying to idle the GPU, unpinning
a context) can all be happening at the same time. Hopefully once we move
to the DRM scheduler we can remove some of these async operations;
perma-pinned contexts would also help. We have a story for that + a story
to simplify the locking.

> Also expect me to ask a lot of annoying questions about all the atomic_t
> you touch :-)

Looking forward to it.
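
To make the ordering argument in #1 a bit more concrete, here is a rough
standalone sketch of the intent (plain userspace C; guc_send_h2g(),
send_with_accounting() and outstanding_g2h are made-up names, this is not
the driver code): bump the counter before the send is issued, and only
roll it back when the H2G definitely never went out.

#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int outstanding_g2h;

/* Stand-in for intel_guc_send_busy_loop(); always "succeeds" here. */
static int guc_send_h2g(const unsigned int *action, unsigned int len)
{
        (void)action;
        (void)len;
        return 0;
}

static int send_with_accounting(const unsigned int *action, unsigned int len,
                                unsigned int g2h_len_dw)
{
        int err;

        /*
         * Account for the expected G2H *before* issuing the H2G. If a GT
         * reset lands while the send is in flight, the context state has
         * already been updated to anticipate a response, so the counter
         * has to reflect that too.
         */
        if (g2h_len_dw)
                atomic_fetch_add(&outstanding_g2h, 1);

        err = guc_send_h2g(action, len);

        /* Only roll back when the H2G definitely never went out. */
        if (err == -EBUSY && g2h_len_dw)
                atomic_fetch_sub(&outstanding_g2h, 1);

        return err;
}

int main(void)
{
        unsigned int action[] = { 0x1 };

        send_with_accounting(action, 1, 1);
        printf("outstanding G2H: %d\n", atomic_load(&outstanding_g2h));
        return 0;
}
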
Matt

> -Daniel
> 
> > 
> > Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++-------
> >  1 file changed, 27 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 87d8dc8f51b9..cd8df078ca87 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
> >  {
> >          /* Only should be called from guc_lrc_desc_pin() */
> >          atomic_set(&ce->guc_sched_state_no_lock, 0);
> > -        ce->guc_state.sched_state = 0;
> > +        ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> >  }
> >  
> >  static inline bool
> > @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
> >  {
> >          int err;
> >  
> > -        err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > -
> > -        if (!err && g2h_len_dw)
> > +        if (g2h_len_dw)
> >                  atomic_inc(&guc->outstanding_submission_g2h);
> >  
> > +        err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > +        if (err == -EBUSY && g2h_len_dw)
> > +                atomic_dec(&guc->outstanding_submission_g2h);
> > +
> >          return err;
> >  }
> >  
> > @@ -725,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >                          wait_for_reset(guc, &guc->outstanding_submission_g2h);
> >                  } while (!list_empty(&guc->ct.requests.incoming));
> >          }
> > +
> > +        /* Flush any GuC generated G2H */
> > +        while (!list_empty(&guc->ct.requests.incoming))
> > +                msleep(1);
> > +
> >          scrub_guc_desc_for_outstanding_g2h(guc);
> >  }
> >  
> > @@ -797,14 +804,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  
> >          spin_lock_irqsave(&sched_engine->lock, flags);
> >          spin_lock(&ce->guc_active.lock);
> > -        list_for_each_entry_safe(rq, rn,
> > -                                 &ce->guc_active.requests,
> > -                                 sched.link) {
> > +        list_for_each_entry_safe_reverse(rq, rn,
> > +                                         &ce->guc_active.requests,
> > +                                         sched.link) {
> >                  if (i915_request_completed(rq))
> >                          continue;
> >  
> >                  list_del_init(&rq->sched.link);
> > -                spin_unlock(&ce->guc_active.lock);
> >  
> >                  __i915_request_unsubmit(rq);
> >  
> > @@ -816,10 +822,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >                  }
> >                  GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
> >  
> > -                list_add_tail(&rq->sched.link, pl);
> > +                list_add(&rq->sched.link, pl);
> >                  set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > -
> > -                spin_lock(&ce->guc_active.lock);
> >          }
> >          spin_unlock(&ce->guc_active.lock);
> >          spin_unlock_irqrestore(&sched_engine->lock, flags);
> > @@ -828,17 +832,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >  {
> >          struct i915_request *rq;
> > +        unsigned long flags;
> >          u32 head;
> >  
> >          intel_context_get(ce);
> >  
> >          /*
> > -         * GuC will implicitly mark the context as non-schedulable
> > -         * when it sends the reset notification. Make sure our state
> > -         * reflects this change. The context will be marked enabled
> > -         * on resubmission.
> > +         * GuC will implicitly mark the context as non-schedulable when it
> > +         * sends the reset notification.
> > +         * Make sure our state reflects this change. The context will be
> > +         * marked enabled on resubmission. A small window exists where the
> > +         * context could be blocked & unblocked (scheduling enable) while
> > +         * this reset was in flight. If a scheduling enable is already in
> > +         * flight, do not clear the enable.
> >           */
> > -        clr_context_enabled(ce);
> > +        spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +        if (!context_pending_enable(ce))
> > +                clr_context_enabled(ce);
> > +        spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  
> >          rq = intel_context_find_active_request(ce);
> >          if (!rq) {
> > @@ -1562,6 +1572,7 @@ static void guc_context_unblock(struct intel_context *ce)
> >          spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >          if (unlikely(submission_disabled(guc) ||
> > +                     intel_context_is_banned(ce) ||
> >                       !intel_context_is_pinned(ce) ||
> >                       context_pending_disable(ce) ||
> >                       context_blocked(ce) > 1)) {
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
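
P.S. In case the reverse walk in #2 looks odd: pulling requests off the
active list back-to-front and adding each one at the head of the priority
list leaves them front-to-back in their original (seqno) order, which is
why the resubmission order is preserved. A tiny standalone illustration
(plain C, an array standing in for the lists, nothing i915-specific):

#include <stdio.h>

int main(void)
{
        /* active requests, already in ascending seqno order */
        const int active[] = { 10, 11, 12, 13 };
        int pl[4];      /* priority list, built up front-to-back */
        int n = 0;

        /* walk the active list in reverse, insert each entry at the head */
        for (int i = 3; i >= 0; i--) {
                for (int j = n; j > 0; j--)
                        pl[j] = pl[j - 1];      /* shift to make room */
                pl[0] = active[i];
                n++;
        }

        /* prints 10 11 12 13: resubmission order matches seqno order */
        for (int i = 0; i < n; i++)
                printf("%d ", pl[i]);
        printf("\n");
        return 0;
}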