On Tue, Aug 17, 2021 at 11:47:53AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> > When unblocking a context, do not enable scheduling if the context is
> > banned, guc_id invalid, or not registered.
> >
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index c3b7bf7319dd..353899634fa8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> > +		     context_guc_id_invalid(ce) ||
> > +		     !lrc_desc_registered(guc, ce->guc_id) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
>
> I think this entire if condition here is screaming that our intel_context
> state machinery for guc is way too complex, and on the wrong side of
> incomprehensible.
>
> Also some of these check state outside of the context, and we don't seem
> to hold spinlocks for those, or anything else.
>
> In general I have no idea which of these are defensive programming and
> cannot ever happen, and which actually can happen. There are for sure way
> too many races going on given that this is all context-local stuff.

Races here meaning that we seem to be dropping locks while the context is
in an inconsistent state, which then means that every other code path
touching contexts needs to check whether the context is in an inconsistent
state.
This is a bit of an example of protecting code vs protecting
datastructures. Protecting code means having bits of
intermediate/transitional state leak outside of the locked section (like
context_blocked), so that every other piece of code must be aware of the
transition and not screw things up worse when it races. This means your
review and validation effort scales O(N^2) with the amount of code and
features you have. Which doesn't work.

Datastructure or object-oriented locking design goes differently:

1. You figure out what the invariants of your datastructure are, i.e. what
   should hold after each state transition is finished. I have no idea what
   the solution is for all of them here, but e.g. why is context_blocked
   even visible to other threads? The usual approach is a) take the lock,
   b) do whatever is necessary (we're talking about reset stuff here, so
   performance really doesn't matter), c) unlock. I know that i915-gem is
   full of these leaky counting things, but that's really not a good
   design.

2. Next up, for every piece of state you think about how it's protected
   with a per-object lock. The fewer locks you have (but still per-object,
   so it's not becoming a mess for different reasons), the higher the
   chances that you don't leak inconsistent state to other threads. This is
   a bit tricky when multiple objects are involved, or if you have to split
   your locks for a single object because some of it needs to be accessed
   from irq context (like a tasklet).

3. Document your rules in kerneldoc, so that when new code gets added you
   don't have to review everything for consistency against the rules.

This way you get overall O(N) effort for validation and review, because
all you have to do is check every function that changes state against the
overall contract, and not everything against everything else.

If you have a pile of if checks every time you grab a lock, your locking
design has too much state that leaks outside of the locked sections.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch