On Tue, Aug 17, 2021 at 11:47:53AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> > When unblocking a context, do not enable scheduling if the context is
> > banned, guc_id invalid, or not registered.
> >
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index c3b7bf7319dd..353899634fa8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> > +		     context_guc_id_invalid(ce) ||
> > +		     !lrc_desc_registered(guc, ce->guc_id) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
>
> I think this entire if condition here is screaming that our intel_context
> state machinery for guc is way too complex, and on the wrong side of
> incomprehensible.
>
> Also some of these check state outside of the context, and we don't seem
> to hold spinlocks for those, or anything else.
>
> In general I have no idea which of these are defensive programming and
> cannot ever happen, and which actually can happen. There are for sure way
> too many races going on given that this is all context-local stuff.

Races here meaning that we seem to be dropping locks while the context is
in an inconsistent state, which then means that every other code path
touching contexts needs to check whether the context is in an inconsistent
state.
This is a bit of an example of protecting code vs protecting
datastructures. Protecting code means having bits of
intermediate/transitional state leak outside of the locked section (like
context_blocked), so that every other piece of code must be aware of the
transition and not screw things up worse when it races. This means your
review and validation effort scales O(N^2) with the amount of code and
features you have. Which doesn't work.

Datastructure or object-oriented locking design goes differently:

1. You figure out what the invariants of your datastructure are, i.e. what
   should hold after each state transition is finished. I have no idea what
   the solution is for all of them here, but e.g. why is context_blocked
   even visible to other threads? The usual approach is a) take the lock,
   b) do whatever is necessary (we're talking about reset stuff here, so
   performance really doesn't matter), c) unlock. I know that i915-gem is
   full of these leaky counting things, but that's really not a good
   design.

2. Next up, for every piece of state you think about how it's protected
   with a per-object lock. The fewer locks you have (but still per-object,
   so it's not becoming a mess for different reasons), the higher the
   chances that you don't leak inconsistent state to other threads. This is
   a bit tricky when multiple objects are involved, or if you have to split
   your locks for a single object because some of it needs to be accessed
   from irq context (like a tasklet).

3. Document your rules in kerneldoc, so that when new code gets added you
   don't have to review everything for consistency against the rules.

This way you get overall O(N) effort for validation and review, because
all you have to do is check every function that changes state against the
overall contract, and not everything against everything else.

If you have a pile of if checks every time you grab a lock, your locking
design has too much state that leaks outside of the locked sections.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch