Re: [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it

Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx> · Thu, 28 Sep 2023 18:47:00 +0200

Hi Matt,

> > While discussing with Nirmoy offline about this other way for
> > fixing lock contention, he was a bit sceptical about it.
> > 
> > But why not? We know that if we fall into this case it's because
> > some hardware component has forgotten to release the lock within
> > 100ms. So that we have two possibilities, either bail out or
> > force the unlock.
> > 
> > Forcing the unlock might not be respectful to the environment,
> > but, at the end, i915 should have the highest priority.
> > 
> > Nirmoy's solution here[*] is to force the unlock during gt
> > resume, but what happens if meantime the hardware takes the lock
> > and doesn't release it?
> > 
> > Open for opinions or profligate rejections :-)
> > 
> > I'm also curious to see what CI has to say about.
> > 
> > [*] https://patchwork.freedesktop.org/series/124397/
> > 
> 
> As far as I can tell, this patch doesn't really do anything beneficial
> that I can see.  We already unlock and proceed today if we hit a lock
> timeout:
> 
>  - intel_gt_mcr_lock
>    - attempt to get lock
>    - timeout, warn, add CI taint
>  - perform MCR register access even if the lock failed
>  - intel_gt_mcr_unlock
>    - lock is released regardless of whether we obtained it successfully
>      at the beginning, or whether someone else was still holding it
> 
> With your patch, it looks like you're just adding an extra
> unlock/reacquire step before we move on which I don't think accomplishes
> anything.  If someone else forgot to release the lock, then we're still
> protected from other agents, and we'll take care of releasing it
> ourselves once we're done.  If the other agent actually is still using
> the lock and they're just going slower than we expected, then when they
> finally finish they're just going to blindly unlock; if we're in the
> middle of our critical section at that point, they'll release our lock
> the same way we released theirs.  The main change here is that when we
> hit a timeout, your patch is giving other outside agents a chance to sneak in
> and re-grab the lock, further delaying our KMD acquisition.
> 
> The real-world IFWI problems we saw, which Nirmoy's series is working
> around, is that some boot-time agent simply forgot to ever release the
> lock, leaving it locked "forever" so it makes sense to sanitize it
> initially.  Load/resume is the only time when it's actually "safe" to
> reset/sanitize the lock like that.  If we're getting MCR timeouts during
> regular driver operation (i.e., not during the beginning of driver load
> or resume), then it either means our timeout values are too quick (i.e.,
> we're not giving external agents sufficient time to run their critical
> sections), or some piece of system firmware (e.g., pcode) has completely
> died in the middle of its critical section.  In the former case, we
> probably need to adjust our timeout amount (and possibly work with those
> firmware teams to see if they can reduce the size of their critical
> sections).  In the latter case, the system is going to be so badly
> broken that it doesn't really matter what we do; we're just not going to
> have a functioning system anymore at that point and it's not something
> the graphics driver has a way of recovering from.

Makes sense... thanks!

Andi