Hi Matt, > > While discussing with Nirmoy offline about this other way for > > fixing lock contention, he was a bit sceptical about it. > > > > But why not? We know that if we fall into this case it's because > > some hardware component has forgotten to release the lock within > > 100ms. So that we have two possibilities, either bail out or > > force the unlock. > > > > Forcing the unlock might not be respectful to the environment, > > but, at the end, i915 should have the highest priority. > > > > Nirmoy's solution here[*] is to force the unlock during gt > > resume, but what happens if meantime the hardware takes the lock > > and doesn't release it? > > > > Open for opinions or profligate rejections :-) > > > > I'm also curious to see what CI has to say about. > > > > [*] https://patchwork.freedesktop.org/series/124397/ > > > > As far as I can tell, this patch doesn't really do anything beneficial > that I can see. We already unlock and proceed today if we hit a lock > timeout: > > - intel_gt_mcr_lock > - attempt to get lock > - timeout, warn, add CI taint > - perform MCR register access even if the lock failed > - intel_gt_mcr_unlock > - lock is released regardless of whether we obtained it successfully > at the beginning, or whether someone else was still holding it > > With your patch, it looks like you're just adding an extra > unlock/reacquire step before we move on which I don't think accomplishes > anything. If someone else forgot to release the lock, then we're still > protected from other agents, and we'll take care of releasing it > ourselves once we're done. If the other agent actually is still using > the lock and they're just going slower than we expected, then when they > finally finish they're just going to blindly unlock; if we're in the > middle of our critical section at that point, they'll release our lock > the same way we released theirs. The main change here is that when we > hit a timeout, your patch is giving other outside agents a chance to sneak in > and re-grab the lock, further delaying our KMD acquisition. > > The real-world IFWI problems we saw, which Nirmoy's series is working > around, is that some boot-time agent simply forgot to ever release the > lock, leaving it locked "forever" so it makes sense to sanitize it > initially. Load/resume is the only time when it's actually "safe" to > reset/sanitize the lock like that. If we're getting MCR timeouts during > regular driver operation (i.e., not during the beginning of driver load > or resume), then it either means our timeout values are too quick (i.e., > we're not giving external agents sufficient time to run their critical > sections), or some piece of system firmware (e.g., pcode) has completely > died in the middle of its critical section. In the former case, we > probably need to adjust our timeout amount (and possibly work with those > firmware teams to see if they can reduce the size of their critical > sections). In the latter case, the system is going to be so badly > broken that it doesn't really matter what we do; we're just not going to > have a functioning system anymore at that point and it's not something > the graphics driver has a way of recovering from. Makes sense... thanks! Andi