On Wed, 13 Mar 2024 15:45:47 +0100
Xaver Hugl <xaver.hugl@xxxxxxx> wrote:

> Hi all,
>
> This was already discussed on IRC, but I think this should be on the
> mailing list as well and get some more official conclusion that's
> written down somewhere.
>
> Recently I've experienced a GPU reset, which the system successfully
> recovered from, but the display was still stuck - because amdgpu hit a
> pageflip timeout, which causes the compositor to wait for a pageflip
> event that will never come. Some other experiments I did before showed
> that even if the compositor tries submitting new atomic commits after
> a timeout, those commits are rejected with EBUSY, presumably because
> the timed out pageflip is still considered "pending" on the kernel
> side.
>
> After restarting the compositor, everything continued to work
> correctly, so this state can be recovered from. Because of that I
> think it would be useful for the kernel to act on pageflip timeouts
> differently. It should
> - signal the pageflip's completion to userspace
> - maybe have a new event for "pageflip failed" to give userspace more
> correct information in the future
> - allow new commits to happen afterwards
>
> Another case discussed was when the device is completely removed.
> Right now, if a pageflip is pending when that happens, userspace never
> gets the event for pageflip completion, just like with the GPU reset.
> KWin ignores pending pageflips on hotunplug, because the device is
> removed it's not a big issue, but uAPI wise I would expect a pageflip
> event to arrive for all commits that request them, no matter what -
> and if that is not possible or desirable, uAPI has to be changed, for
> example by introducing the mentioned "pageflip failed" event.

I agree.

From my point of view, after some serious failure in hardware or
driver, the main question is:

Can already open device fds continue to be used, or not?

If the intention is that they can continue to be used, then a page flip
event must eventually be delivered if one was expected under normal
circumstances. Otherwise userspace cannot continue. Or, if userspace is
supposed to employ its own timeout for waiting for the event, then that
is new-ish UAPI, and the device must stop returning EBUSY for new
commits.

If the intention is that open device fds have become unusable, then the
kernel should follow the same policy as for hot-unplug, which is
documented at
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug
Specifically, EBUSY is an inappropriate error to return in that case.
Following that policy includes sending the udev event for device
removal, and everything that implies. The hardware can then come back
as a new device.

The case at hand sounds like a driver bug to me.


Thanks,
pq
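
The "userspace employs its own timeout" option mentioned above could look
roughly like the sketch below. It is only an illustration under stated
assumptions: the helper name wait_for_flip_event and the 0.1 second timeout
are made up for this example, and a pipe stands in for the DRM device fd so
that the snippet runs without any hardware. It relies only on the fact that
a DRM fd becomes readable when an event is queued.

```python
import os
import select


def wait_for_flip_event(fd, timeout_s):
    """Return True if fd becomes readable within timeout_s seconds, else False.

    Hypothetical helper: in a compositor, fd would be the DRM device fd.
    """
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return bool(readable)


if __name__ == "__main__":
    # Stand-in for a DRM device fd: the read end of a pipe becomes readable
    # once something is written to the write end.
    read_end, write_end = os.pipe()

    # No event queued: the wait gives up after the timeout instead of
    # blocking forever on an event that never comes.
    print("event arrived:", wait_for_flip_event(read_end, 0.1))

    # An event is queued: the wait returns as soon as the fd is readable.
    os.write(write_end, b"\0")
    print("event arrived:", wait_for_flip_event(read_end, 0.1))

    os.close(read_end)
    os.close(write_end)
```

In a real compositor the readable fd would then be handed to libdrm's
drmHandleEvent() to dispatch the page flip handler. A timeout like this
only helps if the kernel accepts new commits afterwards instead of
returning EBUSY, which is exactly the part that would be new-ish UAPI.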