Re: Handling pageflip timeouts

Pekka Paalanen <pekka.paalanen@xxxxxxxxxxxxx> · Wed, 20 Mar 2024 10:41:45 +0200

On Wed, 13 Mar 2024 15:45:47 +0100
Xaver Hugl <xaver.hugl@xxxxxxx> wrote:

> Hi all,
> 
> This was already discussed on IRC, but I think this should be on the
> mailing list as well and get some more official conclusion that's
> written down somewhere.
> 
> Recently I've experienced a GPU reset, which the system successfully
> recovered from, but the display was still stuck - because amdgpu hit a
> pageflip timeout, which causes the compositor to wait for a pageflip
> event that will never come. Some other experiments I did before showed
> that even if the compositor tries submitting new atomic commits after
> a timeout, those commits are rejected with EBUSY, presumably because
> the timed out pageflip is still considered "pending" on the kernel
> side.
> 
> After restarting the compositor, everything continued to work
> correctly, so this state can be recovered from. Because of that I
> think it would be useful for the kernel to act on pageflip timeouts
> differently. It should
> - signal the pageflip's completion to userspace
> - maybe have a new event for "pageflip failed" to give userspace more
> correct information in the future
> - allow new commits to happen afterwards
> 
> Another case discussed was when the device is completely removed.
> Right now, if a pageflip is pending when that happens, userspace never
> gets the event for pageflip completion, just like with the GPU reset.
> KWin ignores pending pageflips on hotunplug, because the device is
> removed it's not a big issue, but uAPI wise I would expect a pageflip
> event to arrive for all commits that request them, no matter what -
> and if that is not possible or desirable, uAPI has to be changed, for
> example by introducing the mentioned "pageflip failed" event.

I agree.

From my point of view, after some serious failure in hardware or
driver, the main question is:

Can already open device fds continue to be used, or not?

If the intention is that they can continue to be used, then a page flip
event must be eventually delivered if one was expected under normal
circumstances. Otherwise userspace cannot continue. Or, if userspace is
supposed to employ its own timeout for waiting for the event, then
that is new-ish UAPI, and the device must stop returning EBUSY for new
commits.

If the intention is that open device fds have become unusable, then the
kernel should follow the same policy as for hot-unplug, which is
documented at
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug

Specifically, EBUSY is an inappropriate error to return in that case.

That policy includes sending the udev event for device removal, and everything
that implies. The hardware can then come back as a new device.
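
As a rough illustration of why the error code matters to userspace,
here is a sketch of how a compositor might act on the result of a
commit. The names classify_commit_result() and enum commit_action are
hypothetical. I am assuming the commit wrapper returns 0 on success and
a negative errno value on failure, and that ENODEV is what a
disappeared device reports, as I read the hot-unplug document linked
above.

```c
#include <errno.h>

/* Hypothetical actions for this sketch; not part of any DRM API. */
enum commit_action {
    COMMIT_OK,          /* commit accepted */
    COMMIT_RETRY_LATER, /* EBUSY: an earlier flip is still pending */
    COMMIT_DEVICE_GONE, /* ENODEV: treat the device as unplugged */
    COMMIT_FAILED       /* any other error */
};

/*
 * Map a commit return value to an action.
 * Assumption: ret is 0 on success or a negative errno value.
 */
enum commit_action classify_commit_result(int ret)
{
    if (ret == 0)
        return COMMIT_OK;

    switch (-ret) {
    case EBUSY:
        /* Only sensible if the pending flip will eventually complete. */
        return COMMIT_RETRY_LATER;
    case ENODEV:
        /* Stop using this fd; wait for a new device to show up. */
        return COMMIT_DEVICE_GONE;
    default:
        return COMMIT_FAILED;
    }
}
```

With this split, EBUSY after a flip that will never complete leaves
userspace retrying forever, which is why it is the wrong answer for an
fd that has become unusable.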

The case at hand sounds like a driver bug to me.

Thanks,
pq