On 2021-08-02 11:06 a.m., Daniel Vetter wrote:
> On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote:
>> On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
>>> On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
>>>> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
>>>>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
>>>>>> Separating the OUT_FENCE signalling from pageflip completion allows
>>>>>> a Guest compositor to start a new repaint cycle with a new buffer
>>>>>> instead of waiting for the old buffer to be free.
>>>>>>
>>>>>> This work is based on the idea/suggestion from Simon and Pekka.
>>>>>>
>>>>>> This capability can be a solution for this issue:
>>>>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>>>>>
>>>>>> Corresponding Weston MR:
>>>>>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
>>>>>
>>>>> Uh I kinda wanted to discuss this a bit more before we jump into
>>>>> typing code, but well I guess not that much work yet.
>>>>>
>>>>> So maybe I'm not understanding the problem, but I think the
>>>>> fundamental underlying issue is that with KMS you can have at most
>>>>> 2 buffers in-flight, due to our queue depth limit of 1 pending flip.
>>>>>
>>>>> Unfortunately that means for virtual hw, where it takes a few more
>>>>> steps/vblanks until the framebuffer actually shows up on screen and
>>>>> is scanned out, we suffer deeply. The usual fix for that is to trade
>>>>> latency for throughput and have more buffers in-flight, which is
>>>>> what this patch tries to do.
>>>>
>>>> Per
>>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
>>>> IMO the underlying issue is actually that the guest compositor
>>>> repaint cycle is not aligned with the host compositor one. If they
>>>> were aligned, the problem would not occur even without allowing
>>>> multiple page flips in flight, and latency would be lower.
>>>
>>> Yeah my proposal here is under the premise that we do actually need
>>> to fix this with a deeper queue depth.
>>>
>>>>> Now I think where we go wrong here is that we're trying to hack
>>>>> this up by defining different semantics for the out-fence and for
>>>>> the drm-event. Imo that's wrong, they're both meant to show exactly
>>>>> the same thing:
>>>>> - when is the new frame actually visible to the user (as in,
>>>>>   eyeballs in a human head, preferably, not the time when we've
>>>>>   handed the buffer off to the virtual hw)
>>>>> - when is the previous buffer no longer being used by the scanout hw
>>>>>
>>>>> We do cheat a bit right now insofar as we assume they're both the
>>>>> same, as in, panel-side latency is currently the compositor's
>>>>> problem to figure out.
>>>>>
>>>>> So for virtual hw I think the timestamp and event completion really
>>>>> need to happen only when the buffer has been pushed through the
>>>>> entire virtualization chain, i.e. ideally we get the timestamp from
>>>>> the kms driver on the host side. Currently that's not done, so this
>>>>> is most likely quite broken already (virtio relies on the no-vblank
>>>>> auto event sending, which definitely doesn't wait for anything, or
>>>>> I'm completely missing something).
>>>>>
>>>>> I think instead of hacking up some ill-defined 1.5 queue depth
>>>>> support, what we should do is support queue depth > 1 properly. So:
>>>>>
>>>>> - Change atomic to support queue depth > 1; this needs to be a
>>>>>   per-driver thing due to a bunch of issues in driver code.
>>>>>   Essentially drivers must never look at obj->state pointers, and
>>>>>   only ever look up state through the passed-in drm_atomic_state *
>>>>>   update container.
>>>>>
>>>>> - Aside: virtio should lose all its empty hooks, there's no point
>>>>>   in them.
>>>>>
>>>>> - We fix virtio to send out the completion event at the end of this
>>>>>   entire pipeline, i.e. virtio code needs to take care of sending
>>>>>   out the crtc_state->event correctly.
>>>>>
>>>>> - We probably also want some kind of (maybe per-crtc) recommended
>>>>>   queue depth property so compositors know how many buffers to keep
>>>>>   in flight. Not sure about that.
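To make the first point concrete, here is a rough sketch of the state
lookup pattern it implies (hypothetical driver code, the "my_" names
are made up; only the drm_atomic_state iteration helper is real):

  #include <drm/drm_atomic.h>

  /* Hypothetical commit-tail function, just to show the lookup
   * pattern for queue depth > 1. */
  static void my_atomic_commit_tail(struct drm_atomic_state *state)
  {
          struct drm_crtc *crtc;
          struct drm_crtc_state *old_crtc_state, *new_crtc_state;
          int i;

          for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state,
                                        new_crtc_state, i) {
                  /* Program the hw from new_crtc_state here. Reading
                   * crtc->state instead would be a bug with more than
                   * one commit in flight: it may already point at a
                   * later, still-pending update. */
          }
  }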
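And for the third point, a hedged sketch of what deferring the
completion event in virtio might look like; host_flip_done() and the
pending_event field are assumptions for illustration, while
drm_crtc_send_vblank_event() and the event_lock requirement are real:

  #include <drm/drm_vblank.h>
  /* plus virtio-gpu's internal virtgpu_drv.h for virtio_gpu_output */

  /* Hypothetical: crtc_state->event was stashed at commit time, and
   * is completed only when the host signals that the flip actually
   * reached the physical display. */
  static void host_flip_done(struct virtio_gpu_output *output)
  {
          struct drm_crtc *crtc = &output->crtc;
          struct drm_pending_vblank_event *event;
          unsigned long flags;

          spin_lock_irqsave(&crtc->dev->event_lock, flags);
          event = output->pending_event; /* saved crtc_state->event */
          output->pending_event = NULL;
          if (event)
                  drm_crtc_send_vblank_event(crtc, event);
          spin_unlock_irqrestore(&crtc->dev->event_lock, flags);
  }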
>>>> I'd say there would definitely need to be some kind of signal for
>>>> the display server that it should queue multiple flips, since this
>>>> is normally not desirable for latency. In other words, this wouldn't
>>>> really be useful on bare metal (in contrast to the ability to
>>>> replace a pending flip with a newer one).
>>>
>>> Hm I was thinking that the compositor can tune this. If the
>>> round-trip latency (as measured by events) is too long to get full
>>> refresh rate, it can add more buffers to the queue. That's kinda why
>>> I think the returned event really must be accurate wrt actual display
>>> time (and old buffer release time), so that this computation in the
>>> compositor becomes a pretty simple
>>>
>>>   num_buffers = (flip_time - submit_time) / frame_time
>>>
>>> with maybe some rounding up and averaging. You can also hit this when
>>> your 3d engine has an extremely deep pipeline (like some of the
>>> tiling renderers have), where rendering just takes forever, but as
>>> long as you keep 2 frames in the renderer in-flight you can achieve
>>> full refresh rate (at a latency cost).
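As a rough userspace sketch of that computation with the rounding up
(hypothetical helper, not Weston code):

  #include <stdint.h>

  /* Hypothetical compositor-side helper; a real implementation would
   * average the measured latency over several frames before resizing
   * its buffer queue. */
  static unsigned int needed_buffers(int64_t submit_time_ns,
                                     int64_t flip_time_ns,
                                     int64_t frame_time_ns)
  {
          int64_t latency_ns = flip_time_ns - submit_time_ns;

          /* Round up: even a fraction of a frame of extra latency
           * requires one more buffer to sustain full refresh rate. */
          return (latency_ns + frame_time_ns - 1) / frame_time_ns;
  }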
>> As long as a page flip submitted after vblank N can complete during
>> vblank N+1, full frame rate can be sustained[0]. User space can use
>> as many buffers as needed to keep the rendering pipeline busy.
>>
>> [0] This is broken by the mis-aligned compositor repaint cycles: The
>> flip from the guest compositor misses the host compositor's cycle, so
>> it takes more than one display refresh cycle to complete.
>>
>>
>>> So kernel can't really tell you in all cases how many buffers you
>>> should have.
>>
>> That's not exactly what I mean. Right now, KMS user space has to wait
>> for a flip to complete before submitting another one, or it gets
>> EBUSY. So if the kernel wants to allow multiple flips to be submitted,
>> it has to somehow tell user space that this is possible, or it'll
>> never happen. And the kernel should never advertise this for bare
>> metal, since it's not needed there (and undesirable).
>
> Oh the existence of the deep queue needs a getcap ofc.
>
> Also, deep queues do exist in hw (including scheduling to the right
> frame); the benefit is reduced battery usage for e.g. video playback
> if you do the rendering for an entire set of frames and then just let
> the display show them.
>
> It costs latency ofc (or we need a cancellable queue, which once it's
> in hw is tricky).

And if it's not in HW, it can be handled by the user-space display
server instead of the kernel (except for cancelling pending flips).

(Note that this is currently not possible either way with Wayland,
since it uses mailbox semantics; there are proposals for a Wayland
extension which allows queuing multiple frames though. Meanwhile, this
would need to be handled in the Wayland clients.)


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |          Mesa and X developer