Hi Daniel, > > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: > > By separating the OUT_FENCE signalling from pageflip completion allows > > a Guest compositor to start a new repaint cycle with a new buffer > > instead of waiting for the old buffer to be free. > > > > This work is based on the idea/suggestion from Simon and Pekka. > > > > This capability can be a solution for this issue: > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > > > > Corresponding Weston MR: > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > Uh I kinda wanted to discuss this a bit more before we jump into typing > code, but well I guess not that much work yet. [Kasireddy, Vivek] Right, it wasn't a lot of work :) > > So maybe I'm not understanding the problem, but I think the fundamental > underlying issue is that with KMS you can have at most 2 buffers > in-flight, due to our queue depth limit of 1 pending flip. [Kasireddy, Vivek] Let me summarize the problem again from the perspective of both the Host (Weston) and Guest (Weston) compositors assuming a refresh-rate of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms. Host compositor: - After a pageflip completion event, it starts its next repaint cycle by waiting for 9 ms and then submits the atomic commit and at the tail end of its cycle sends a frame callback event to all its clients (who registered and submitted frames) indicating to them to start their next redraw -- giving them at-least ~16 ms to submit a new frame to be included in its next repaint. Why a configurable 9 ms delay is needed is explained in Pekka's blog post here: https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html - It'll send a wl_buffer.release event for a client submitted previous buffer only when the client has submitted a new buffer and: a) When it hasn't started its repaint cycle yet OR b) When it clears its old state after it gets a pageflip completion event -- if it had flipped the client's buffer onto a hardware plane. Guest compositor: - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms for the Guest compositor to submit a new atomic commit. This time of 10-12 ms includes the 9 ms wait -- just like the Host compositor -- for its clients to submit new buffers. - When it gets a pageflip completion, it assumes that the previously submitted buffer is free for re-use and uses it again -- resulting in the usage of only 2 out of a maximum of 4 backbuffers included as part of the Mesa GBM surface implementation. Guest KMS/Virtio-gpu/Qemu Wayland UI: - Because no_vblank=true for Guest KMS and since the vblank event (which also serves as the pageflip completion event for user-space) is sent right away after atomic commit, as Gerd said, we use an internal dma-fence to block/wait the Guest KMS until we know for sure that the Host is completely done using the buffer. To ensure this, we signal the dma-fence only after the Host compositor sends a wl_buffer.release event or an equivalent signal. The goal: - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware plane on the Host -- regardless of either compositor's scheduling policy -- without making any copies and ensuring that both Host and Guest are not accessing the buffer at the same time. The problem: - If the Host compositor flips the client's buffer (in this case Guest compositor's buffer) onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate. The solution: - To ensure full framerate, the Guest compositor has to start it's repaint cycle (including the 9 ms wait) when the Host compositor sends the frame callback event to its clients. In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the Guest compositor has to be forced to use a new buffer for its next repaint cycle when it gets a pageflip completion. - The Weston MR I linked above does this by getting an out_fence fd and taking a reference on all the FBs included in the atomic commit forcing the compositor to use new FBs for its next repaint cycle. It releases the references when the out_fence is signalled later when the Host compositor sends a wl_buffer.release event. > > Unfortunately that means for virtual hw where it takes a few more > steps/vblanks until the framebuffer actually shows up on screen and is > scanned out, we suffer deeply. The usual fix for that is to drop the > latency and increase throughput, and have more buffers in-flight. Which > this patch tries to do. > > Now I think where we go wrong here is that we're trying to hack this up by > defining different semantics for the out-fence and for the drm-event. Imo > that's wrong, they're both meant to show eactly the same thing: > - when is the new frame actually visible to the user (as in, eyeballs in a > human head, preferrably, not the time when we've handed the buffer off > to the virtual hw) > - when is the previous buffer no longer being used by the scanout hw [Kasireddy, Vivek] Right, they both mean the same thing but I think using both at the same time would be redundant in the case of Weston. That's why I am trying to repurpose the usage of out_fence in this case by introducing a new capability that may not be relevant for bare-metal KMS drivers but would be useful for virtual KMS drivers. > > We do cheat a bit right now in so far that we assume they're both the > same, as in, panel-side latency is currently the compositor's problem to > figure out. > > So for virtual hw I think the timestamp and even completion really need to > happen only when the buffer has been pushed through the entire > virtualization chain, i.e. ideally we get the timestamp from the kms > driver from the host side. Currently that's not done, so this is most > likely quite broken already (virtio relies on the no-vblank auto event > sending, which definitely doesn't wait for anything, or I'm completely > missing something). [Kasireddy, Vivek] You are right; virtio_gpu does use the no_vblank auto event but as I mentioned above we do use an internal dma-fence to wait until the submitted buffer is no longer used by the Host. In other words, we wait (in update_planes hook) until we get an appropriate signal from the Host to proceed to make sure that we are not rendering faster than what the Host can display. However, as you suggest below, we could set no_vblank=false and send the vblank/ pageflip completion event from the virtio-gpu driver instead of having the DRM core send it. This can prevent the DRM core from signalling the out_fence as well which is my intended objective and what my first patch tries to do. I'd still need the new capability though to include the patch in Weston that deals with out_fence -- unless Weston upstream can accept the patch after reviewing it without this newly added capability which would be redundant but it does solve my problem. Would this be acceptable? > > I think instead of hacking up some ill-defined 1.5 queue depth support, > what we should do is support queue depth > 1 properly. So: > > - Change atomic to support queue depth > 1, this needs to be a per-driver > thing due to a bunch of issues in driver code. Essentially drivers must > never look at obj->state pointers, and only ever look up state through > the passed-in drm_atomic_state * update container. > > - Aside: virtio should loose all it's empty hooks, there's no point in > that. > > - We fix virtio to send out the completion event at the end of this entire > pipeline, i.e. virtio code needs to take care of sending out the > crtc_state->event correctly. > > - We probably also want some kind of (maybe per-crtc) recommended queue > depth property so compositors know how many buffers to keep in flight. > Not sure about that. > > It's a bit more work, but also a lot less hacking around infrastructure in > dubious ways. > > Thoughts? [Kasireddy, Vivek] IIUC, you are suggesting that we should make it possible to submit a new atomic commit even though the completion event for the previous one has not come in yet. This may potentially solve my problem but it sounds very disruptive and not very useful for bare-metal cases. It also means that the compositors, DRM core and the drivers need to keep track of multiple states -- as opposed to new and old -- for all objects such as crtcs, planes, etc and account for multiple completion events. I guess it is doable but as you suggest it seems like a lot of work with many pitfalls ahead. Thanks, Vivek > > Cheers, Daniel > > > > > Cc: Daniel Vetter <daniel@xxxxxxxx> > > Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx> > > Cc: Pekka Paalanen <ppaalanen@xxxxxxxxx> > > Cc: Simon Ser <contact@xxxxxxxxxxx> > > Cc: Michel Dänzer <michel@xxxxxxxxxxx> > > Cc: Tina Zhang <tina.zhang@xxxxxxxxx> > > Cc: Dongwon Kim <dongwon.kim@xxxxxxxxx> > > > > Vivek Kasireddy (4): > > drm: Add a capability flag to support deferred out_fence signalling > > virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature > > drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd > > drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature > > > > drivers/gpu/drm/drm_file.c | 11 +++--- > > drivers/gpu/drm/drm_ioctl.c | 3 ++ > > drivers/gpu/drm/virtio/virtgpu_debugfs.c | 1 + > > drivers/gpu/drm/virtio/virtgpu_drv.c | 1 + > > drivers/gpu/drm/virtio/virtgpu_drv.h | 6 ++++ > > drivers/gpu/drm/virtio/virtgpu_fence.c | 9 +++++ > > drivers/gpu/drm/virtio/virtgpu_kms.c | 10 ++++-- > > drivers/gpu/drm/virtio/virtgpu_plane.c | 44 +++++++++++++++++++++++- > > drivers/gpu/drm/virtio/virtgpu_vq.c | 17 +++++++++ > > include/drm/drm_mode_config.h | 9 +++++ > > include/uapi/drm/drm.h | 1 + > > include/uapi/linux/virtio_gpu.h | 12 +++++++ > > 12 files changed, 117 insertions(+), 7 deletions(-) > > > > -- > > 2.30.2 > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch