Re: Support for 2D engines/blitters in V4L2 and DRM

Nicolas Dufresne <nicolas@xxxxxxxxxxxx> · Wed, 24 Apr 2019 13:43:03 -0400

On Wednesday, 24 April 2019 at 18:54 +0200, Michel Dänzer wrote:
> On 2019-04-24 5:44 p.m., Nicolas Dufresne wrote:
> > On Wednesday, 24 April 2019 at 17:06 +0200, Daniel Vetter wrote:
> > > On Wed, Apr 24, 2019 at 4:41 PM Paul Kocialkowski
> > > <paul.kocialkowski@xxxxxxxxxxx> wrote:
> > > > On Wed, 2019-04-24 at 16:39 +0200, Michel Dänzer wrote:
> > > > > On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
> > > > > > Rendering a video stream is more complex than what you describe here.
> > > > > > Whenever there is an unexpected delay (late delivery of a frame, for
> > > > > > example) you may end up in a situation where one frame is ready after
> > > > > > the targeted vblank. If there is another frame that targets the
> > > > > > following vblank and gets ready on time, the previous frame should be
> > > > > > replaced by the most recent one.
> > > > > > 
> > > > > > With fences, what happens is that even if you received the next frame
> > > > > > on time, naively replacing it is not possible, because we don't know
> > > > > > when the fence for the next frame will be signalled. If you simply
> > > > > > always replace the current frame, you may end up skipping a lot more
> > > > > > vblanks than you expect, and that results in jumpy playback.
> > > > > 
> > > > > So you want to be able to replace a queued flip with another one then.
> > > > > That doesn't necessarily require allowing more than one flip to be
> > > > > queued ahead of time.
> > > > 
> > > > There might be other ways to do it, but this one has plenty of
> > > > advantages.
> > > 
> > > The point of kms (well one of the reasons) was to separate the
> > > implementation of modesetting for specific hw from policy decisions
> > > like which frames to drop and how to schedule them. Kernel gives
> > > tools, userspace implements the actual protocols.
> > > 
> > > There's definitely a bit of a gap around scheduling flips for a specific
> > > frame or allowing to cancel/overwrite an already scheduled flip, but
> > > no one yet has come up with a clear proposal for new uapi + example
> > > implementation + userspace implementation + big enough support from
> > > other compositors that this is what they want too.
> 
> Actually, the ATOMIC_AMEND patches propose a way to replace a scheduled
> flip?
> 
> 
> > > > > Note that this can also be done in userspace with explicit fencing (by
> > > > > only selecting a frame and submitting it to the kernel after all
> > > > > corresponding fences have signalled), at least to some degree, but the
> > > > > kernel should be able to do it up to a later point in time and more
> > > > > reliably, with less risk of missing a flip for a frame which becomes
> > > > > ready just in time.
> > > > 
> > > > Indeed, but it would be great if we could do that with implicit fencing
> > > > as well.
> > > 
> > > 1. extract implicit fences from dma-buf. This part is just an idea,
> > > but easy to implement once we have someone who actually wants this.
> > > All we need is a new ioctl on the dma-buf to export the fences from
> > > the reservation_object as a sync_file (either the exclusive or the
> > > shared ones, selected with a flag).
> > > 2. do the exact same frame scheduling as with explicit fencing
> > > 3. supply explicit fences in your atomic ioctl calls - these should
> > > overrule any implicit fences (assuming correct kernel drivers, but we
> > > have helpers so you can assume they all work correctly).
> > > 
> > > By design this is possible, it's just that no one yet bothered enough
> > > to make it happen.
> > > -Daniel
> > 
> > I'm not sure I understand the workflow of this one. I'm all in favour
> > of leaving the hard work to userspace. Note that I have assumed explicit
> > fences from the start, I don't think implicit fences will ever exist in
> > v4l2, but I might be wrong. What I understood is that there was a
> > previous attempt in the past but it raised more issues than it actually
> > solved. So that being said, how do you handle exactly the following use
> > cases:
> > 
> >  - A frame was lost by the capture driver, but it was scheduled as being
> > the next buffer to render (normally the previous frame should remain).
> 
> Userspace just doesn't call into the kernel to flip to the lost frame,
> so the previous one remains.

We are stuck in a loop, you and me. Considering v4l2 to drm, where fences
don't exist on v4l2, it makes very little sense to bring up fences if
we are to wait on the fence in userspace. Unless of course you have
other operations before the end that make proper use of the fences.

> 
> >  - The scheduled frame is late for the next vblank (didn't signal on
> > time), a new one may be better for the next vblank, but we will only
> > know when its fence is signalled.
> 
> Userspace only selects a frame and submits it to the kernel after all
> its fences have signalled.
> 
> > Better in this context means that the presentation time of this frame is
> > closer to the next vblank time. Keep in mind that the idea is to
> > schedule the frames before they are signalled, in order to make the usage
> > of the fence useful in lowering the latency.
> 
> Fences are about signalling completion, not about low latency.

It can be used to remove a roundtrip through userspace at a very
time-sensitive moment. If you pass a dmabuf with its unsignalled fence
to a kernel driver, the driver can start the job on this dmabuf as soon
as the fence is signalled. If you always wait on a fence in userspace,
you have to wait for the userspace process to be scheduled, then
userspace will set up the drm atomic request or a similar action, which
may take some time and may require another process in the kernel to be
scheduled. This effectively adds a variable delay, a gap where nothing
is happening between two operations. This time is lost and contributes
to the overall operation latency.
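
To make the userspace-side wait concrete, here is a minimal sketch of
what "waiting on a fence in userspace" amounts to. It relies on the fact
that a sync_file fd becomes readable (POLLIN) once its fence has
signalled. The names wait_fence(), make_fake_fence() and
signal_fake_fence() are hypothetical, and the pipe is only a stand-in so
the sketch can be tried without any DRM or V4L2 hardware.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Block until fence_fd becomes readable or timeout_ms expires.
 * A sync_file fd reports POLLIN once its fence has signalled.
 * Returns 1 if signalled, 0 on timeout, -1 on error (errno is set).
 */
int wait_fence(int fence_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
    int ret;

    assert(fence_fd >= 0);

    /* Retry if a signal handler interrupted the wait. */
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret <= 0)
        return ret;             /* 0: timed out, -1: poll() failed */

    if (pfd.revents & (POLLERR | POLLNVAL)) {
        errno = EINVAL;
        return -1;
    }
    return 1;
}

/*
 * Stand-in for experimenting (assumption, not a real fence): the read
 * end of a pipe plays the fence fd, and writing one byte to the write
 * end plays the role of the fence signalling.
 */
int make_fake_fence(int fds[2])
{
    return pipe(fds);
}

int signal_fake_fence(int write_fd)
{
    return write(write_fd, "x", 1) == 1 ? 0 : -1;
}
```

Only after wait_fence() returns can userspace start building its atomic
request, so the wakeup and everything that follows sit on the critical
path. That is the variable delay described above.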

The benefit of fences we are looking for is being able to set up the
operations on the various compatible drivers before the fence is
signalled. This way, at the time-critical moment a driver can be fed
more jobs with no userspace roundtrip involved. It is also proposed to
use fences to return buffers to the v4l2 queue when they are freed,
which can in some conditions prevent, let's say, a capture driver from
skipping frames due to random scheduling delays.

> 
> With a display server, the client can send frames to the display server
> ahead of time, only the display server needs to wait for fences to
> signal before submitting frames to the kernel.
> 
> 
> > Of course, as Michel said, we could just always wait on the fence and
> > just schedule. But if you do that, why would you care about implementing
> > the fence in v4l2 to start with? DQBuf does just that already.
> 
> A fence is more likely to work out of the box with non-V4L-related code
> than DQBuf?

If you use DQBuf, you are guaranteed that the data has been produced. A
fence is not useful on a buffer that already contains the data you
would be waiting for. That's why the fence is provided in the RFC at
QBuf, basically when the free buffer is given to the v4l2 driver. QBuf
can also be passed a fence in the RFC, so if the buffer is not yet
free, the driver would wait on the fence before using it.

> 
> 