On Wed, May 26, 2021 at 11:32:49PM -0500, Jason Ekstrand wrote:
> On May 26, 2021 13:15:08 Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
>
> > Hey,
> >
> > On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
> > > On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
> > > > Typing out the Wayland protocol isn't the hard bit. If we just need to
> > > > copy and sed syncobj to weirdsyncobj, no problem really, and it gives
> > > > us a six-month head start on painful compositor-internal surgery
> > > > whilst we work on common infrastructure to ship userspace fences
> > > > around (mappable dmabuf with the sync bracketing? FD where every
> > > > read() gives you the current value? memfd? other?).
> > >
> > > I feel like I should elaborate more about timelines. In my earlier
> > > reply, my commentary about timeline syncobj was mostly focused around
> > > helping people avoid typing. That's not really the full story,
> > > though, and I hope more context will help.
> > >
> > > First, let me say that timeline syncobj was designed as a mechanism to
> > > implement VK_KHR_timeline_semaphore without inserting future fences
> > > into the kernel. It's entirely designed around the needs of Vulkan
> > > drivers, not really as a window-system primitive. The semantics are
> > > designed around one driver communicating to another that new fences
> > > have been added and it's safe to kick off more rendering. I'm not
> > > convinced that it's the right object for window-systems and I'm also
> > > not convinced that it's a good idea to try and make a version of it
> > > that's a wrapper around a userspace memory fence. (I'm going to start
> > > typing UMF for userspace memory fence because it's long to type out.)
> > >
> > > Why? Well, the fundamental problem with timelines in general is
> > > trying to figure out when it's about to be done. But timeline syncobj
> > > solves this for us! It gives us this fancy super-useful ioctl!
> > > Right? Uh.... not as well as I'd like. Let's say we make a timeline
> > > syncobj that's a wrapper around a userspace memory fence. What do we
> > > do with that ioctl? As I mentioned above, the kernel doesn't have any
> > > clue when it will be triggered so that ioctl turns into an actual
> > > wait. That's no good because it creates unnecessary stalls.
> >
> > Yeah, I'm assuming that UMF will be a separate primitive. No problem.
> > I also think that your submitted/completed thing is a non-problem: at
> > this stage we're just throwing up our hands and admitting that we're
> > letting userspace tie itself in knots, and giving it the tools to tie
> > a sufficiently un-streetwise compositor in knots too. We're already
> > crossing that Rubicon, so let's just embrace it and not try to design
> > it out. Us compositors can handle the scheduling, really.
>
> Ok, good. I think we're on the same page.
>
> > > There's another potential solution here: Have each UMF be two
> > > timelines: submitted and completed. At the start of every batch
> > > that's supposed to trigger a UMF, we set the "submitted" side and
> > > then, when it completes, we set the "completed" side. Ok, great, now
> > > we can get at the "about to be done" with the submitted side,
> > > implement the ioctl, and we're all good, right? Sadly, no. There's
> > > no guarantee about how long a "batch" takes. So there's no universal
> > > timeout the kernel can apply. Also, if it does time out, the kernel
> > > doesn't know who to blame for the timeout and how to prevent itself
> > > from getting in trouble again. The compositor does, so in theory,
> > > given the right ioctls, it could detect the -ETIME and kill that
> > > client. Not a great solution.
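For reference, the submitted/completed pair described above is just two
monotonically increasing counters in memory that both sides can map. A
minimal C sketch, purely illustrative (all names invented here, not any
driver's actual interface):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Two-timeline UMF: "submitted" is bumped when the batch that will
 * signal a point is queued, "completed" when it actually finishes. */
struct umf_timeline {
        _Atomic uint64_t submitted;
        _Atomic uint64_t completed;
};

/* Client side, wrapped around each batch that signals 'point'. */
static void umf_mark_submitted(struct umf_timeline *t, uint64_t point)
{
        atomic_store_explicit(&t->submitted, point, memory_order_release);
        /* ... queue the batch; its completion writes t->completed ... */
}

/* "About to be done": queued but not yet finished. This is exactly the
 * window with no upper bound, so the kernel has nothing to hang a
 * universal timeout on. */
static bool umf_about_to_be_done(struct umf_timeline *t, uint64_t point)
{
        return atomic_load_explicit(&t->submitted, memory_order_acquire) >= point &&
               atomic_load_explicit(&t->completed, memory_order_acquire) < point;
}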
> > > The best option I've been able to come up with for this is some sort
> > > of client-provided signal. Something where it says, as part of submit
> > > or somewhere else, "I promise I'll be done soon" where that promise
> > > comes with dire consequences if it's not. At that point, we can turn
> > > the UMF and a particular wait value into a one-shot fence like a
> > > dma_fence or sync_file, or signal a syncobj on it. If it ever times
> > > out, we kick their context. In Vulkan terminology, they get
> > > VK_ERROR_DEVICE_LOST. There are two important bits here: First is
> > > that it's based on a client-provided thing. With a fully timeline
> > > model and wait-before-signal, we can't infer when something is about
> > > to be done. Only the client knows when it submitted its last node in
> > > the dependency graph and the whole mess is unblocked. Second is that
> > > the dma_fence is created within the client's driver context. If it's
> > > created compositor-side, the kernel doesn't know who to blame if
> > > things go badly. If we create it in the client, it's pretty easy to
> > > make context death on -ETIME part of the contract.
> > >
> > > (Before danvet jumps in here and rants about how UMF -> dma_fence
> > > isn't possible, I haven't forgotten. I'm pretending, for now, that
> > > we've solved some of those problems.)
> >
> > Funny how we've come full circle to the original proposal here ...
> >
> > If we really want a kernel primitive for this - and I think it's a
> > good idea, since it can help surface 'badness' in a way which is
> > observable by e.g. session managers, analogous to cgroup stats and
> > controls - how about this for a counter-proposal? Client exports a
> > FD for its context/queue and sends it to winsys as part of setup;
> > the compositor can ioctl() on that to kill it, which lands in the
> > same zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a
> > bigger hammer than per-sync-point primitives, but you as a client
> > have to accept the social contract that if you want to participate
> > in a session, your context has to be making forward progress and you
> > aren't writing cheques the compositor can't cash.
>
> The compositor already has that. It can kick the client's Wayland
> protocol connection. Banning the context from the kernel might be nice
> too but kicking it is probably sufficient.
>
> Side-note to danvet: Do we need a plan for UMF with persistent
> contexts? My gut says that's a very bad idea but this made me think I
> should at least pose the question.

Yeah, UMF ctx mode needs to require non-persistent ctx mode too or it
just goes horribly wrong everywhere.

> > I'm also going to pre-emptively agree with other-Dan; I'm extremely
> > wary of anything which tries to make UMF look even a little bit like
> > sync_file. The requirements to support them are so wildly different
> > that I'd almost rather a totally orthogonal interface so that there's
> > no danger of confusing the two. Us sophisticates on this thread can
> > eat the mild annoyance of typing out separate codepaths, but it's much
> > worse for anyone else who may look at a UMF wolf in dma_fence sheep's
> > clothing, then only later be substantially more annoyed when they
> > realise that it's not anything like they thought it was.
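As a strawman for the client-created one-shot export described above:
the client hands the kernel a (UMF, wait value) pair plus its "I promise
I'll be done soon" deadline and gets back a sync_file, and the
documented consequence of missing the deadline is that the client's own
context gets banned. Everything below is invented for illustration; no
driver exposes anything like this today:

#include <linux/ioctl.h>
#include <linux/types.h>

struct umf_export_sync_file {
        __u32 umf_handle;       /* userspace memory fence object */
        __u32 flags;
        __u64 wait_value;       /* point the sync_file signals at */
        __u64 deadline_ns;      /* "soon": miss it and the context dies */
        __s32 sync_file_fd;     /* out: one-shot fence fd for the winsys */
        __u32 pad;
};

/* Hypothetical ioctl, loosely modeled on existing DRM uAPI style. */
#define UMF_IOCTL_EXPORT_SYNC_FILE \
        _IOWR('d', 0x42, struct umf_export_sync_file)

Because the fd is created in the client's driver context, a blown
deadline can be charged to that context (VK_ERROR_DEVICE_LOST and/or a
ban) rather than to whichever compositor happened to be waiting.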
> > So let's keep sync_file for what it is, and for UMF, since the usage
> > is so radically different, build out whatever we do around making the
> > uAPI as useful as possible for what we want to do with it. The real
> > complexity in handling the difference between UMF and 'real' fences is
> > in how they behave, not in how they look.
>
> Sounds good.
>
> > > Another option is to just stall on the UMF until it's done. Yeah,
> > > kind-of terrible and high-latency, but it always works and doesn't
> > > involve any complex logic to kill clients. If a client never gets
> > > around to signaling a fence, it just never repaints. The compositor
> > > keeps going like nothing's wrong. Maybe, if the client submits lots
> > > of frames without ever triggering, it'll hit some max queue depth
> > > somewhere and get killed, but that's it. More likely, the client's
> > > vkAcquireNextImage will start timing out and it'll crash.
> > >
> > > I suspect where we might actually land is some combination of the
> > > two depending on client choice. If the client wants to be dumb, it
> > > gets the high-latency always-works path. If the client really wants
> > > lowest-latency VRR, it has to take the smarter path and risk
> > > VK_ERROR_DEVICE_LOST if it misses too far.
> >
> > We already have to handle unresponsive clients. If your browser
> > livelocks today (say because it's Chrome and you hotunplug your
> > monitor at the wrong time with active media playback in an inactive
> > tab in an inactive window ... hypothetically),
>
> That's an oddly specific hypothetical...
>
> > your Wayland server
> > notices that it isn't responding to pings, throws up the 'do you want
> > to force-quit?' dialog and kills the client; it's actually really
> > simple logic. So we just hook unsignaled fences up to the same. (And,
> > if we have the context-kill primitive, trigger that on our way out.)
> >
> > So yeah, we already have all the complexity points to put particular
> > surface trees in limbo (thanks to subsurface sync mode), we already
> > have all the complexity points to separate realised surface trees from
> > pixels on screen, and we already have the complexity points for
> > different parts of the surface trees being rendered at different
> > times. Checking on fence progression is just a little more typing
> > around those interface points which already exist, and zapping clients
> > is utterly trivial.
>
> 👍
>
> > > But the point of all of this is, neither of the above two paths has
> > > anything to do with the compositor calling a "wait for submit"
> > > ioctl. Building a design around that and baking it into protocol
> > > is, IMO, a mistake. I don't see any valid way to handle this mess
> > > without "wait for submit" either not existing or existing only
> > > client-side for the purposes of WSI.
> >
> > I'm still on the fence (sorry) about a wait-before-submit ioctl. For
> > the sync_file-based timeline syncobjs that we have today, yes it is
> > helpful, and we do already have it, it's just the wrong shape in being
> > sleep rather than epoll.
>
> I still don't see why we're still talking about timeline syncobj...
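The compositor half of this really is that small. A sketch of the
wake-once-per-deadline logic described above, with made-up helper names;
a real compositor would hang this off its existing repaint loop and ping
machinery:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_MISSED_FRAMES 120   /* ~2s at 60Hz before the force-quit path */

struct client_frame {
        _Atomic uint64_t *fence;        /* mapped UMF value */
        uint64_t wait_value;            /* point this frame waits for */
        unsigned int missed_frames;
};

/* Hypothetical hooks into the compositor's existing machinery. */
void present_new_frame(struct client_frame *f);
void present_previous_frame(struct client_frame *f);
void kill_client(struct client_frame *f);       /* same path as ping timeout */

static bool frame_ready(struct client_frame *f)
{
        return atomic_load_explicit(f->fence, memory_order_acquire) >=
               f->wait_value;
}

/* Called exactly once at the hard binding deadline: check once, and if
 * the client missed, too bad, try again next frame. */
static void pick_frame(struct client_frame *f)
{
        if (frame_ready(f)) {
                f->missed_frames = 0;
                present_new_frame(f);
        } else if (++f->missed_frames > MAX_MISSED_FRAMES) {
                kill_client(f);
        } else {
                present_previous_frame(f);
        }
}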
>
> > For UMF, taking it as a given that the kernel really has no visibility
> > at all into syncpoint progress, then the kernel is conceptually a
> > worse place to spin-sleep than userspace is, because why account the
> > CPU burn to a kthread rather than a real PID, and lose
> > latency/efficiency on context switches when you do wake up?
> >
> > But also, the kernel is conceptually the best place to spin-sleep,
> > because it can fuse waits and do better wakeup quantisation than
> > userspace can. And I'm still hopeful that the IHVs and Windows can
> > both step away from the postmodern 'synchronisation doesn't mean
> > anything anymore, just poll in a lap-burning loop' approach that we've
> > been presented (sorry) with, where we at least get doorbells which
> > allow the kernel to do polling much smarter than quantising timers
> > ('this fence might not have triggered yet, but _something_ happened
> > which might have triggered it so why not check?').
>
> I think we can, and do, do something better than just poll on the
> memory. I'm not sure on the details but I've been told that we can set
> some sort of interrupt-like thing on the address so it's not actually a
> spin. Even without that, some hardware has a way for a command buffer
> to trigger an interrupt. If the protocol is to write memory and then
> trigger an interrupt rather than just write memory, that gives us
> something of a doorbell. Not as convenient, maybe, but it'd help with
> power consumption, etc.

Spoiler, but the i915 plan is indeed to suck less than CPU-spinning on
memory. The icky part is converting this into a protocol that can be
used across processes and stuff like that.
-Daniel

> --Jason
>
> > On the other other hand, the only winsys case for burning poll in a
> > tight loop is flipping as quickly as possible straight to a VRR
> > display. In that case, you're definitely running on mains power so
> > you're just melting the polar ice caps rather than your lap, and
> > you've got everything fully lit up anyway so the power cost of polling
> > is immaterial. For FRR, the compositor already has a fixed deadline at
> > which it will wake up and make a hard binding decision about which
> > image to present - this includes XR as well. So we don't have to worry
> > about optimising a polling loop, because there isn't one: we wake up
> > once, we check once, and if the client's missed then too bad, try
> > again next frame.
> >
> > As you can see, much like userspace memory fences, my position on
> > which way to go here is not knowable upfront, and depends on when
> > exactly you observe it. Hopefully someone can come back with an
> > argument compelling enough either way that I have something better to
> > do than to try to pun my way out of having more hands than Ganesh. I
> > don't think it's material to the design or implementation of winsys
> > support though.
> >
> > Cheers,
> > Daniel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch