On May 26, 2021 13:15:08 Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
> Hey,
>
> On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
>> On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
>>> Typing out the Wayland protocol isn't the hard bit. If we just need to copy and sed syncobj to weirdsyncobj, no problem really, and it gives us a six-month head start on painful compositor-internal surgery whilst we work on common infrastructure to ship userspace fences around (mappable dmabuf with the sync bracketing? FD where every read() gives you the current value? memfd? other?).
>>
>> I feel like I should elaborate more about timelines. In my earlier reply, my commentary about timeline syncobj was mostly focused around helping people avoid typing. That's not really the full story, though, and I hope more context will help.
>>
>> First, let me say that timeline syncobj was designed as a mechanism to implement VK_KHR_timeline_semaphore without inserting future fences into the kernel. It's entirely designed around the needs of Vulkan drivers, not really as a window-system primitive. The semantics are designed around one driver communicating to another that new fences have been added and it's safe to kick off more rendering. I'm not convinced that it's the right object for window-systems and I'm also not convinced that it's a good idea to try and make a version of it that's a wrapper around a userspace memory fence. (I'm going to start typing UMF for userspace memory fence because it's long to type out.)
>>
>> Why? Well, the fundamental problem with timelines in general is trying to figure out when it's about to be done. But timeline syncobj solves this for us! It gives us this fancy super-useful ioctl! Right? Uh.... not as well as I'd like. Let's say we make a timeline syncobj that's a wrapper around a userspace memory fence. What do we do with that ioctl? As I mentioned above, the kernel doesn't have any clue when it will be triggered so that ioctl turns into an actual wait. That's no good because it creates unnecessary stalls.
>
> Yeah, I'm assuming that UMF will be a separate primitive. No problem. I also think that your submitted/completed thing is a non-problem: at this stage we're just throwing up our hands and admitting that we're letting userspace tie itself in knots, and giving it the tools to tie a sufficiently un-streetwise compositor in knots too. We're already crossing that Rubicon, so let's just embrace it and not try to design it out. Us compositors can handle the scheduling, really.
Ok, good. I think we're on the same page.
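[For readers following along: the "fancy super-useful ioctl" above is, I believe, the existing timeline syncobj wait. A minimal sketch of calling it with the wait-for-available flag; only the wrapper function name is invented. With dma_fence-backed points this returns as soon as a fence has been submitted for the point, long before it signals; with a UMF behind the point, the kernel has no submission event to observe, which is exactly the degeneration-into-a-real-wait problem described above.]

/* Sketch: "wait for submit" via today's timeline-syncobj uAPI.
 * struct drm_syncobj_timeline_wait and the flags are real uAPI from
 * drm.h (pulled in via xf86drm.h); wait_for_submit() is made up. */
#include <stdint.h>
#include <xf86drm.h>

static int wait_for_submit(int drm_fd, uint32_t handle, uint64_t point,
                           int64_t timeout_nsec)
{
        struct drm_syncobj_timeline_wait wait = {
                .handles = (uint64_t)(uintptr_t)&handle,
                .points = (uint64_t)(uintptr_t)&point,
                .timeout_nsec = timeout_nsec,
                .count_handles = 1,
                /* Return once a fence exists for the point, without
                 * waiting for it to actually signal. */
                .flags = DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
        };

        return drmIoctl(drm_fd, DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT, &wait);
}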
>> There's another potential solution here: Have each UMF be two timelines: submitted and completed. At the start of every batch that's supposed to trigger a UMF, we set the "submitted" side and then, when it completes, we set the "completed" side. Ok, great, now we can get at the "about to be done" with the submitted side, implement the ioctl, and we're all good, right? Sadly, no. There's no guarantee about how long a "batch" takes. So there's no universal timeout the kernel can apply. Also, if it does time out, the kernel doesn't know who to blame for the timeout and how to prevent itself from getting in trouble again. The compositor does, so in theory, given the right ioctls, it could detect the -ETIME and kill that client. Not a great solution.
>>
>> The best option I've been able to come up with for this is some sort of client-provided signal. Something where it says, as part of submit or somewhere else, "I promise I'll be done soon" where that promise comes with dire consequences if it's not. At that point, we can turn the UMF and a particular wait value into a one-shot fence like a dma_fence or sync_file, or signal a syncobj on it. If it ever times out, we kick their context. In Vulkan terminology, they get VK_ERROR_DEVICE_LOST. There are two important bits here. First, it's based on a client-provided thing. With a fully timeline model and wait-before-signal, we can't infer when something is about to be done. Only the client knows when it submitted its last node in the dependency graph and the whole mess is unblocked. Second, the dma_fence is created within the client's driver context. If it's created compositor-side, the kernel doesn't know who to blame if things go badly. If we create it in the client, it's pretty easy to make context death on -ETIME part of the contract.
>>
>> (Before danvet jumps in here and rants about how UMF -> dma_fence isn't possible, I haven't forgotten. I'm pretending, for now, that we've solved some of those problems.)
>
> Funny how we've come full circle to the original proposal here ...
>
> If we really want a kernel primitive for this - and I think it's a good idea, since it can help surface 'badness' in a way which is observable by e.g. session managers in a way analogous to cgroup stats and controls - how about this for a counter-proposal? Client exports a FD for its context/queue and sends it to winsys as part of setup, compositor can ioctl() on that to kill it, which lands in the same zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger hammer than per-sync-point primitives, but you as a client have to accept the social contract that if you want to participate in a session, your context has to be making forward progress and you aren't writing cheques the compositor can't cash.
The compositor already has that. It can kick the client's Wayland protocol connection. Banning the context from the kernel might be nice too but kicking it is probably sufficient.
Side-note to danvet: Do we need a plan for UMF with persistent contexts? My gut says that's a very bad idea, but this made me think I should at least pose the question.
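[To make the submitted/completed idea quoted above concrete, here is a hedged sketch. Nothing like this exists; the layout and all names are invented for illustration: a UMF as two monotonic counters in shared memory.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct umf_pair {
        _Atomic uint64_t submitted;  /* bumped when a batch is queued  */
        _Atomic uint64_t completed;  /* bumped when that batch retires */
};

/* "About to be done": work for `point` has been queued but hasn't
 * retired yet.  The catch described above lives in the gap between
 * the two counters: nothing bounds how long a batch takes, so the
 * kernel can't derive a universal timeout from this, nor assign
 * blame when one expires. */
static bool umf_about_to_be_done(struct umf_pair *f, uint64_t point)
{
        return atomic_load(&f->submitted) >= point &&
               atomic_load(&f->completed) < point;
}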
> I'm also going to pre-emptively agree with other-Dan; I'm extremely wary of anything which tries to make UMF look even a little bit like sync_file. The requirements to support them are so wildly different that I'd almost rather a totally orthogonal interface so that there's no danger of confusing the two. Us sophisticates on this thread can eat the mild annoyance of typing out separate codepaths, but it's much worse for anyone else who may look at a UMF wolf in dma_fence sheep's clothing, then only later be substantially more annoyed when they realise that it's not anything like they thought it was.
>
> So let's keep sync_file for what it is, and for UMF, since the usage is so radically different, build out whatever we do around making the uAPI as useful as possible for what we want to do with it. The real complexity in handling the difference between UMF and 'real' fences is in how they behave, not in how they look.
Sounds good.
>> Another option is to just stall on the UMF until it's done. Yeah, kind-of terrible and high-latency, but it always works and doesn't involve any complex logic to kill clients. If a client never gets around to signaling a fence, it just never repaints. The compositor keeps going like nothing's wrong. Maybe, if the client submits lots of frames without ever triggering, it'll hit some max queue depth somewhere and kill it but that's it. More likely, the client's vkAcquireNextImage will start timing out and it'll crash.
>>
>> I suspect where we might actually land is some combination of the two depending on client choice. If the client wants to be dumb, it gets the high-latency always-works path. If the client really wants lowest-latency VRR, it has to take the smarter path and risk VK_ERROR_DEVICE_LOST if it misses too far.
>
> We already have to handle unresponsive clients. If your browser livelocks today (say because it's Chrome and you hotunplug your monitor at the wrong time with active media playback in an inactive tab in an inactive window ... hypothetically),
That's an oddly specific hypothetical...
> your Wayland server notices that it isn't responding to pings, throws up the 'do you want to force-quit?' dialog and kills the client; it's actually really simple logic. So we just hook unsignaled fences up to the same. (And, if we have the context-kill primitive, trigger that on our way out.)
>
> So yeah, we already have all the complexity points to put particular surface trees in limbo (thanks to subsurface sync mode), we already have all the complexity points to separate realised surface trees from pixels on screen, and we already have the complexity points for different parts of the surface trees being rendered at different times. Checking on fence progression is just a little more typing around those interface points which already exist, and zapping clients is utterly trivial.
👍
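[To spell out the "just stall" option quoted a little further up: the compositor never waits at all, it simply declines to latch a buffer whose fence value hasn't been reached yet and re-checks on the next repaint cycle. Everything in this sketch is invented for illustration: a UMF as a monotonically increasing 64-bit value in shared memory, plus a pending commit that references it.]

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct umf {
        _Atomic uint64_t value;   /* client writes, compositor reads  */
};

struct pending_commit {
        struct umf *fence;        /* fence the attached buffer waits on */
        uint64_t wait_value;      /* ready once fence->value >= this    */
        void *buffer;             /* stand-in for the attached buffer   */
};

/* Called from the repaint loop: returns the buffer to present this
 * frame, or NULL to keep showing the previous one.  No waiting, no
 * timeouts: an unsignaled fence just means the surface never
 * repaints, and the compositor carries on regardless. */
static void *latch_if_ready(struct pending_commit *p)
{
        if (!p || atomic_load(&p->fence->value) < p->wait_value)
                return NULL;
        return p->buffer;
}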
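[And the zap path really is a little typing. A sketch of hanging fence policing off the same timer logic as the existing ping timeout: the libwayland-server calls are real, the surrounding state is made up. Arm the timer when you start waiting on a client's fence, disarm it when the fence signals, zap the client if it fires first.]

#include <wayland-server-core.h>

struct fence_watchdog {
        struct wl_client *client;
        struct wl_event_source *timer;
};

static int fence_watchdog_fired(void *data)
{
        struct fence_watchdog *w = data;

        /* Fence never signaled within the grace period: same policy
         * as a missed pong.  If a context-kill primitive exists,
         * trigger it here on the way out too. */
        wl_client_destroy(w->client);
        return 0;
}

static void fence_watchdog_arm(struct wl_event_loop *loop,
                               struct fence_watchdog *w, int grace_ms)
{
        w->timer = wl_event_loop_add_timer(loop, fence_watchdog_fired, w);
        wl_event_source_timer_update(w->timer, grace_ms);
}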
>> But the point of all of this is, neither of the above two paths have anything to do with the compositor calling a "wait for submit" ioctl. Building a design around that and baking it into protocol is, IMO, a mistake. I don't see any valid way to handle this mess without "wait for submit" either not existing or existing only client-side for the purposes of WSI.
>
> I'm still on the fence (sorry) about a wait-before-submit ioctl. For the sync_file-based timeline syncobjs that we have today, yes it is helpful, and we do already have it, it's just the wrong shape in being sleep rather than epoll.
I still don't see why we're still talking about timeline syncobj...
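[The "sleep rather than epoll" complaint, spelled out: because the wait ioctl only blocks, a poll()-driven compositor has to park a helper thread in it and bounce the wakeup through an eventfd it can actually poll on, one thread per outstanding wait, spawned via pthread_create(). A sketch with invented names; wait_for_submit() is the wrapper from earlier in the thread.]

#include <stdint.h>
#include <unistd.h>

struct wait_task {
        int drm_fd;
        uint32_t syncobj_handle;
        uint64_t point;
        int notify_fd;  /* eventfd(0, EFD_CLOEXEC), in the main poll set */
};

static void *syncobj_wait_thread(void *arg)
{
        struct wait_task *t = arg;
        uint64_t one = 1;

        /* Burn a whole thread sleeping in the kernel... */
        wait_for_submit(t->drm_fd, t->syncobj_handle, t->point, INT64_MAX);
        /* ...just to turn "it woke up" into something epoll-shaped. */
        write(t->notify_fd, &one, sizeof(one));
        return NULL;
}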
> For UMF, taking it as a given that the kernel really has no visibility at all into syncpoint progress, then the kernel is conceptually a worse place to spin-sleep than userspace is, because why account the CPU burn to a kthread rather than a real PID, and lose latency/efficiency on context switches when you do wake up?
>
> But also, the kernel is conceptually the best place to spin-sleep, because it can fuse waits and do better wakeup quantisation than userspace can. And I'm still hopeful that the IHVs and Windows can both step away from the postmodern 'synchronisation doesn't mean anything anymore, just poll in a lap-burning loop' approach that we've been presented (sorry) with, where we at least get doorbells which allow the kernel to do polling much smarter than quantising timers ('this fence might not have triggered yet, but _something_ happened which might have triggered it so why not check?').
I think we can, and do, do something better than just poll on the memory. I'm not sure of the details, but I've been told that we can set some sort of interrupt-like thing on the address so it's not actually a spin. Even without that, some hardware has a way for a command buffer to trigger an interrupt. If the protocol is to write memory and then trigger an interrupt rather than just write memory, that gives us something of a doorbell. Not as convenient, maybe, but it'd help with power consumption, etc.
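[A userspace model of that write-then-interrupt protocol, with an eventfd (created elsewhere via eventfd(2)) standing in for the doorbell; everything here is illustrative. The waiter never spins: it sleeps in poll() and only re-checks the fence value when something rings the bell, which is exactly the "_something_ happened which might have triggered it" pattern.]

#include <poll.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

static void fence_signal(_Atomic uint64_t *fence, uint64_t value,
                         int doorbell_fd)
{
        uint64_t one = 1;

        atomic_store(fence, value);              /* write the memory...   */
        write(doorbell_fd, &one, sizeof(one));   /* ...then ring the bell */
}

static void fence_wait(_Atomic uint64_t *fence, uint64_t value,
                       int doorbell_fd)
{
        struct pollfd pfd = { .fd = doorbell_fd, .events = POLLIN };
        uint64_t drain;

        while (atomic_load(fence) < value) {
                poll(&pfd, 1, -1);                        /* sleep, don't spin */
                read(doorbell_fd, &drain, sizeof(drain)); /* ack the doorbell  */
        }
}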
--Jason
> On the other other hand, the only winsys case for burning poll in a tight loop is flipping as quickly as possible straight to a VRR display. In that case, you're definitely running on mains power so you're just melting the polar ice caps rather than your lap, and you've got everything fully lit up anyway so the power cost of polling is immaterial. For FRR, the compositor already has a fixed deadline at which it will wake up and make a hard binding decision about which image to present - this includes XR as well. So we don't have to worry about optimising a polling loop, because there isn't one: we wake up once, we check once, and if the client's missed then too bad, try again next frame.
>
> As you can see, much like userspace memory fences, my position on which way to go here is not knowable upfront, and depends on when exactly you observe it. Hopefully someone can come back with an argument compelling enough either way that I have something better to do than to try to pun my way out of having more hands than Ganesh. I don't think it's material to the design or implementation of winsys support though.
>
> Cheers,
> Daniel