On Tue, Apr 20, 2021 at 9:10 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Tue, Apr 20, 2021 at 1:59 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not the kernel's problem.
> >
> > Well, the path of inner peace begins with four words. "Not my fucking problem."
> >
> > But I'm not that much concerned about the kernel, but rather about important userspace processes like X, Wayland, SurfaceFlinger etc...
> >
> > I mean attaching a page to a sync object and allowing waiting/signalling from both the CPU as well as the GPU side is not so much of a problem.
> >
> > > You have to somehow handle that, e.g. perhaps with conditional rendering and just using the old frame in compositing if the new one doesn't show up in time.
> >
> > Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level?
>
> For opengl we provide all the same guarantees, so if you get one of these you just block until the fence is signalled. Doing that properly means a submit thread to support drm_syncobj, like for vulkan.
>
> For vulkan we probably want to represent these as proper vk timeline objects, and the vulkan way is to just let the application (well, the compositor) deal with it here. If they import timelines from untrusted other parties, they need to handle the potential fallback of being lied to. This is "not vulkan's fucking problem", because that entire "with great power (well, performance) comes great responsibility" is the entire vk design paradigm.
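FWIW, with 1.2 timeline semaphores the compositor-side fallback could look roughly like this. Untested sketch; the handle names and the 10 ms budget are made up:

   VkSemaphoreWaitInfo wait_info = {
      .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
      .semaphoreCount = 1,
      .pSemaphores = &imported_timeline,  /* VkSemaphore imported from the client */
      .pValues = &promised_point,         /* uint64_t point the client said it would signal */
   };

   /* Bounded wait instead of trusting the client to ever signal. */
   VkResult res = vkWaitSemaphores(device, &wait_info, 10ull * 1000 * 1000 /* 10 ms in ns */);
   if (res == VK_TIMEOUT)
      composite_with(previous_frame);  /* client lied or is slow: reuse the old frame */
   else if (res == VK_SUCCESS)
      composite_with(new_frame);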
The security aspects are currently an unsolved problem in Vulkan. The assumption is that everyone trusts everyone else to be careful with the scissors. It's a great model! I think we can do something in Vulkan to allow apps to protect themselves a bit, but it's tricky and non-obvious.

--Jason

> Glamor will just rely on GL providing a nice package over the harsh reality of gpus, like usual.
>
> So I guess step 1 here for GL would be to provide some kind of import/export of timeline syncobj, including properly handling this "future/indefinite fences" aspect of them, with a submit thread and everything.
>
> -Daniel
>
> > Regards,
> > Christian.
> >
> > On 20.04.21 at 13:16, Daniel Vetter wrote:
> > > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> > >> Daniel, are you suggesting that we should skip any deadlock prevention in the kernel, and just let userspace wait for and signal any fence it has access to?
> > >
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not the kernel's problem. The only criterion is that the kernel itself must never rely on these userspace fences, except for stuff like implementing optimized cpu waits. And in those we must always guarantee that the userspace process remains interruptible.
> > >
> > > It's a completely different world from dma_fence based kernel fences, whether those are implicit or explicit.
> > >
> > >> Do you have any concern with the deprecation/removal of BO fences in the kernel assuming userspace is only using explicit fences? Any concern with the submit and return fences for modesetting and other producer<->consumer scenarios?
> > >
> > > Let me work on the full reply to your rfc first, because there's a lot of details here and nuance.
> > > -Daniel
> > >
> > >> Thanks,
> > >> Marek
> > >>
> > >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > >>
> > >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> > >>>> On 19.04.21 at 17:48, Jason Ekstrand wrote:
> > >>>>> Not going to comment on everything on the first pass...
> > >>>>>
> > >>>>> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@xxxxxxxxx> wrote:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> This is our initial proposal for explicit fences everywhere and new memory management that doesn't use BO fences. It's a redesign of how Linux graphics drivers work, and it can coexist with what we have now.
> > >>>>>>
> > >>>>>> 1. Introduction
> > >>>>>> (skip this if you are already sold on explicit fences)
> > >>>>>>
> > >>>>>> The current Linux graphics architecture was initially designed for GPUs with only one graphics queue, where everything was executed in submission order and per-BO fences were used for memory management and CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple queues were added on top, which required the introduction of implicit GPU-GPU synchronization between queues of different processes using per-BO fences. Recently, even parallel execution within one queue was enabled, where a command buffer starts draws and compute shaders but doesn't wait for them, enabling parallelism between back-to-back command buffers. Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler was created to enable all those use cases, and it's the only reason why the scheduler exists.
> > >>>>>>
> > >>>>>> The GPU scheduler, implicit synchronization, BO-fence-based memory management, and the tracking of per-BO fences increase CPU overhead and latency, and reduce parallelism. There is a desire to replace all of them with something much simpler. Below is how we could do it.
> > >>>>>>
> > >>>>>> 2. Explicit synchronization for window systems and modesetting
> > >>>>>>
> > >>>>>> The producer is an application and the consumer is a compositor or a modesetting driver.
> > >>>>>>
> > >>>>>> 2.1. The Present request
> > >>>>>>
> > >>>>>> As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
> > >>>>>> - The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.
> > >>>>>> - The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.
> > >>>>>
> > >>>>> I'm not sure syncobj is what we want. In the Intel world we're trying to go even further, to something we're calling "userspace fences", which are a timeline implemented as a single 64-bit value in some CPU-mappable BO. The client writes a higher value into the BO to signal the timeline.
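To make that concrete: the whole synchronization object is just a monotonically increasing 64-bit seqno in a shared, CPU-mapped BO. A minimal userspace sketch (names made up; the blocking wait is where a kernel helper would come in):

   #include <stdatomic.h>
   #include <stdbool.h>
   #include <stdint.h>

   struct userspace_fence {
      _Atomic uint64_t *seqno;  /* points into the shared, CPU-mapped BO */
   };

   /* Producer: signal timeline point 'value' by bumping the seqno (never backwards). */
   static void fence_signal(struct userspace_fence *f, uint64_t value)
   {
      uint64_t cur = atomic_load_explicit(f->seqno, memory_order_relaxed);
      while (cur < value &&
             !atomic_compare_exchange_weak_explicit(f->seqno, &cur, value,
                                                    memory_order_release,
                                                    memory_order_relaxed))
         ;
   }

   /* Consumer: non-blocking check whether point 'value' has been reached. */
   static bool fence_is_signalled(struct userspace_fence *f, uint64_t value)
   {
      return atomic_load_explicit(f->seqno, memory_order_acquire) >= value;
   }

   /* Blocking waits would go through whatever helper the kernel ends up
    * providing, so the process sleeps instead of spinning. */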
> > >>>> Well, that is exactly what our Windows guys have suggested as well, but it strongly looks like this isn't sufficient.
> > >>>>
> > >>>> First of all you run into security problems when any application can just write any value to that memory location. Just imagine an application sets the counter to zero and X waits forever for some rendering to finish.
> > >>>
> > >>> The thing is, with userspace fences the security boundary problem moves into userspace entirely. And it really doesn't matter whether the event you're waiting on doesn't complete because the other app crashed, was stupid, or intentionally gave you a wrong fence point: you have to somehow handle that, e.g. perhaps with conditional rendering and just using the old frame in compositing if the new one doesn't show up in time. Or something like that. So trying to get the kernel involved, but also not so much involved, sounds like a bad design to me.
> > >>>
> > >>>> In addition to that, in such a model you can't determine which queue is the guilty one in case of a hang, and you can't reset the synchronization primitives in case of an error.
> > >>>>
> > >>>> Apart from that this is rather inefficient, e.g. we don't have any way to prevent priority inversion when it's used as a synchronization mechanism between different GPU queues.
> > >>>
> > >>> Yeah, but you can't have it both ways. Either all the scheduling and fence handling in the kernel is a problem, or you actually want to schedule in the kernel. hw seems to definitely move towards the more stupid spinlock-in-hw model (and direct submit from userspace and all that), priority inversions be damned. I'm really not sure we should fight that - if it's really that inefficient then maybe hw will add support for waiting on sync constructs in hardware, or at least be smarter about scheduling other stuff. E.g. on intel hw both the kernel scheduler and the fw scheduler know when you're spinning on a hw fence (whether userspace or kernel doesn't matter) and plug in something else. Add in a bit of hw support to watch cachelines, and you have something which can handle both directions efficiently.
> > >>>
> > >>> Imo given where hw is going, we shouldn't try to be too clever here. The only thing we do need to provision for is being able to do cpu side waits without spinning. And that should probably still be done in a fairly gpu specific way.
> > >>> -Daniel
> > >>>
> > >>>> Christian.
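For the cpu side wait, I'd picture something in this spirit: spin briefly, then back off and sleep, with the sleep eventually replaced by whatever kernel wait helper we end up with. Pure illustration, untested:

   #include <errno.h>
   #include <stdatomic.h>
   #include <stdint.h>
   #include <time.h>

   /* Wait until *seqno >= value, or until timeout_ns elapses.
    * Returns 0 on success, -ETIMEDOUT on timeout. */
   static int fence_wait(_Atomic uint64_t *seqno, uint64_t value, int64_t timeout_ns)
   {
      struct timespec start, now;
      clock_gettime(CLOCK_MONOTONIC, &start);

      for (unsigned spins = 0; ; spins++) {
         if (atomic_load_explicit(seqno, memory_order_acquire) >= value)
            return 0;

         clock_gettime(CLOCK_MONOTONIC, &now);
         int64_t elapsed = (int64_t)(now.tv_sec - start.tv_sec) * 1000000000 +
                           (now.tv_nsec - start.tv_nsec);
         if (elapsed >= timeout_ns)
            return -ETIMEDOUT;

         if (spins > 1000) {
            /* Stand-in for a proper kernel wait; the point is just that
             * we stop burning a CPU after a short spin. */
            struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 };
            nanosleep(&ts, NULL);
         }
      }
   }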
> > >>>>> The kernel then provides some helpers for waiting on them reliably and without spinning. I don't expect everyone to support these right away but, if we're going to re-plumb userspace for explicit synchronization, I'd like to make sure we take this into account so we only have to do it once.
> > >>>>>
> > >>>>>> Deadlock mitigation to recover from segfaults:
> > >>>>>> - The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.
> > >>>>>
> > >>>>> This isn't clear to me. Yes, if we're using anything dma-fence based like syncobj, this is true. But it doesn't seem totally true as a general statement.
> > >>>>>
> > >>>>>> - If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.
> > >>>>>> - If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.
> > >>>>>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.
> > >>>>>
> > >>>>> What do you mean by "all"? All fences that were supposed to be signaled by the hung context?
> > >>>>>
> > >>>>>> Other window system requests can follow the same idea.
> > >>>>>>
> > >>>>>> Merged fences, where one fence object contains multiple fences, will be supported. A merged fence is signalled only when all of its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.
> > >>>>>>
> > >>>>>> 2.2. Modesetting
> > >>>>>>
> > >>>>>> Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.
> > >>>>>>
> > >>>>>> 3. New memory management
> > >>>>>>
> > >>>>>> The per-BO fences will be removed and the kernel will not know which buffers are busy. This will reduce CPU overhead and latency. The kernel will not need per-BO fences with explicit synchronization, so we just need to remove their last user: buffer evictions. It also resolves the current OOM deadlock.
> > >>>>>
> > >>>>> Is this even really possible? I'm no kernel MM expert (trying to learn some) but my understanding is that the use of per-BO dma-fence runs deep. I would like to stop using it for implicit synchronization to be sure, but I'm not sure I believe the claim that we can get rid of it entirely. Happy to see someone try, though.
> > >>>>>
> > >>>>>> 3.1. Evictions
> > >>>>>>
> > >>>>>> If the kernel wants to move a buffer, it will have to wait for everything to go idle, halt all userspace command submissions, move the buffer, and resume everything. This is not expected to happen when memory is not exhausted. Other more efficient ways of synchronization are also possible (e.g. sync only one process), but are not discussed here.
> > >>>>>>
> > >>>>>> 3.2. Per-process VRAM usage quota
> > >>>>>>
> > >>>>>> Each process can optionally and periodically query its VRAM usage quota and change the domains of its buffers to obey that quota. For example, a process allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1 GB. The process can change the domains of its least important buffers to GTT to get the best outcome for itself. If the process doesn't do it, the kernel will choose which buffers to evict at random. (thanks to Christian Koenig for this idea)
> > >>>>>
> > >>>>> This is going to be difficult. On Intel, we have some resources that have to be pinned to VRAM and can't be dynamically swapped out by the kernel. In GL, we can probably deal with it somewhat dynamically. In Vulkan, we'll be entirely dependent on the application to use the appropriate Vulkan memory budget APIs.
> > >>>>>
> > >>>>> --Jason
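For reference, the Vulkan side of that already exists as VK_EXT_memory_budget; a cooperative app would poll something like the following and demote its own allocations (the 90% threshold and the demote helper are made up):

   VkPhysicalDeviceMemoryBudgetPropertiesEXT budget = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT,
   };
   VkPhysicalDeviceMemoryProperties2 props = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2,
      .pNext = &budget,
   };
   vkGetPhysicalDeviceMemoryProperties2(physical_device, &props);

   for (uint32_t i = 0; i < props.memoryProperties.memoryHeapCount; i++) {
      /* If usage creeps up on the budget the kernel gave us, start moving
       * the least important allocations out of this heap ourselves. */
      if (budget.heapUsage[i] > budget.heapBudget[i] / 10 * 9)
         demote_least_important_allocations(i);
   }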
> > >>>>>> 3.3. Buffer destruction without per-BO fences
> > >>>>>>
> > >>>>>> When the buffer destroy ioctl is called, an optional fence list can be passed to the kernel to indicate when it's safe to deallocate the buffer. If the fence list is empty, the buffer will be deallocated immediately. Shared buffers will be handled by merging the fence lists from all processes that destroy them. Mitigation of malicious behavior:
> > >>>>>> - If userspace destroys a busy buffer, it will get a GPU page fault.
> > >>>>>> - If userspace sends fences that never signal, the kernel will have a timeout period and then will proceed to deallocate the buffer anyway.
> > >>>>>>
> > >>>>>> 3.4. Other notes on MM
> > >>>>>>
> > >>>>>> Overcommitment of GPU-accessible memory will cause an allocation failure or invoke the OOM killer. Evictions to GPU-inaccessible memory might not be supported.
> > >>>>>>
> > >>>>>> Kernel drivers could move to this new memory management today. Only buffer residency and evictions would stop using per-BO fences.
> > >>>>>>
> > >>>>>> 4. Deprecating implicit synchronization
> > >>>>>>
> > >>>>>> It can be phased out by introducing a new generation of hardware where the driver doesn't add support for it (like a driver fork would do), assuming userspace has all the changes for explicit synchronization. This could potentially create an isolated part of the kernel DRM where all drivers only support explicit synchronization.
> > >>>>>>
> > >>>>>> Marek
> > >>>
> > >>> --
> > >>> Daniel Vetter
> > >>> Software Engineer, Intel Corporation
> > >>> http://blog.ffwll.ch
> > >>>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel