On Wed, Sep 04, 2024 at 09:49:44AM +0200, Christian König wrote:
> Am 03.09.24 um 23:11 schrieb Simona Vetter:
> > On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
> > > Hi Steven,
> > >
> > > Am 29.08.24 um 15:37 schrieb Steven Price:
> > > > Hi Christian,
> > > >
> > > > Mihail should be able to give more definitive answers, but I think I can answer your questions.
> > > >
> > > > On 29/08/2024 10:40, Christian König wrote:
> > > > > Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> > > > > > Hello all,
> > > > > >
> > > > > > This series implements a mechanism to expose Mali CSF GPUs' queue ringbuffers directly to userspace, along with paraphernalia to allow userspace to control job synchronisation between the CPU and GPU.
> > > > > >
> > > > > > The goal of these changes is to allow userspace to control work submission to the FW/HW directly without kernel intervention in the common case, thereby reducing context switching overhead. It also allows for greater flexibility in the way work is enqueued in the ringbufs. For example, the current kernel submit path only supports indirect calls, which is inefficient for small command buffers. Userspace can also skip unnecessary sync operations.
> > > > > Question is how do you guarantee forward progress for fence signaling?
> > > > A timeout. Although looking at it I think it's probably set too high currently:
> > > >
> > > > > +#define JOB_TIMEOUT_MS 5000
> > > >
> > > > But basically the XGS queue is a DRM scheduler just like a normal GPU queue and the jobs have a timeout. If the timeout is hit then any fences will be signalled (with an error).
> > > Mhm, that is unfortunately exactly what I feared.
> > >
> > > > > E.g. when are fences created and published? How do they signal?
> > > > >
> > > > > How are dependencies handled? How can the kernel suspend a userspace queue?
> > > > The actual userspace queue can be suspended. This is actually a combination of firmware and kernel driver, and this functionality is already present without user submission. The firmware will multiplex multiple 'groups' onto the hardware, and if there are too many for the firmware then the kernel multiplexes the extra groups onto the ones the firmware supports.
> > > How do you guarantee forward progress, and that resuming suspended queues doesn't end up in a circular dependency?
> > >
> > > > I haven't studied Mihail's series in detail yet, but if I understand correctly, the XGS queues are handled separately and are not suspended when the hardware queues are suspended. I guess this might be an area for improvement and might explain the currently very high timeout (to deal with the case where the actual GPU work has been suspended).
> > > >
> > > > > How does memory management work in this case?
> > > > I'm not entirely sure what you mean here. If you are referring to the potential memory issues with the signalling path then this should be handled by the timeout - although I haven't studied the code to check for bugs here.
> > > You might have misunderstood my question (and I might misunderstand the code), but on first glance it strongly sounds like the current approach will be NAKed.
> > > > The actual new XGS queues don't allocate/free memory during queue execution - so it's just the memory usage related to fences (and the other work which could be blocked on the fence).
> > > But the kernel and the hardware could suspend the queues, right?
> > >
> > > > In terms of memory management for the GPU work itself, this is handled the same as before. The VM_BIND mechanism allows dependencies to be created between syncobjs and VM operations; with XGS these can then be tied to GPU HW events.
> > > I don't know the details, but that again strongly sounds like it won't work.
> > >
> > > What you need is to somehow guarantee that work doesn't run into memory management deadlocks which are resolved by timeouts.
> > >
> > > Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > panthor doesn't yet have a shrinker, so all memory is pinned, which means memory management easy mode.
> Ok, that at least makes things work for the moment.
>
> > But also this means there might be an uapi design bug in here, and we really don't want to commit to that. My stance is that panthor should gain a proper shrinker first, which means there will be some changes here too. And then we can properly evaluate this. As-is it's a bit too much on the toy end of things.
> I wouldn't say toy end, it looks rather fleshed out to me.
>
> But I agree that the people who design the UAPI need to be aware of the restrictions.

The thing is, if you have pinned memory there are no restrictions on userspace sync: you can do whatever you want. And then you can still bake that into a dma_fence and it will not deadlock (as long as there's a timeout and a forward progress guarantee), because it cannot ever feed back into a memory management fence - there are none. If we merge that uapi we're forever condemned to pinning all memory, which I don't think is good, because sooner or later every serious gpu driver seems to gain some shrinker support.

> > That aside, I've thought this all through with tons of people, and I do think it's all possible.
> It's certainly possible, we have user queue patches for amdgpu in the pipeline as well.
>
> It's just really, really, really hard to get right without creating some circular dependencies and deadlocks in between.
>
> If I got free beer every time somebody came up with a broken dma_fence design I would probably end up an alcoholic without spending a single penny.

Yeah, that's exactly my fear. It's possible, but even if you understand the abstract rules and implement the kernel correctly, it's really hard to get the userspace right: there are way too many broken ways, and just about barely one correct way, to implement vulkan winsys dma_fence with userspace submit.

This is why I called userspace direct submit without a shrinker a toy: it makes the userspace trivially easy, and trivially easy to get wrong, even if the kernel side looks all great and polished and solid.

If we already have a shrinker that exposes these issues, then userspace getting this wrong is a userspace bug. If we start out without a shrinker and userspace gets it wrong, then ever adding a shrinker is a kernel regression. And "you cannot ever add dynamic memory management to this driver" is a uapi regression situation I don't think we want to get into, because there's no way out.
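To make the "bake it into a dma_fence, with a timeout" point a bit more concrete, here's a rough sketch of the pattern. This is purely illustrative and not code from Mihail's series; all the xgs_wait_* names, the polling, and the 5000ms number (mirroring the series' JOB_TIMEOUT_MS) are made up. The idea: a kernel worker watches a pinned, userspace-visible seqno and signals a dma_fence once it crosses a threshold, or signals it with -ETIMEDOUT once a deadline expires. With everything pinned this can never feed back into memory management, which is the only reason it doesn't deadlock.

/*
 * Hypothetical sketch: turning a userspace memory fence (a seqno that
 * userspace or the GPU writes to pinned shared memory) into a dma_fence,
 * with the wait bounded by a timeout so the fence always signals.
 */
#include <linux/dma-fence.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/slab.h>
#include <linux/err.h>

#define XGS_WAIT_TIMEOUT_MS	5000	/* made-up, mirrors JOB_TIMEOUT_MS */
#define XGS_POLL_INTERVAL_MS	10

struct xgs_wait {
	struct dma_fence base;
	spinlock_t lock;
	struct delayed_work work;
	u64 *seqno_cpu;		/* kernel mapping of the pinned seqno page */
	u64 threshold;		/* "done" once *seqno_cpu >= threshold */
	unsigned long deadline;	/* jiffies after which we give up */
};

static const char *xgs_wait_driver_name(struct dma_fence *f)
{
	return "xgs-example";
}

static const char *xgs_wait_timeline_name(struct dma_fence *f)
{
	return "xgs-umf-wait";
}

static const struct dma_fence_ops xgs_wait_ops = {
	.get_driver_name = xgs_wait_driver_name,
	.get_timeline_name = xgs_wait_timeline_name,
};

static void xgs_wait_poll(struct work_struct *w)
{
	struct xgs_wait *wait =
		container_of(to_delayed_work(w), struct xgs_wait, work);

	if (READ_ONCE(*wait->seqno_cpu) >= wait->threshold) {
		dma_fence_signal(&wait->base);
		dma_fence_put(&wait->base);
		return;
	}

	if (time_after(jiffies, wait->deadline)) {
		/* Forward progress guarantee: never wait forever. */
		dma_fence_set_error(&wait->base, -ETIMEDOUT);
		dma_fence_signal(&wait->base);
		dma_fence_put(&wait->base);
		return;
	}

	schedule_delayed_work(&wait->work,
			      msecs_to_jiffies(XGS_POLL_INTERVAL_MS));
}

/*
 * Returns a dma_fence that signals once the userspace seqno reaches
 * @threshold, or with -ETIMEDOUT after XGS_WAIT_TIMEOUT_MS.
 */
struct dma_fence *xgs_wait_create(u64 *seqno_cpu, u64 threshold, u64 context)
{
	struct xgs_wait *wait = kzalloc(sizeof(*wait), GFP_KERNEL);

	if (!wait)
		return ERR_PTR(-ENOMEM);

	spin_lock_init(&wait->lock);
	wait->seqno_cpu = seqno_cpu;
	wait->threshold = threshold;
	wait->deadline = jiffies + msecs_to_jiffies(XGS_WAIT_TIMEOUT_MS);
	dma_fence_init(&wait->base, &xgs_wait_ops, &wait->lock,
		       context, threshold);
	INIT_DELAYED_WORK(&wait->work, xgs_wait_poll);

	dma_fence_get(&wait->base);	/* reference held by the worker */
	schedule_delayed_work(&wait->work, 0);

	return &wait->base;
}

A real driver would presumably use an interrupt or doorbell instead of polling, but the structure is the same: the dma_fence contract only holds because the wait is bounded. The moment a shrinker shows up, handing out fences like this becomes the indefinite-fence problem from the link above, unless the winsys guarantees discussed below are in place.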
i915-gem folks managed to pull that trick off too, and Dave and I were left with no other choice than to flat out revert that uapi and break userspace. That is not an experience I want to repeat.

So here's what I think should be done, pretty much a carbon copy of the xe plan:

- Prerequisite: Have a shrinker or otherwise dynamic memory management, or you just won't hit all the interesting bugs.

- Implement support for long-running contexts that do not have dma_fence end-of-batch semantics, using the preempt-context fence trick (rough sketch appended at the end of this mail). Since this would be the 3rd driver after amdkfd and xe, I think it's time for a little helper, which would be very little code and lots of documentation explaining the concept. But if that helper ties in with drm_exec and drm_gpuvm then I think it would still be really good as a tiny little library itself and not just as a good place to put documentation. Extracting that from xe_lr.c should be fairly simple; amdkfd could also be used as inspiration, at least.

- With this you can start using userspace memory fences, as long as you never need to bake them into a dma_fence or syncobj that actually waits (as opposed to waiting until rendering finishes and then handing over an already signalled dma_fence, or something like that).

- Implement userspace submit with this infrastructure.

- Implement support for baking dma_fences out of all this, relying on the vulkan winsys fence guarantee that all the indefinite fences are resolved and queued up, or the application will get a VK_ERROR_DEVICE_LOST thrown its way. This is really hard, and I don't think anyone's made it happen yet in a real prototype with a full-fledged userspace memory fences vulkan driver.

- Bonus: add memory fences to drm_syncobj as future fences, for sharing. Could be done any time after you have long-running context support.

Alternative plan:

- Implement userspace submit, but _every_ job still goes through a kernel drm_sched submission and the context doesn't run without a job. That way you can drive userspace-submit-only hardware, but with the old end-of-batch dma_fence kernel submission model, plus the userspace submit thread for future fences that haven't materialized yet in a drm_syncobj.

This might be a good interim stop-gap for gl/vulkan winsys drivers, but people seem a lot more interested in going full userspace submit with userspace memory fences for compute use cases. But you could do both in parallel.

Unless you absolutely know what you're doing, you cannot use userspace memory fences with this. I think the only cases I've seen are where you submit jobs on multiple engines as an atomic unit (like intel media workloads split across engines), or one after the other where sync is guaranteed to go only one way (like the run-ahead cs job radeonsi iirc has).

tldr; expect pain, I'm sorry.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
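For illustration, here's a very rough sketch of the preempt-context fence trick from the plan above, loosely modelled on what xe_preempt_fence.c and amdkfd's eviction fences do. Everything prefixed lr_* is hypothetical, and lr_context_suspend() stands in for whatever firmware/driver call actually takes the context off the hardware; treat it as a concept sketch, not a drop-in implementation.

/*
 * Hypothetical preempt-context fence for a long-running context: as long
 * as nobody waits, the context keeps running; the first waiter triggers
 * preemption, and the fence only signals once the context is off the HW.
 */
#include <linux/dma-fence.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/err.h>

struct lr_context;					/* driver's long-running context */
int lr_context_suspend(struct lr_context *ctx);		/* blocks until off the HW */

struct lr_preempt_fence {
	struct dma_fence base;
	spinlock_t lock;
	struct work_struct suspend_work;
	struct lr_context *ctx;
};

static const char *lr_pf_driver_name(struct dma_fence *f)
{
	return "lr-example";
}

static const char *lr_pf_timeline_name(struct dma_fence *f)
{
	return "preempt";
}

static void lr_pf_suspend_work(struct work_struct *work)
{
	struct lr_preempt_fence *pf =
		container_of(work, struct lr_preempt_fence, suspend_work);
	int err;

	/*
	 * Take the context off the hardware; only then is it safe for
	 * memory management to unmap or move the buffers it was using.
	 */
	err = lr_context_suspend(pf->ctx);
	if (err)
		dma_fence_set_error(&pf->base, err);
	dma_fence_signal(&pf->base);
	dma_fence_put(&pf->base);
}

/*
 * Called on the first wait (e.g. eviction code that found this fence in
 * a dma_resv): kick off preemption instead of waiting for a batch to end.
 */
static bool lr_pf_enable_signaling(struct dma_fence *f)
{
	struct lr_preempt_fence *pf =
		container_of(f, struct lr_preempt_fence, base);

	dma_fence_get(&pf->base);	/* dropped by the worker */
	queue_work(system_unbound_wq, &pf->suspend_work);
	return true;
}

static const struct dma_fence_ops lr_preempt_fence_ops = {
	.get_driver_name = lr_pf_driver_name,
	.get_timeline_name = lr_pf_timeline_name,
	.enable_signaling = lr_pf_enable_signaling,
};

struct dma_fence *lr_preempt_fence_create(struct lr_context *ctx, u64 fctx)
{
	struct lr_preempt_fence *pf = kzalloc(sizeof(*pf), GFP_KERNEL);

	if (!pf)
		return ERR_PTR(-ENOMEM);

	pf->ctx = ctx;
	spin_lock_init(&pf->lock);
	INIT_WORK(&pf->suspend_work, lr_pf_suspend_work);
	dma_fence_init(&pf->base, &lr_preempt_fence_ops, &pf->lock, fctx, 0);

	return &pf->base;
}

The idea is that a fence like this sits in the dma_resv of every BO mapped by the long-running VM, replacing per-job end-of-batch fences, and gets swapped for a fresh one when the context is resumed. If nobody needs the memory, signaling is never enabled and the context just keeps running; when the shrinker or an eviction does need to wait, enabling signaling preempts the context first, so the fence signals in bounded time without depending on unbounded userspace work.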