Re: [RFC PATCH 00/10] drm/panthor: Add user submission

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Wed, 4 Sep 2024 13:23:08 +0200

On Wed, 4 Sep 2024 10:31:36 +0100
Steven Price <steven.price@xxxxxxx> wrote:

> On 04/09/2024 08:49, Christian König wrote:
> > Am 03.09.24 um 23:11 schrieb Simona Vetter:  
> >> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:  
> >>> Hi Steven,
> >>>
> >>> Am 29.08.24 um 15:37 schrieb Steven Price:  
> >>>> Hi Christian,
> >>>>
> >>>> Mihail should be able to give more definitive answers, but I think I
> >>>> can
> >>>> answer your questions.
> >>>>
> >>>> On 29/08/2024 10:40, Christian König wrote:  
> >>>>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:  
> >>>>>> Hello all,
> >>>>>>
> >>>>>> This series implements a mechanism to expose Mali CSF GPUs' queue
> >>>>>> ringbuffers directly to userspace, along with paraphernalia to allow
> >>>>>> userspace to control job synchronisation between the CPU and GPU.
> >>>>>>
> >>>>>> The goal of these changes is to allow userspace to control work
> >>>>>> submission to the FW/HW directly without kernel intervention in the
> >>>>>> common case, thereby reducing context switching overhead. It also
> >>>>>> allows
> >>>>>> for greater flexibility in the way work is enqueued in the ringbufs.
> >>>>>> For example, the current kernel submit path only supports indirect
> >>>>>> calls, which is inefficient for small command buffers. Userspace can
> >>>>>> also skip unnecessary sync operations.  
> >>>>> Question is how do you guarantee forward progress for fence signaling?  
> >>>> A timeout. Although looking at it I think it's probably set too high
> >>>> currently:
> >>>>  
> >>>>> +#define JOB_TIMEOUT_MS                5000  
> >>>> But basically the XGS queue is a DRM scheduler just like a normal GPU
> >>>> queue and the jobs have a timeout. If the timeout is hit then any
> >>>> fences
> >>>> will be signalled (with an error).  
> >>> Mhm, that is unfortunately exactly what I feared.
> >>>  
> >>>>> E.g. when are fences created and published? How do they signal?
> >>>>>
> >>>>> How are dependencies handled? How can the kernel suspend an userspace
> >>>>> queue?  
> >>>> The actual userspace queue can be suspended. This is actually a
> >>>> combination of firmware and kernel driver, and this functionality is
> >>>> already present without the user submission. The firmware will
> >>>> multiplex
> >>>> multiple 'groups' onto the hardware, and if there are too many for the
> >>>> firmware then the kernel multiplexes the extra groups onto the ones the
> >>>> firmware supports.  
> >>> How do you guarantee forward progress and that resuming of suspended
> >>> queues
> >>> doesn't end up in a circle dependency?  
> 
> I'm not entirely sure what you mean by "guarantee" here - the kernel by
> itself only guarantees forward progress by the means of timeouts. User
> space can 'easily' shoot itself in the foot by using a XGS queue to
> block waiting on a GPU event which will never happen.
> 
> However dependencies between applications (and/or other device drivers)
> will only occur via dma fences and an unsignalled fence will only be
> returned when there is a path forward to signal it. So it shouldn't be
> possible to create a dependency loop between contexts (or command stream
> groups to use the Mali jargon).
> 
> Because the groups can't have dependency cycles it should be possible to
> suspend/resume them without deadlocks.
> 
> >>>> I haven't studied Mihail's series in detail yet, but if I understand
> >>>> correctly, the XGS queues are handled separately and are not suspended
> >>>> when the hardware queues are suspended. I guess this might be an area
> >>>> for improvement and might explain the currently very high timeout (to
> >>>> deal with the case where the actual GPU work has been suspended).
> >>>>  
> >>>>> How does memory management work in this case?  
> >>>> I'm not entirely sure what you mean here. If you are referring to the
> >>>> potential memory issues with signalling path then this should be
> >>>> handled
> >>>> by the timeout - although I haven't studied the code to check for
> >>>> bugs here.  
> >>> You might have misunderstood my question (and I might misunderstand the
> >>> code), but on first glance it strongly sounds like the current
> >>> approach will
> >>> be NAKed.
> >>>  
> >>>> The actual new XGS queues don't allocate/free memory during the queue
> >>>> execution - so it's just the memory usage related to fences (and the
> >>>> other work which could be blocked on the fence).  
> >>> But the kernel and the hardware could suspend the queues, right?
> >>>  
> >>>> In terms of memory management for the GPU work itself, this is handled
> >>>> the same as before. The VM_BIND mechanism allows dependencies to be
> >>>> created between syncobjs and VM operations, with XGS these can then be
> >>>> tied to GPU HW events.  
> >>> I don't know the details, but that again strongly sounds like that won't
> >>> work.
> >>>
> >>> What you need is to somehow guarantee that work doesn't run into memory
> >>> management deadlocks which are resolved by timeouts.
> >>>
> >>> Please read up here on why that stuff isn't allowed:
> >>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences  
> >> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >> memory management easy mode.  
> > 
> > Ok, that at least makes things work for the moment.  
> 
> Ah, perhaps this should have been spelt out more clearly ;)
> 
> The VM_BIND mechanism that's already in place jumps through some hoops
> to ensure that memory is preallocated when the memory operations are
> enqueued. So any memory required should have been allocated before any
> sync object is returned. We're aware of the issue with memory
> allocations on the signalling path and trying to ensure that we don't
> have that.
> 
> I'm hoping that we don't need a shrinker which deals with (active) GPU
> memory with our design.

That's actually what we were planning to do: the panthor shrinker was
about to rely on fences attached to GEM objects to know if it can
reclaim the memory. This design relies on each job attaching its fence
to the GEM mapped to the VM at the time the job is submitted, such that
memory that's in-use or about-to-be-used doesn't vanish before the GPU
is done.

> Memory which user space thinks the GPU might
> need should be pinned before the GPU work is submitted. APIs which
> require any form of 'paging in' of data would need to be implemented by
> the GPU work completing and being resubmitted by user space after the
> memory changes (i.e. there could be a DMA fence pending on the GPU work).

Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
that means we can't really transparently swap out GPU memory, or we
have to constantly pin/unpin around each job, which means even more
ioctl()s than we have now. Another option would be to add the XGS fence
to the BOs attached to the VM, assuming it's created before the job
submission itself, but you're no longer reducing the number of user <->
kernel round trips if you do that, because you now have to create an
XSG job for each submission, so you basically get back to one ioctl()
per submission.