Re: [PATCH v3 00/14] drm: Add a driver for CSF-based Mali GPUs

Hi Faith,

Sorry for the late reply; I only got back to panthor very recently.

On Mon, 11 Dec 2023 12:18:04 -0600
Faith Ekstrand <faith.ekstrand@xxxxxxxxxxxxx> wrote:

> On Mon, 2023-12-11 at 09:52 +0100, Boris Brezillon wrote:
> > Hi,
> > 
> > On Sun, 10 Dec 2023 13:58:51 +0900
> > Tatsuyuki Ishi <ishitatsuyuki@xxxxxxxxx> wrote:
> >   
> > > > On Dec 5, 2023, at 2:32, Boris Brezillon
> > > > <boris.brezillon@xxxxxxxxxxxxx> wrote:
> > > > 
> > > > Hello,
> > > > 
> > > > This is the 3rd version of the kernel driver for Mali CSF-based
> > > > GPUs.
> > > > 
> > > > With all the DRM dependencies having been merged (drm-sched
> > > > single entity and drm_gpuvm), I thought now was a good time to
> > > > post a new version. Note that the iommu series we depend on [1]
> > > > has been merged recently. The only remaining dependency that
> > > > hasn't been merged yet is this rather trivial drm_gpuvm [2]
> > > > patch.
> > > > 
> > > > As with v2, I pushed a branch based on drm-misc-next and
> > > > containing all the dependencies that are not yet available in
> > > > drm-misc-next here [3], and another [4] containing extra patches
> > > > to have things working on rk3588. The CSF firmware binary can be
> > > > found here [5], and should be placed under
> > > > /lib/firmware/arm/mali/arch10.8/mali_csffw.bin.
> > > > 
> > > > The mesa MR adding v10 support on top of panthor is available
> > > > here [6].
> > > > 
> > > > Regarding the GPL2+MIT relicensing, Liviu did an audit and found
> > > > two more people that I didn't spot initially: Clément Péron for
> > > > the devfreq code, and Alexey Sheplyakov for some bits in
> > > > panthor_gpu.c. Both are Cc-ed on the relevant patches. The rest
> > > > of the code is either new, or covered by the Linaro, Arm and
> > > > Collabora acks.
> > > > 
> > > > And here is a non-exhaustive changelog; check each commit for a
> > > > detailed changelog.
> > > > 
> > > > v3:
> > > > - Quite a few changes at the MMU/sched level to fix some race
> > > >   conditions and deadlocks
> > > > - Addition of a sync-only VM_BIND operation (to support
> > > >   vkQueueBindSparse with zero commands).
> > > 
> > > Hi Boris,
> > > 
> > > Just wanted to point out that vkQueueBindSparse's semantics are
> > > rather different from vkQueueSubmit's when it comes to
> > > synchronization.  In short, vkQueueBindSparse does not operate on
> > > a particular timeline (aka scheduling queue / drm_sched_entity).
> > > The property of following a timeline order is known as “submission
> > > order” [1] in Vulkan, and applies to vkQueueSubmit only, not
> > > vkQueueBindSparse.
> > 
> > Hm, okay. I really thought the same ordering guarantees applied to
> > sparse binding queues too, as the spec [1] says:
> > 
> > "
> > Batches begin execution in the order they appear in pBindInfo, but
> > may complete out of order.
> > "  
> 
> Right. So this is something where the Vulkan spec isn't terribly clear
> and I think I need to go file a spec bug.  I'm fairly sure that the
> intent is that bind operations MAY complete out-of-order but are not
> required to complete out-of-order.  In other words, I'm fairly sure
> that implementations are allowed but not required to insert extra
> dependencies that force some amount of ordering.  We take advantage of
> this in Mesa today to properly implement vkQueueWaitIdle() on sparse
> binding queues.

Do I understand correctly that, for Mesa's generic
vk_common_QueueWaitIdle() implementation to work correctly, we not only
need to guarantee in-order submission, but also in-order completion on
the queue? That's no problem for panvk/panthor, because that's how it's
implemented anyway, but I didn't realize this constraint existed until
you mentioned it.
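
To illustrate for anyone following along: here is a minimal sketch of
why in-order completion matters, assuming a hypothetical wait-idle
helper (this is not the actual Mesa code; queue_syncobj and last_point
are made-up names) that only waits on the last point submitted on the
queue's internal timeline syncobj:

/*
 * Hypothetical sketch (not the real vk_common_QueueWaitIdle()): wait
 * for a queue to go idle by waiting on the last timeline point
 * submitted on an internal timeline syncobj.
 */
#include <stdint.h>
#include <xf86drm.h>

int queue_wait_idle(int drm_fd, uint32_t queue_syncobj,
                    uint64_t last_point)
{
        /*
         * Waiting only on the most recent point is sufficient *iff*
         * the kernel completes submissions in order: point N signaling
         * then implies points 1..N-1 have signaled too. With
         * out-of-order completion, earlier binds could still be in
         * flight when this wait returns.
         */
        return drmSyncobjTimelineWait(drm_fd, &queue_syncobj,
                                      &last_point, 1,
                                      INT64_MAX /* no timeout */,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL,
                                      NULL);
}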

> This is also a requirement of Windows WDDM2 as far as I can tell, so
> I'd be very surprised if we disallowed it in the spec.
> 
> That said, that's all very unclear and IDK where you'd get that from
> the spec.  I'm going to go file a spec bug so we can get this sorted
> out.
> 
> ~Faith
> 
> 
> > which means things are submitted in order inside a vkQueueBindSparse
> > context, so I was expecting the submission ordering guarantee to
> > apply across vkQueueBindSparse() calls on the same queue too. Just
> > want to mention that all kernel implementations I have seen so far
> > assume VM_BIND submissions on a given queue are serialized (that's
> > how drm_sched works, and Xe, Nouveau and Panthor are basing their
> > VM_BIND implementations on drm_sched).
> > 
> > Maybe Faith, or anyone deeply involved in the Vulkan specification,
> > can confirm that submission ordering guarantees are relaxed on
> > sparse binding queues.
> >   
> > > 
> > > Hence, an implementation that takes full advantage of Vulkan
> > > semantics would essentially have an unbounded number of VM_BIND
> > > contexts.
> > 
> > Uh, that's definitely not how I understood it initially...
> >   
> > > It would also not need sync-only VM_BIND submissions, assuming
> > > that drmSyncobjTransfer works.
> > 
> > Sure, if each vkQueueBindSparse() has its own timeline, an internal
> > timeline syncobj with a bunch of drmSyncobjTransfer() calls would do
> > the trick (it might require several ioctls to merge input syncobjs,
> > but that's probably not the end of the world).

Back to the kernel-side implementation. As Tatsuyuki pointed out, we
can always replace the sync-only VM_BIND (no actual operation on the
VM, just a bunch of wait/signal sync operations) with X calls to
drmSyncobjTransfer() and a mesa-driver-specific timeline syncobj that
is used to consolidate all the operations we have submitted so far. But
given that Nouveau supports this sync-only operation (at least that's
my understanding of the nouveau_uvmm.c code and of how VM_BIND is
currently used in NVK), I guess there are good reasons to support it
natively. Besides the potential reduction in the number of ioctl()
calls, and the fact that adding support for VM_BIND with zero VM ops is
trivial enough that it's probably worth doing to simplify the UMD
implementation, is there anything else I'm missing?
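
For reference, here is a rough sketch of the userspace fallback
discussed above, with hypothetical names (internal_syncobj,
internal_point, bind_syncobj, bind_point); it folds the out-syncobj of
each bind submission onto a driver-internal timeline so that a single
wait on the internal timeline covers everything submitted so far:

/*
 * Rough sketch of the userspace-only alternative to a sync-only
 * VM_BIND: after each bind submission, transfer its completion point
 * onto a driver-internal timeline syncobj. One call per submission,
 * hence the "X calls" mentioned above.
 */
#include <stdint.h>
#include <xf86drm.h>

int consolidate_bind_completion(int drm_fd,
                                uint32_t internal_syncobj,
                                uint64_t *internal_point,
                                uint32_t bind_syncobj,
                                uint64_t bind_point)
{
        /*
         * Copy the fence backing (bind_syncobj, bind_point) to the
         * next point on the internal timeline. If the source point
         * may not be materialized yet, the caller might first need to
         * wait for it with DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT.
         */
        return drmSyncobjTransfer(drm_fd,
                                  internal_syncobj, ++(*internal_point),
                                  bind_syncobj, bind_point,
                                  0 /* flags */);
}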

Regards,

Boris



