Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

Jason Ekstrand <jason@xxxxxxxxxxxxxx> · Tue, 7 Jun 2022 12:12:03 -0500

On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@xxxxxxxxx> wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:

>   On 02/06/2022 23:35, Jason Ekstrand wrote:

>

>     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura

>     <niranjana.vishwanathapura@xxxxxxxxx> wrote:

>

>       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:

>       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:

>       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:

>       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding

>       the mapping in an

>       >> > +async worker. The binding and unbinding will work like a special

>       GPU engine.

>       >> > +The binding and unbinding operations are serialized and will

>       wait on specified

>       >> > +input fences before the operation and will signal the output

>       fences upon the

>       >> > +completion of the operation. Due to serialization, completion of

>       an operation

>       >> > +will also indicate that all previous operations are also

>       complete.

>       >>

>       >> I guess we should avoid saying "will immediately start

>       binding/unbinding" if

>       >> there are fences involved.

>       >>

>       >> And the fact that it's happening in an async worker seem to imply

>       it's not

>       >> immediate.

>       >>

>

>       Ok, will fix.

>       This was added because in earlier design binding was deferred until

>       next execbuff.

>       But now it is non-deferred (immediate in that sense). But yah, this is

>       confusing

>       and will fix it.

>

>       >>

>       >> I have a question on the behavior of the bind operation when no

>       input fence

>       >> is provided. Let say I do :

>       >>

>       >> VM_BIND (out_fence=fence1)

>       >>

>       >> VM_BIND (out_fence=fence2)

>       >>

>       >> VM_BIND (out_fence=fence3)

>       >>

>       >>

>       >> In what order are the fences going to be signaled?

>       >>

>       >> In the order of VM_BIND ioctls? Or out of order?

>       >>

>       >> Because you wrote "serialized I assume it's : in order

>       >>

>

>       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind

>       will use

>       the same queue and hence are ordered.

>

>       >>

>       >> One thing I didn't realize is that because we only get one

>       "VM_BIND" engine,

>       >> there is a disconnect from the Vulkan specification.

>       >>

>       >> In Vulkan VM_BIND operations are serialized but per engine.

>       >>

>       >> So you could have something like this :

>       >>

>       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)

>       >>

>       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

>       >>

>       >>

>       >> fence1 is not signaled

>       >>

>       >> fence3 is signaled

>       >>

>       >> So the second VM_BIND will proceed before the first VM_BIND.

>       >>

>       >>

>       >> I guess we can deal with that scenario in userspace by doing the

>       wait

>       >> ourselves in one thread per engines.

>       >>

>       >> But then it makes the VM_BIND input fences useless.

>       >>

>       >>

>       >> Daniel : what do you think? Should be rework this or just deal with

>       wait

>       >> fences in userspace?

>       >>

>       >

>       >My opinion is rework this but make the ordering via an engine param

>       optional.

>       >

>       >e.g. A VM can be configured so all binds are ordered within the VM

>       >

>       >e.g. A VM can be configured so all binds accept an engine argument

>       (in

>       >the case of the i915 likely this is a gem context handle) and binds

>       >ordered with respect to that engine.

>       >

>       >This gives UMDs options as the later likely consumes more KMD

>       resources

>       >so if a different UMD can live with binds being ordered within the VM

>       >they can use a mode consuming less resources.

>       >

>

>       I think we need to be careful here if we are looking for some out of

>       (submission) order completion of vm_bind/unbind.

>       In-order completion means, in a batch of binds and unbinds to be

>       completed in-order, user only needs to specify in-fence for the

>       first bind/unbind call and the our-fence for the last bind/unbind

>       call. Also, the VA released by an unbind call can be re-used by

>       any subsequent bind call in that in-order batch.

>

>       These things will break if binding/unbinding were to be allowed to

>       go out of order (of submission) and user need to be extra careful

>       not to run into pre-mature triggereing of out-fence and bind failing

>       as VA is still in use etc.

>

>       Also, VM_BIND binds the provided mapping on the specified address

>       space

>       (VM). So, the uapi is not engine/context specific.

>

>       We can however add a 'queue' to the uapi which can be one from the

>       pre-defined queues,

>       I915_VM_BIND_QUEUE_0

>       I915_VM_BIND_QUEUE_1

>       ...

>       I915_VM_BIND_QUEUE_(N-1)

>

>       KMD will spawn an async work queue for each queue which will only

>       bind the mappings on that queue in the order of submission.

>       User can assign the queue to per engine or anything like that.

>

>       But again here, user need to be careful and not deadlock these

>       queues with circular dependency of fences.

>

>       I prefer adding this later an as extension based on whether it

>       is really helping with the implementation.

>

>     I can tell you right now that having everything on a single in-order

>     queue will not get us the perf we want.  What vulkan really wants is one

>     of two things:

>      1. No implicit ordering of VM_BIND ops.  They just happen in whatever

>     their dependencies are resolved and we ensure ordering ourselves by

>     having a syncobj in the VkQueue.

>      2. The ability to create multiple VM_BIND queues.  We need at least 2

>     but I don't see why there needs to be a limit besides the limits the

>     i915 API already has on the number of engines.  Vulkan could expose

>     multiple sparse binding queues to the client if it's not arbitrarily

>     limited.

Thanks Jason, Lionel.

Jason, what are you referring to when you say "limits the i915 API already

has on the number of engines"? I am not sure if there is such an uapi today.

There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2.  I think someone had an extended version that allowed more but I ripped it out because no one was using it.  Of course, execbuffer3 might not have that problem at all.

I am trying to see how many queues we need and don't want it to be arbitrarily

large and unduely blow up memory usage and complexity in i915 driver.

I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely.  As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N.  As for memory usage, creating more queues means more memory.  That's a trade-off that userspace can make.  Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.

>     Why?  Because Vulkan has two basic kind of bind operations and we don't

>     want any dependencies between them:

>      1. Immediate.  These happen right after BO creation or maybe as part of

>     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a

>     queue and we don't want them serialized with anything.  To synchronize

>     with submit, we'll have a syncobj in the VkDevice which is signaled by

>     all immediate bind operations and make submits wait on it.

>      2. Queued (sparse): These happen on a VkQueue which may be the same as

>     a render/compute queue or may be its own queue.  It's up to us what we

>     want to advertise.  From the Vulkan API PoV, this is like any other

>     queue.  Operations on it wait on and signal semaphores.  If we have a

>     VM_BIND engine, we'd provide syncobjs to wait and signal just like we do

>     in execbuf().

>     The important thing is that we don't want one type of operation to block

>     on the other.  If immediate binds are blocking on sparse binds, it's

>     going to cause over-synchronization issues.

>     In terms of the internal implementation, I know that there's going to be

>     a lock on the VM and that we can't actually do these things in

>     parallel.  That's fine.  Once the dma_fences have signaled and we're

Thats correct. It is like a single VM_BIND engine with multiple queues

feeding to it.

Right.  As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.

>     unblocked to do the bind operation, I don't care if there's a bit of

>     synchronization due to locking.  That's expected.  What we can't afford

>     to have is an immediate bind operation suddenly blocking on a sparse

>     operation which is blocked on a compute job that's going to run for

>     another 5ms.

As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND

on other VMs. I am not sure about usecases here, but just wanted to clarify.

Yes, that's what I would expect.

--Jason

Niranjana

>     For reference, Windows solves this by allowing arbitrarily many paging

>     queues (what they call a VM_BIND engine/queue).  That design works

>     pretty well and solves the problems in question.  Again, we could just

>     make everything out-of-order and require using syncobjs to order things

>     as userspace wants. That'd be fine too.

>     One more note while I'm here: danvet said something on IRC about VM_BIND

>     queues waiting for syncobjs to materialize.  We don't really want/need

>     this.  We already have all the machinery in userspace to handle

>     wait-before-signal and waiting for syncobj fences to materialize and

>     that machinery is on by default.  It would actually take MORE work in

>     Mesa to turn it off and take advantage of the kernel being able to wait

>     for syncobjs to materialize.  Also, getting that right is ridiculously

>     hard and I really don't want to get it wrong in kernel space.  When we

>     do memory fences, wait-before-signal will be a thing.  We don't need to

>     try and make it a thing for syncobj.

>     --Jason

>

>   Thanks Jason,

>

>   I missed the bit in the Vulkan spec that we're allowed to have a sparse

>   queue that does not implement either graphics or compute operations :

>

>     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT

>     support in queue families that also include

>

>      graphics and compute support, other implementations may only expose a

>     VK_QUEUE_SPARSE_BINDING_BIT-only queue

>

>      family."

>

>   So it can all be all a vm_bind engine that just does bind/unbind

>   operations.

>

>   But yes we need another engine for the immediate/non-sparse operations.

>

>   -Lionel

>

>      

>

>       Daniel, any thoughts?

>

>       Niranjana

>

>       >Matt

>       >

>       >>

>       >> Sorry I noticed this late.

>       >>

>       >>

>       >> -Lionel

>       >>

>       >>