Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

Niranjana Vishwanathapura <niranjana.vishwanathapura@xxxxxxxxx> · Wed, 8 Jun 2022 14:44:32 -0700

On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:

On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
 On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
 <niranjana.vishwanathapura@xxxxxxxxx> wrote:

   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
   >
   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
   >     <niranjana.vishwanathapura@xxxxxxxxx> wrote:
   >
   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew 
Brost wrote:
   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
   wrote:
   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
   >       >> > +VM_BIND/UNBIND ioctl will immediately start
   binding/unbinding
   >       the mapping in an
   >       >> > +async worker. The binding and unbinding will 
work like a
   special
   >       GPU engine.
   >       >> > +The binding and unbinding operations are serialized and
   will
   >       wait on specified
   >       >> > +input fences before the operation and will signal the
   output
   >       fences upon the
   >       >> > +completion of the operation. Due to serialization,
   completion of
   >       an operation
   >       >> > +will also indicate that all previous operations 
are also
   >       complete.
   >       >>
   >       >> I guess we should avoid saying "will immediately start
   >       binding/unbinding" if
   >       >> there are fences involved.
   >       >>
   >       >> And the fact that it's happening in an async 
worker seem to
   imply
   >       it's not
   >       >> immediate.
   >       >>
   >
   >       Ok, will fix.
   >       This was added because in earlier design binding was deferred
   until
   >       next execbuff.
   >       But now it is non-deferred (immediate in that sense). 
But yah,
   this is
   >       confusing
   >       and will fix it.
   >
   >       >>
   >       >> I have a question on the behavior of the bind 
operation when
   no
   >       input fence
   >       >> is provided. Let say I do :
   >       >>
   >       >> VM_BIND (out_fence=fence1)
   >       >>
   >       >> VM_BIND (out_fence=fence2)
   >       >>
   >       >> VM_BIND (out_fence=fence3)
   >       >>
   >       >>
   >       >> In what order are the fences going to be signaled?
   >       >>
   >       >> In the order of VM_BIND ioctls? Or out of order?
   >       >>
   >       >> Because you wrote "serialized I assume it's : in order
   >       >>
   >
   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that 
bind and
   unbind
   >       will use
   >       the same queue and hence are ordered.
   >
   >       >>
   >       >> One thing I didn't realize is that because we only get one
   >       "VM_BIND" engine,
   >       >> there is a disconnect from the Vulkan specification.
   >       >>
   >       >> In Vulkan VM_BIND operations are serialized but 
per engine.
   >       >>
   >       >> So you could have something like this :
   >       >>
   >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
   >       >>
   >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
   >       >>
   >       >>
   >       >> fence1 is not signaled
   >       >>
   >       >> fence3 is signaled
   >       >>
   >       >> So the second VM_BIND will proceed before the 
first VM_BIND.
   >       >>
   >       >>
   >       >> I guess we can deal with that scenario in 
userspace by doing
   the
   >       wait
   >       >> ourselves in one thread per engines.
   >       >>
   >       >> But then it makes the VM_BIND input fences useless.
   >       >>
   >       >>
   >       >> Daniel : what do you think? Should be rework this or just
   deal with
   >       wait
   >       >> fences in userspace?
   >       >>
   >       >
   >       >My opinion is rework this but make the ordering via 
an engine
   param
   >       optional.
   >       >
   >       >e.g. A VM can be configured so all binds are ordered 
within the
   VM
   >       >
   >       >e.g. A VM can be configured so all binds accept an engine
   argument
   >       (in
   >       >the case of the i915 likely this is a gem context 
handle) and
   binds
   >       >ordered with respect to that engine.
   >       >
   >       >This gives UMDs options as the later likely consumes 
more KMD
   >       resources
   >       >so if a different UMD can live with binds being 
ordered within
   the VM
   >       >they can use a mode consuming less resources.
   >       >
   >
   >       I think we need to be careful here if we are looking for some
   out of
   >       (submission) order completion of vm_bind/unbind.
   >       In-order completion means, in a batch of binds and 
unbinds to be
   >       completed in-order, user only needs to specify 
in-fence for the
   >       first bind/unbind call and the our-fence for the last
   bind/unbind
   >       call. Also, the VA released by an unbind call can be 
re-used by
   >       any subsequent bind call in that in-order batch.
   >
   >       These things will break if binding/unbinding were to 
be allowed
   to
   >       go out of order (of submission) and user need to be extra
   careful
   >       not to run into pre-mature triggereing of out-fence and bind
   failing
   >       as VA is still in use etc.
   >
   >       Also, VM_BIND binds the provided mapping on the specified
   address
   >       space
   >       (VM). So, the uapi is not engine/context specific.
   >
   >       We can however add a 'queue' to the uapi which can be 
one from
   the
   >       pre-defined queues,
   >       I915_VM_BIND_QUEUE_0
   >       I915_VM_BIND_QUEUE_1
   >       ...
   >       I915_VM_BIND_QUEUE_(N-1)
   >
   >       KMD will spawn an async work queue for each queue which will
   only
   >       bind the mappings on that queue in the order of submission.
   >       User can assign the queue to per engine or anything 
like that.
   >
   >       But again here, user need to be careful and not 
deadlock these
   >       queues with circular dependency of fences.
   >
   >       I prefer adding this later an as extension based on 
whether it
   >       is really helping with the implementation.
   >
   >     I can tell you right now that having everything on a single
   in-order
   >     queue will not get us the perf we want.  What vulkan 
really wants
   is one
   >     of two things:
   >      1. No implicit ordering of VM_BIND ops.  They just happen in
   whatever
   >     their dependencies are resolved and we ensure ordering 
ourselves
   by
   >     having a syncobj in the VkQueue.
   >      2. The ability to create multiple VM_BIND queues.  We need at
   least 2
   >     but I don't see why there needs to be a limit besides 
the limits
   the
   >     i915 API already has on the number of engines.  Vulkan could
   expose
   >     multiple sparse binding queues to the client if it's not
   arbitrarily
   >     limited.

   Thanks Jason, Lionel.

   Jason, what are you referring to when you say "limits the i915 API
   already
   has on the number of engines"? I am not sure if there is such an uapi
   today.

 There's a limit of something like 64 total engines today based on the
 number of bits we can cram into the exec flags in execbuffer2.  I think
 someone had an extended version that allowed more but I ripped it out
 because no one was using it.  Of course, execbuffer3 might not 
have that
 problem at all.

Thanks Jason.
Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
and somehow export it to user (I am thinking of embedding it in
I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
queues.

Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
will also have. So, we can simply define in vm_bind/unbind structures,

#define I915_VM_BIND_MAX_QUEUE   64
       __u32 queue;

I think that will keep things simple.

Hmmm? What does execbuf2 limit has to do with how many engines 
hardware can have? I suggest not to do that.

Change with added this:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

To context creation needs to be undone and so let users create engine 
maps with all hardware engines, and let execbuf3 access them all.

Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also.
Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
make it N+1).
But, as discussed in other thread of this RFC series, we are planning
to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
any uapi that limits the number of engines (and hence the vm_bind queues
need to be supported).

If we leave the number of vm_bind queues to be arbitrarily large
(__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
work_item and a linked list) lookup from the user specified queue index.
Other option is to just put some hard limit (say 64 or 65) and use
an array of queues in VM (each created upon first use). I prefer this.

Niranjana

Regards,

Tvrtko

Niranjana

   I am trying to see how many queues we need and don't want it to be
   arbitrarily
   large and unduely blow up memory usage and complexity in i915 driver.

 I expect a Vulkan driver to use at most 2 in the vast majority 
of cases. I
 could imagine a client wanting to create more than 1 sparse 
queue in which
 case, it'll be N+1 but that's unlikely.  As far as complexity 
goes, once
 you allow two, I don't think the complexity is going up by 
allowing N.  As
 for memory usage, creating more queues means more memory.  That's a
 trade-off that userspace can make.  Again, the expected number 
here is 1
 or 2 in the vast majority of cases so I don't think you need to worry.

Ok, will start with n=3 meaning 8 queues.
That would require us create 8 workqueues.
We can change 'n' later if required.

Niranjana

   >     Why?  Because Vulkan has two basic kind of bind 
operations and we
   don't
   >     want any dependencies between them:
   >      1. Immediate.  These happen right after BO creation or 
maybe as
   part of
   >     vkBindImageMemory() or VkBindBufferMemory().  These 
don't happen
   on a
   >     queue and we don't want them serialized with anything.  To
   synchronize
   >     with submit, we'll have a syncobj in the VkDevice which is
   signaled by
   >     all immediate bind operations and make submits wait on it.
   >      2. Queued (sparse): These happen on a VkQueue which may be the
   same as
   >     a render/compute queue or may be its own queue.  It's up to us
   what we
   >     want to advertise.  From the Vulkan API PoV, this is like any
   other
   >     queue.  Operations on it wait on and signal semaphores.  If we
   have a
   >     VM_BIND engine, we'd provide syncobjs to wait and 
signal just like
   we do
   >     in execbuf().
   >     The important thing is that we don't want one type of 
operation to
   block
   >     on the other.  If immediate binds are blocking on sparse binds,
   it's
   >     going to cause over-synchronization issues.
   >     In terms of the internal implementation, I know that 
there's going
   to be
   >     a lock on the VM and that we can't actually do these things in
   >     parallel.  That's fine.  Once the dma_fences have signaled and
   we're

   Thats correct. It is like a single VM_BIND engine with 
multiple queues
   feeding to it.

 Right.  As long as the queues themselves are independent and 
can block on
 dma_fences without holding up other queues, I think we're fine.

   >     unblocked to do the bind operation, I don't care if 
there's a bit
   of
   >     synchronization due to locking.  That's expected.  What 
we can't
   afford
   >     to have is an immediate bind operation suddenly blocking on a
   sparse
   >     operation which is blocked on a compute job that's going to run
   for
   >     another 5ms.

   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
   VM_BIND
   on other VMs. I am not sure about usecases here, but just wanted to
   clarify.

 Yes, that's what I would expect.
 --Jason

   Niranjana

   >     For reference, Windows solves this by allowing arbitrarily many
   paging
   >     queues (what they call a VM_BIND engine/queue).  That 
design works
   >     pretty well and solves the problems in question.  
Again, we could
   just
   >     make everything out-of-order and require using syncobjs 
to order
   things
   >     as userspace wants. That'd be fine too.
   >     One more note while I'm here: danvet said something on 
IRC about
   VM_BIND
   >     queues waiting for syncobjs to materialize.  We don't really
   want/need
   >     this.  We already have all the machinery in userspace to handle
   >     wait-before-signal and waiting for syncobj fences to 
materialize
   and
   >     that machinery is on by default.  It would actually 
take MORE work
   in
   >     Mesa to turn it off and take advantage of the kernel 
being able to
   wait
   >     for syncobjs to materialize.  Also, getting that right is
   ridiculously
   >     hard and I really don't want to get it wrong in kernel 
space.     When we
   >     do memory fences, wait-before-signal will be a thing.  We don't
   need to
   >     try and make it a thing for syncobj.
   >     --Jason
   >
   >   Thanks Jason,
   >
   >   I missed the bit in the Vulkan spec that we're allowed to have a
   sparse
   >   queue that does not implement either graphics or compute 
operations
   :
   >
   >     "While some implementations may include
   VK_QUEUE_SPARSE_BINDING_BIT
   >     support in queue families that also include
   >
   >      graphics and compute support, other implementations may only
   expose a
   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
   >
   >      family."
   >
   >   So it can all be all a vm_bind engine that just does bind/unbind
   >   operations.
   >
   >   But yes we need another engine for the immediate/non-sparse
   operations.
   >
   >   -Lionel
   >
   >         >
   >       Daniel, any thoughts?
   >
   >       Niranjana
   >
   >       >Matt
   >       >
   >       >>
   >       >> Sorry I noticed this late.
   >       >>
   >       >>
   >       >> -Lionel
   >       >>
   >       >>