Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

Lionel Landwerlin <lionel.g.landwerlin@xxxxxxxxx> · Thu, 9 Jun 2022 17:49:09 +0300



    On 09/06/2022 00:55, Jason Ekstrand
      wrote:

    
          On Wed, Jun 8, 2022 at 4:44
            PM Niranjana Vishwanathapura <niranjana.vishwanathapura@xxxxxxxxx>
            wrote:

          
          On Wed, Jun 08, 2022 at
            08:33:25AM +0100, Tvrtko Ursulin wrote:

            >

            >

            >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:

            >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
            Vishwanathapura wrote:

            >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
            Ekstrand wrote:

            >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
            Vishwanathapura

            >>>> <niranjana.vishwanathapura@xxxxxxxxx>
            wrote:

            >>>>

            >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300,
            Lionel Landwerlin wrote:

            >>>>   >   On 02/06/2022 23:35, Jason
            Ekstrand wrote:

            >>>>   >

            >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM
            Niranjana Vishwanathapura

            >>>>   >     <niranjana.vishwanathapura@xxxxxxxxx>
            wrote:

            >>>>   >

            >>>>   >       On Wed, Jun 01, 2022 at
            01:28:36PM -0700, Matthew 

            >>>>Brost wrote:

            >>>>   >       >On Wed, Jun 01, 2022 at
            05:25:49PM +0300, Lionel Landwerlin

            >>>>   wrote:

            >>>>   >       >> On 17/05/2022 21:32,
            Niranjana Vishwanathapura wrote:

            >>>>   >       >> > +VM_BIND/UNBIND
            ioctl will immediately start

            >>>>   binding/unbinding

            >>>>   >       the mapping in an

            >>>>   >       >> > +async worker.
            The binding and unbinding will 

            >>>>work like a

            >>>>   special

            >>>>   >       GPU engine.

            >>>>   >       >> > +The binding and
            unbinding operations are serialized and

            >>>>   will

            >>>>   >       wait on specified

            >>>>   >       >> > +input fences
            before the operation and will signal the

            >>>>   output

            >>>>   >       fences upon the

            >>>>   >       >> > +completion of
            the operation. Due to serialization,

            >>>>   completion of

            >>>>   >       an operation

            >>>>   >       >> > +will also
            indicate that all previous operations 

            >>>>are also

            >>>>   >       complete.

            >>>>   >       >>

            >>>>   >       >> I guess we should
            avoid saying "will immediately start

            >>>>   >       binding/unbinding" if

            >>>>   >       >> there are fences
            involved.

            >>>>   >       >>

            >>>>   >       >> And the fact that
            it's happening in an async 

            >>>>worker seem to

            >>>>   imply

            >>>>   >       it's not

            >>>>   >       >> immediate.

            >>>>   >       >>

            >>>>   >

            >>>>   >       Ok, will fix.

            >>>>   >       This was added because in
            earlier design binding was deferred

            >>>>   until

            >>>>   >       next execbuff.

            >>>>   >       But now it is non-deferred
            (immediate in that sense). 

            >>>>But yah,

            >>>>   this is

            >>>>   >       confusing

            >>>>   >       and will fix it.

            >>>>   >

            >>>>   >       >>

            >>>>   >       >> I have a question on
            the behavior of the bind 

            >>>>operation when

            >>>>   no

            >>>>   >       input fence

            >>>>   >       >> is provided. Let say
            I do :

            >>>>   >       >>

            >>>>   >       >> VM_BIND
            (out_fence=fence1)

            >>>>   >       >>

            >>>>   >       >> VM_BIND
            (out_fence=fence2)

            >>>>   >       >>

            >>>>   >       >> VM_BIND
            (out_fence=fence3)

            >>>>   >       >>

            >>>>   >       >>

            >>>>   >       >> In what order are the
            fences going to be signaled?

            >>>>   >       >>

            >>>>   >       >> In the order of
            VM_BIND ioctls? Or out of order?

            >>>>   >       >>

            >>>>   >       >> Because you wrote
            "serialized I assume it's : in order

            >>>>   >       >>

            >>>>   >

            >>>>   >       Yes, in the order of
            VM_BIND/UNBIND ioctls. Note that 

            >>>>bind and

            >>>>   unbind

            >>>>   >       will use

            >>>>   >       the same queue and hence are
            ordered.

            >>>>   >

            >>>>   >       >>

            >>>>   >       >> One thing I didn't
            realize is that because we only get one

            >>>>   >       "VM_BIND" engine,

            >>>>   >       >> there is a disconnect
            from the Vulkan specification.

            >>>>   >       >>

            >>>>   >       >> In Vulkan VM_BIND
            operations are serialized but 

            >>>>per engine.

            >>>>   >       >>

            >>>>   >       >> So you could have
            something like this :

            >>>>   >       >>

            >>>>   >       >> VM_BIND (engine=rcs0,
            in_fence=fence1, out_fence=fence2)

            >>>>   >       >>

            >>>>   >       >> VM_BIND (engine=ccs0,
            in_fence=fence3, out_fence=fence4)

            >>>>   >       >>

            >>>>   >       >>

            >>>>   >       >> fence1 is not
            signaled

            >>>>   >       >>

            >>>>   >       >> fence3 is signaled

            >>>>   >       >>

            >>>>   >       >> So the second VM_BIND
            will proceed before the 

            >>>>first VM_BIND.

            >>>>   >       >>

            >>>>   >       >>

            >>>>   >       >> I guess we can deal
            with that scenario in 

            >>>>userspace by doing

            >>>>   the

            >>>>   >       wait

            >>>>   >       >> ourselves in one
            thread per engines.

            >>>>   >       >>

            >>>>   >       >> But then it makes the
            VM_BIND input fences useless.

            >>>>   >       >>

            >>>>   >       >>

            >>>>   >       >> Daniel : what do you
            think? Should be rework this or just

            >>>>   deal with

            >>>>   >       wait

            >>>>   >       >> fences in userspace?

            >>>>   >       >>

            >>>>   >       >

            >>>>   >       >My opinion is rework this
            but make the ordering via 

            >>>>an engine

            >>>>   param

            >>>>   >       optional.

            >>>>   >       >

            >>>>   >       >e.g. A VM can be
            configured so all binds are ordered 

            >>>>within the

            >>>>   VM

            >>>>   >       >

            >>>>   >       >e.g. A VM can be
            configured so all binds accept an engine

            >>>>   argument

            >>>>   >       (in

            >>>>   >       >the case of the i915
            likely this is a gem context 

            >>>>handle) and

            >>>>   binds

            >>>>   >       >ordered with respect to
            that engine.

            >>>>   >       >

            >>>>   >       >This gives UMDs options as
            the later likely consumes 

            >>>>more KMD

            >>>>   >       resources

            >>>>   >       >so if a different UMD can
            live with binds being 

            >>>>ordered within

            >>>>   the VM

            >>>>   >       >they can use a mode
            consuming less resources.

            >>>>   >       >

            >>>>   >

            >>>>   >       I think we need to be careful
            here if we are looking for some

            >>>>   out of

            >>>>   >       (submission) order completion
            of vm_bind/unbind.

            >>>>   >       In-order completion means, in
            a batch of binds and 

            >>>>unbinds to be

            >>>>   >       completed in-order, user only
            needs to specify 

            >>>>in-fence for the

            >>>>   >       first bind/unbind call and the
            our-fence for the last

            >>>>   bind/unbind

            >>>>   >       call. Also, the VA released by
            an unbind call can be 

            >>>>re-used by

            >>>>   >       any subsequent bind call in
            that in-order batch.

            >>>>   >

            >>>>   >       These things will break if
            binding/unbinding were to 

            >>>>be allowed

            >>>>   to

            >>>>   >       go out of order (of
            submission) and user need to be extra

            >>>>   careful

            >>>>   >       not to run into pre-mature
            triggereing of out-fence and bind

            >>>>   failing

            >>>>   >       as VA is still in use etc.

            >>>>   >

            >>>>   >       Also, VM_BIND binds the
            provided mapping on the specified

            >>>>   address

            >>>>   >       space

            >>>>   >       (VM). So, the uapi is not
            engine/context specific.

            >>>>   >

            >>>>   >       We can however add a 'queue'
            to the uapi which can be 

            >>>>one from

            >>>>   the

            >>>>   >       pre-defined queues,

            >>>>   >       I915_VM_BIND_QUEUE_0

            >>>>   >       I915_VM_BIND_QUEUE_1

            >>>>   >       ...

            >>>>   >       I915_VM_BIND_QUEUE_(N-1)

            >>>>   >

            >>>>   >       KMD will spawn an async work
            queue for each queue which will

            >>>>   only

            >>>>   >       bind the mappings on that
            queue in the order of submission.

            >>>>   >       User can assign the queue to
            per engine or anything 

            >>>>like that.

            >>>>   >

            >>>>   >       But again here, user need to
            be careful and not 

            >>>>deadlock these

            >>>>   >       queues with circular
            dependency of fences.

            >>>>   >

            >>>>   >       I prefer adding this later an
            as extension based on 

            >>>>whether it

            >>>>   >       is really helping with the
            implementation.

            >>>>   >

            >>>>   >     I can tell you right now that
            having everything on a single

            >>>>   in-order

            >>>>   >     queue will not get us the perf
            we want.  What vulkan 

            >>>>really wants

            >>>>   is one

            >>>>   >     of two things:

            >>>>   >      1. No implicit ordering of
            VM_BIND ops.  They just happen in

            >>>>   whatever

            >>>>   >     their dependencies are resolved
            and we ensure ordering 

            >>>>ourselves

            >>>>   by

            >>>>   >     having a syncobj in the VkQueue.

            >>>>   >      2. The ability to create
            multiple VM_BIND queues.  We need at

            >>>>   least 2

            >>>>   >     but I don't see why there needs
            to be a limit besides 

            >>>>the limits

            >>>>   the

            >>>>   >     i915 API already has on the
            number of engines.  Vulkan could

            >>>>   expose

            >>>>   >     multiple sparse binding queues
            to the client if it's not

            >>>>   arbitrarily

            >>>>   >     limited.

            >>>>

            >>>>   Thanks Jason, Lionel.

            >>>>

            >>>>   Jason, what are you referring to when you
            say "limits the i915 API

            >>>>   already

            >>>>   has on the number of engines"? I am not
            sure if there is such an uapi

            >>>>   today.

            >>>>

            >>>> There's a limit of something like 64 total
            engines today based on the

            >>>> number of bits we can cram into the exec
            flags in execbuffer2.  I think

            >>>> someone had an extended version that
            allowed more but I ripped it out

            >>>> because no one was using it.  Of course,
            execbuffer3 might not 

            >>>>have that

            >>>> problem at all.

            >>>>

            >>>

            >>>Thanks Jason.

            >>>Ok, I am not sure which exec flag is that, but
            yah, execbuffer3 probably

            >>>will not have this limiation. So, we need to
            define a VM_BIND_MAX_QUEUE

            >>>and somehow export it to user (I am thinking of
            embedding it in

            >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND,
            bits[1-3]->'n' meaning 2^n

            >>>queues.

            >>

            >>Ah, I think you are waking about I915_EXEC_RING_MASK
            (0x3f) which execbuf3

          
          Yup!  That's exactly the limit I was talking about.

          
            >>will also have. So, we can simply define in
            vm_bind/unbind structures,

            >>

            >>#define I915_VM_BIND_MAX_QUEUE   64

            >>        __u32 queue;

            >>

            >>I think that will keep things simple.

            >

            >Hmmm? What does execbuf2 limit has to do with how many
            engines 

            >hardware can have? I suggest not to do that.

            >

            >Change with added this:

            >

            >       if (set.num_engines > I915_EXEC_RING_MASK + 1)

            >               return -EINVAL;

            >

            >To context creation needs to be undone and so let users
            create engine 

            >maps with all hardware engines, and let execbuf3 access
            them all.

            >

            
            Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to
            execbuff3 also.

            Hence, I was using the same limit for VM_BIND queues (64, or
            65 if we

            make it N+1).

            But, as discussed in other thread of this RFC series, we are
            planning

            to drop this I915_EXEC_RING_MAP in execbuff3. So, there
            won't be

            any uapi that limits the number of engines (and hence the
            vm_bind queues

            need to be supported).

            
            If we leave the number of vm_bind queues to be arbitrarily
            large

            (__u32 queue_idx) then, we need to have a hashmap for queue
            (a wq,

            work_item and a linked list) lookup from the user specified
            queue index.

            Other option is to just put some hard limit (say 64 or 65)
            and use

            an array of queues in VM (each created upon first use). I
            prefer this.

          
          I don't get why a VM_BIND queue is any different from any
            other queue or userspace-visible kernel object.  But I'll
            leave those details up to danvet or whoever else might be
            reviewing the implementation.
          

          --Jason
        
      
    I kind of agree here. Wouldn't be simpler to have the bind queue
      created like the others when we build the engine map?
    For userspace it's then just matter of selecting the right queue
      ID when submitting.
    If there is ever a possibility to have this work on the GPU, it
      would be all ready.
    

    Thanks,

    
    -Lionel

    
            Niranjana

            
            >Regards,

            >

            >Tvrtko

            >

            >>

            >>Niranjana

            >>

            >>>

            >>>>   I am trying to see how many queues we
            need and don't want it to be

            >>>>   arbitrarily

            >>>>   large and unduely blow up memory usage
            and complexity in i915 driver.

            >>>>

            >>>> I expect a Vulkan driver to use at most 2
            in the vast majority 

            >>>>of cases. I

            >>>> could imagine a client wanting to create
            more than 1 sparse 

            >>>>queue in which

            >>>> case, it'll be N+1 but that's unlikely.  As
            far as complexity 

            >>>>goes, once

            >>>> you allow two, I don't think the complexity
            is going up by 

            >>>>allowing N.  As

            >>>> for memory usage, creating more queues
            means more memory.  That's a

            >>>> trade-off that userspace can make.  Again,
            the expected number 

            >>>>here is 1

            >>>> or 2 in the vast majority of cases so I
            don't think you need to worry.

            >>>

            >>>Ok, will start with n=3 meaning 8 queues.

            >>>That would require us create 8 workqueues.

            >>>We can change 'n' later if required.

            >>>

            >>>Niranjana

            >>>

            >>>>

            >>>>   >     Why?  Because Vulkan has two
            basic kind of bind 

            >>>>operations and we

            >>>>   don't

            >>>>   >     want any dependencies between
            them:

            >>>>   >      1. Immediate.  These happen
            right after BO creation or 

            >>>>maybe as

            >>>>   part of

            >>>>   >     vkBindImageMemory() or
            VkBindBufferMemory().  These 

            >>>>don't happen

            >>>>   on a

            >>>>   >     queue and we don't want them
            serialized with anything.  To

            >>>>   synchronize

            >>>>   >     with submit, we'll have a
            syncobj in the VkDevice which is

            >>>>   signaled by

            >>>>   >     all immediate bind operations
            and make submits wait on it.

            >>>>   >      2. Queued (sparse): These
            happen on a VkQueue which may be the

            >>>>   same as

            >>>>   >     a render/compute queue or may be
            its own queue.  It's up to us

            >>>>   what we

            >>>>   >     want to advertise.  From the
            Vulkan API PoV, this is like any

            >>>>   other

            >>>>   >     queue.  Operations on it wait on
            and signal semaphores.  If we

            >>>>   have a

            >>>>   >     VM_BIND engine, we'd provide
            syncobjs to wait and 

            >>>>signal just like

            >>>>   we do

            >>>>   >     in execbuf().

            >>>>   >     The important thing is that we
            don't want one type of 

            >>>>operation to

            >>>>   block

            >>>>   >     on the other.  If immediate
            binds are blocking on sparse binds,

            >>>>   it's

            >>>>   >     going to cause
            over-synchronization issues.

            >>>>   >     In terms of the internal
            implementation, I know that 

            >>>>there's going

            >>>>   to be

            >>>>   >     a lock on the VM and that we
            can't actually do these things in

            >>>>   >     parallel.  That's fine.  Once
            the dma_fences have signaled and

            >>>>   we're

            >>>>

            >>>>   Thats correct. It is like a single
            VM_BIND engine with 

            >>>>multiple queues

            >>>>   feeding to it.

            >>>>

            >>>> Right.  As long as the queues themselves
            are independent and 

            >>>>can block on

            >>>> dma_fences without holding up other queues,
            I think we're fine.

            >>>>

            >>>>   >     unblocked to do the bind
            operation, I don't care if 

            >>>>there's a bit

            >>>>   of

            >>>>   >     synchronization due to locking. 
            That's expected.  What 

            >>>>we can't

            >>>>   afford

            >>>>   >     to have is an immediate bind
            operation suddenly blocking on a

            >>>>   sparse

            >>>>   >     operation which is blocked on a
            compute job that's going to run

            >>>>   for

            >>>>   >     another 5ms.

            >>>>

            >>>>   As the VM_BIND queue is per VM, VM_BIND
            on one VM doesn't block the

            >>>>   VM_BIND

            >>>>   on other VMs. I am not sure about
            usecases here, but just wanted to

            >>>>   clarify.

            >>>>

            >>>> Yes, that's what I would expect.

            >>>> --Jason

            >>>>

            >>>>   Niranjana

            >>>>

            >>>>   >     For reference, Windows solves
            this by allowing arbitrarily many

            >>>>   paging

            >>>>   >     queues (what they call a VM_BIND
            engine/queue).  That 

            >>>>design works

            >>>>   >     pretty well and solves the
            problems in question.  

            >>>>Again, we could

            >>>>   just

            >>>>   >     make everything out-of-order and
            require using syncobjs 

            >>>>to order

            >>>>   things

            >>>>   >     as userspace wants. That'd be
            fine too.

            >>>>   >     One more note while I'm here:
            danvet said something on 

            >>>>IRC about

            >>>>   VM_BIND

            >>>>   >     queues waiting for syncobjs to
            materialize.  We don't really

            >>>>   want/need

            >>>>   >     this.  We already have all the
            machinery in userspace to handle

            >>>>   >     wait-before-signal and waiting
            for syncobj fences to 

            >>>>materialize

            >>>>   and

            >>>>   >     that machinery is on by
            default.  It would actually 

            >>>>take MORE work

            >>>>   in

            >>>>   >     Mesa to turn it off and take
            advantage of the kernel 

            >>>>being able to

            >>>>   wait

            >>>>   >     for syncobjs to materialize. 
            Also, getting that right is

            >>>>   ridiculously

            >>>>   >     hard and I really don't want to
            get it wrong in kernel 

            >>>>space.     When we

            >>>>   >     do memory fences,
            wait-before-signal will be a thing.  We don't

            >>>>   need to

            >>>>   >     try and make it a thing for
            syncobj.

            >>>>   >     --Jason

            >>>>   >

            >>>>   >   Thanks Jason,

            >>>>   >

            >>>>   >   I missed the bit in the Vulkan
            spec that we're allowed to have a

            >>>>   sparse

            >>>>   >   queue that does not implement
            either graphics or compute 

            >>>>operations

            >>>>   :

            >>>>   >

            >>>>   >     "While some implementations may
            include

            >>>>   VK_QUEUE_SPARSE_BINDING_BIT

            >>>>   >     support in queue families that
            also include

            >>>>   >

            >>>>   >      graphics and compute support,
            other implementations may only

            >>>>   expose a

            >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only
            queue

            >>>>   >

            >>>>   >      family."

            >>>>   >

            >>>>   >   So it can all be all a vm_bind
            engine that just does bind/unbind

            >>>>   >   operations.

            >>>>   >

            >>>>   >   But yes we need another engine for
            the immediate/non-sparse

            >>>>   operations.

            >>>>   >

            >>>>   >   -Lionel

            >>>>   >

            >>>>   >         >

            >>>>   >       Daniel, any thoughts?

            >>>>   >

            >>>>   >       Niranjana

            >>>>   >

            >>>>   >       >Matt

            >>>>   >       >

            >>>>   >       >>

            >>>>   >       >> Sorry I noticed this
            late.

            >>>>   >       >>

            >>>>   >       >>

            >>>>   >       >> -Lionel

            >>>>   >       >>

            >>>>   >       >>