On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
<niranjana.vishwanathapura@xxxxxxxxx> wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> On 02/06/2022 23:35, Jason Ekstrand wrote:
>
> On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> <niranjana.vishwanathapura@xxxxxxxxx> wrote:
>
> On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> >> > +async worker. The binding and unbinding will work like a special GPU engine.
> >> > +The binding and unbinding operations are serialized and will wait on specified
> >> > +input fences before the operation and will signal the output fences upon the
> >> > +completion of the operation. Due to serialization, completion of an operation
> >> > +will also indicate that all previous operations are also complete.
> >>
> >> I guess we should avoid saying "will immediately start binding/unbinding"
> >> if there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seems to imply it's
> >> not immediate.
> >>
>
> Ok, will fix.
> This was added because in the earlier design binding was deferred until
> the next execbuff.
> But now it is non-deferred (immediate in that sense). But yah, this is
> confusing and I will fix it.
>
> >>
> >> I have a question on the behavior of the bind operation when no input
> >> fence is provided. Let's say I do:
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized" I assume it's: in order.
> >>
>
> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
> will use the same queue and hence are ordered.
>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND"
> >> engine, there is a disconnect from the Vulkan specification.
> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> >>
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engine.
> >>
> >> But then it makes the VM_BIND input fences useless.
> >>
> >>
> >> Daniel: what do you think? Should we rework this or just deal with
> >> wait fences in userspace?
> >>
> >
> >My opinion is to rework this but make the ordering via an engine param
> >optional.
> >
> >e.g. A VM can be configured so all binds are ordered within the VM.
> >
> >e.g. A VM can be configured so all binds accept an engine argument (in
> >the case of the i915 likely this is a gem context handle) and binds are
> >ordered with respect to that engine.
> >
> >This gives UMDs options as the latter likely consumes more KMD resources,
> >so if a different UMD can live with binds being ordered within the VM
> >they can use a mode consuming fewer resources.
> >
>
> I think we need to be careful here if we are looking for some out of
> (submission) order completion of vm_bind/unbind.
> In-order completion means, in a batch of binds and unbinds to be
> completed in-order, the user only needs to specify an in-fence for the
> first bind/unbind call and the out-fence for the last bind/unbind call.
> Also, the VA released by an unbind call can be re-used by any subsequent
> bind call in that in-order batch.
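To illustrate the in-order semantics described above, here is a rough
userspace sketch. The vm_bind()/vm_unbind() helpers are hypothetical
wrappers over the proposed ioctls (the exact uapi is still being defined);
the point is only that a batch needs an in-fence on its first operation
and an out-fence on its last one.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical thin wrappers over the proposed VM_BIND/UNBIND ioctls;
 * fence arguments are syncobj handles, 0 meaning "none". */
int vm_bind(int fd, uint32_t vm_id, uint32_t bo_handle,
            uint64_t va, uint64_t length,
            uint32_t in_fence, uint32_t *out_fence);
int vm_unbind(int fd, uint32_t vm_id, uint64_t va, uint64_t length,
              uint32_t in_fence, uint32_t *out_fence);

/* In-order batch: in-fence only on the first op, out-fence only on the
 * last op.  The VA released by the unbind can be re-used by a later bind
 * in the same batch because completions are serialized. */
static int bind_batch(int fd, uint32_t vm_id,
                      uint32_t in_fence, uint32_t *out_fence)
{
    vm_unbind(fd, vm_id, 0x100000, 0x10000, in_fence, NULL);
    vm_bind(fd, vm_id, /*bo*/ 1, 0x100000, 0x10000, 0, NULL); /* re-uses the VA */
    return vm_bind(fd, vm_id, /*bo*/ 2, 0x200000, 0x10000, 0, out_fence);
}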
>
> These things will break if binding/unbinding were to be allowed to go
> out of (submission) order, and the user needs to be extra careful not to
> run into premature triggering of the out-fence, binds failing because
> the VA is still in use, etc.
>
> Also, VM_BIND binds the provided mapping on the specified address space
> (VM). So, the uapi is not engine/context specific.
>
> We can however add a 'queue' to the uapi which can be one from the
> pre-defined queues,
> I915_VM_BIND_QUEUE_0
> I915_VM_BIND_QUEUE_1
> ...
> I915_VM_BIND_QUEUE_(N-1)
>
> KMD will spawn an async work queue for each queue which will only bind
> the mappings on that queue in the order of submission.
> User can assign a queue per engine or anything like that.
>
> But again here, the user needs to be careful not to deadlock these
> queues with circular dependencies of fences.
>
> I prefer adding this later as an extension, based on whether it is
> really helping with the implementation.
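As a rough kernel-side illustration of the per-queue async workers
described above (names here are a sketch, not the actual i915
implementation), each VM could carry one ordered workqueue per bind
queue, so binds on the same queue complete in submission order while
different queues progress independently:

#include <linux/workqueue.h>
#include <linux/errno.h>

#define SKETCH_VM_BIND_NUM_QUEUES 8	/* e.g. n = 3 as discussed later */

struct sketch_vm {
	struct workqueue_struct *bind_wq[SKETCH_VM_BIND_NUM_QUEUES];
};

static int sketch_vm_init_bind_queues(struct sketch_vm *vm)
{
	int i;

	for (i = 0; i < SKETCH_VM_BIND_NUM_QUEUES; i++) {
		vm->bind_wq[i] = alloc_ordered_workqueue("vm_bind_q%d", 0, i);
		if (!vm->bind_wq[i])
			return -ENOMEM; /* error unwind omitted in this sketch */
	}
	return 0;
}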
>
> I can tell you right now that having everything on a single in-order
> queue will not get us the perf we want. What vulkan really wants is one
> of two things:
> 1. No implicit ordering of VM_BIND ops. They just happen in whatever
> order their dependencies are resolved and we ensure ordering ourselves
> by having a syncobj in the VkQueue.
> 2. The ability to create multiple VM_BIND queues. We need at least 2
> but I don't see why there needs to be a limit besides the limits the
> i915 API already has on the number of engines. Vulkan could expose
> multiple sparse binding queues to the client if it's not arbitrarily
> limited.
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already
has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the
number of bits we can cram into the exec flags in execbuffer2. I think
someone had an extended version that allowed more but I ripped it out
because no one was using it. Of course, execbuffer3 might not have that
problem at all.
Thanks Jason.
Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably
will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
and somehow export it to the user (I am thinking of embedding it in
I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
queues).
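For illustration, a userspace query of that packing might look like the
following (I915_PARAM_HAS_VM_BIND and its value are still only a proposal
here, so the define is a placeholder):

#include <errno.h>
#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

#define I915_PARAM_HAS_VM_BIND 57 /* placeholder value, not yet in i915_drm.h */

static int query_vm_bind(int fd, unsigned int *num_queues)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return -errno;

	if (!(value & 0x1))	/* bit[0]: VM_BIND supported */
		return 0;

	*num_queues = 1u << ((value >> 1) & 0x7); /* bits[1-3] = n, 2^n queues */
	return 1;
}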
Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which execbuf3
will also have. So, we can simply define in the vm_bind/unbind structures,

#define I915_VM_BIND_MAX_QUEUE 64
        __u32 queue;

I think that will keep things simple.
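For completeness, a possible shape for the bind struct with that queue
field might be the following; the real drm_i915_gem_vm_bind layout was
still being worked out in this thread, so the field names below are
purely illustrative:

#include <linux/types.h>

#define I915_VM_BIND_MAX_QUEUE 64

struct sketch_i915_gem_vm_bind {
	__u32 vm_id;      /* VM (address space) to bind into */
	__u32 queue;      /* 0 .. I915_VM_BIND_MAX_QUEUE - 1; binds on the
	                   * same queue complete in submission order */
	__u32 handle;     /* GEM object handle */
	__u32 pad;
	__u64 start;      /* GPU virtual address of the mapping */
	__u64 offset;     /* offset into the object */
	__u64 length;     /* length of the mapping */
	__u64 flags;
	__u64 extensions; /* chain for in/out fences, etc. */
};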
Niranjana
I am trying to see how many queues we need and don't want it to be
arbitrarily large and unduly blow up memory usage and complexity in the
i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
could imagine a client wanting to create more than 1 sparse queue in which
case, it'll be N+1 but that's unlikely. As far as complexity goes, once
you allow two, I don't think the complexity is going up by allowing N. As
for memory usage, creating more queues means more memory. That's a
trade-off that userspace can make. Again, the expected number here is 1
or 2 in the vast majority of cases so I don't think you need to worry.
Ok, will start with n=3 meaning 8 queues.
That would require us to create 8 workqueues.
We can change 'n' later if required.
Niranjana
> Why? Because Vulkan has two basic kinds of bind operations and we don't
> want any dependencies between them:
> 1. Immediate. These happen right after BO creation or maybe as part of
> vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a
> queue and we don't want them serialized with anything. To synchronize
> with submit, we'll have a syncobj in the VkDevice which is signaled by
> all immediate bind operations and make submits wait on it.
> 2. Queued (sparse): These happen on a VkQueue which may be the same as
> a render/compute queue or may be its own queue. It's up to us what we
> want to advertise. From the Vulkan API PoV, this is like any other
> queue. Operations on it wait on and signal semaphores. If we have a
> VM_BIND engine, we'd provide syncobjs to wait and signal just like we do
> in execbuf().
> The important thing is that we don't want one type of operation to block
> on the other. If immediate binds are blocking on sparse binds, it's
> going to cause over-synchronization issues.
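As a sketch of the "immediate bind" pattern in point 1 above: the VkDevice
owns a syncobj that every immediate bind signals and every submit waits
on, so immediate binds never end up serialized behind sparse/queued binds.
The device struct and the bind call are illustrative; only the syncobj
call is real libdrm API.

#include <stdint.h>
#include <xf86drm.h>

struct sketch_device {
	int      drm_fd;
	uint32_t bind_timeline;    /* DRM timeline syncobj handle */
	uint64_t bind_timeline_pt; /* last point signaled by an immediate bind */
};

static int sketch_device_init(struct sketch_device *dev)
{
	/* One syncobj per VkDevice, signaled by all immediate binds. */
	return drmSyncobjCreate(dev->drm_fd, 0, &dev->bind_timeline);
}

static void sketch_immediate_bind(struct sketch_device *dev /*, bo, va, ... */)
{
	uint64_t point = ++dev->bind_timeline_pt;

	/* Hypothetical: issue the VM_BIND ioctl with an out-fence that
	 * signals (dev->bind_timeline, point).  No in-fence, so it is
	 * never serialized behind a sparse/queued bind. */
	(void)point;
}

static void sketch_queue_submit(struct sketch_device *dev /*, execbuf args */)
{
	/* Every submit waits on (dev->bind_timeline, dev->bind_timeline_pt),
	 * e.g. via the execbuf timeline-fence array, before executing. */
}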
> In terms of the internal implementation, I know that there's going to be
> a lock on the VM and that we can't actually do these things in
> parallel. That's fine. Once the dma_fences have signaled and we're
That's correct. It is like a single VM_BIND engine with multiple queues
feeding to it.
Right. As long as the queues themselves are independent and can block on
dma_fences without holding up other queues, I think we're fine.
> unblocked to do the bind operation, I don't care if there's a bit of
> synchronization due to locking. That's expected. What we can't afford
> to have is an immediate bind operation suddenly blocking on a sparse
> operation which is blocked on a compute job that's going to run for
> another 5ms.
As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
VM_BIND on other VMs. I am not sure about the use cases here, but just
wanted to clarify.
Yes, that's what I would expect.
--Jason
Niranjana
> For reference, Windows solves this by allowing arbitrarily many paging
> queues (what they call a VM_BIND engine/queue). That design works
> pretty well and solves the problems in question. Again, we could just
> make everything out-of-order and require using syncobjs to order things
> as userspace wants. That'd be fine too.
> One more note while I'm here: danvet said something on IRC about VM_BIND
> queues waiting for syncobjs to materialize. We don't really want/need
> this. We already have all the machinery in userspace to handle
> wait-before-signal and waiting for syncobj fences to materialize and
> that machinery is on by default. It would actually take MORE work in
> Mesa to turn it off and take advantage of the kernel being able to wait
> for syncobjs to materialize. Also, getting that right is ridiculously
> hard and I really don't want to get it wrong in kernel space. When we
> do memory fences, wait-before-signal will be a thing. We don't need to
> try and make it a thing for syncobj.
> --Jason
>
> Thanks Jason,
>
> I missed the bit in the Vulkan spec that we're allowed to have a sparse
> queue that does not implement either graphics or compute operations:
>
> "While some implementations may include
VK_QUEUE_SPARSE_BINDING_BIT
> support in queue families that also include
>
> graphics and compute support, other implementations may only
expose a
> VK_QUEUE_SPARSE_BINDING_BIT-only queue
>
> family."
>
> So it can all be a vm_bind engine that just does bind/unbind
> operations.
>
> But yes we need another engine for the immediate/non-sparse
> operations.
>
> -Lionel
>
> >
> Daniel, any thoughts?
>
> Niranjana
>
> >Matt
> >
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>