On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
<niranjana.vishwanathapura@xxxxxxxxx> wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>
>
>On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@xxxxxxxxx> wrote:
>>>>
>>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>>> > On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>> >
>>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>>> > <niranjana.vishwanathapura@xxxxxxxxx> wrote:
>>>> >
>>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>> > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>> > >> > +async worker. The binding and unbinding will work like a special GPU engine.
>>>> > >> > +The binding and unbinding operations are serialized and will wait on specified
>>>> > >> > +input fences before the operation and will signal the output fences upon the
>>>> > >> > +completion of the operation. Due to serialization, completion of an operation
>>>> > >> > +will also indicate that all previous operations are also complete.
>>>> > >>
>>>> > >> I guess we should avoid saying "will immediately start binding/unbinding" if
>>>> > >> there are fences involved.
>>>> > >>
>>>> > >> And the fact that it's happening in an async worker seems to imply it's not
>>>> > >> immediate.
>>>> > >>
>>>> >
>>>> > Ok, will fix.
>>>> > This was added because in earlier design binding was deferred until
>>>> > next execbuff.
>>>> > But now it is non-deferred (immediate in that sense). But yah, this is
>>>> > confusing and will fix it.
>>>> >
>>>> > >>
>>>> > >> I have a question on the behavior of the bind operation when no
>>>> > >> input fence is provided. Let's say I do:
>>>> > >>
>>>> > >> VM_BIND (out_fence=fence1)
>>>> > >>
>>>> > >> VM_BIND (out_fence=fence2)
>>>> > >>
>>>> > >> VM_BIND (out_fence=fence3)
>>>> > >>
>>>> > >>
>>>> > >> In what order are the fences going to be signaled?
>>>> > >>
>>>> > >> In the order of VM_BIND ioctls? Or out of order?
>>>> > >>
>>>> > >> Because you wrote "serialized" I assume it's: in order
>>>> > >>
>>>> >
>>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>>>> > will use the same queue and hence are ordered.
>>>> >
>>>> > >>
>>>> > >> One thing I didn't realize is that because we only get one
>>>> > >> "VM_BIND" engine, there is a disconnect from the Vulkan specification.
>>>> > >>
>>>> > >> In Vulkan VM_BIND operations are serialized but per engine.
>>>> > >>
>>>> > >> So you could have something like this:
>>>> > >>
>>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>> > >>
>>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>> > >>
>>>> > >>
>>>> > >> fence1 is not signaled
>>>> > >>
>>>> > >> fence3 is signaled
>>>> > >>
>>>> > >> So the second VM_BIND will proceed before the first VM_BIND.
>>>> > >>
>>>> > >>
>>>> > >> I guess we can deal with that scenario in userspace by doing the
>>>> > >> wait ourselves in one thread per engine.
>>>> > >>
>>>> > >> But then it makes the VM_BIND input fences useless.
>>>> > >>
>>>> > >>
>>>> > >> Daniel: what do you think? Should we rework this or just deal with
>>>> > >> wait fences in userspace?
>>>> > >>
>>>> > >
>>>> > >My opinion is rework this but make the ordering via an engine param
>>>> > >optional.
>>>> > >
>>>> > >e.g. A VM can be configured so all binds are ordered within the VM
>>>> > >
>>>> > >e.g. A VM can be configured so all binds accept an engine argument (in
>>>> > >the case of the i915 likely this is a gem context handle) and binds
>>>> > >ordered with respect to that engine.
>>>> > >
>>>> > >This gives UMDs options as the latter likely consumes more KMD
>>>> > >resources so if a different UMD can live with binds being ordered within
>>>> > >the VM they can use a mode consuming less resources.
>>>> > >
>>>> >
>>>> > I think we need to be careful here if we are looking for some out of
>>>> > (submission) order completion of vm_bind/unbind.
>>>> > In-order completion means, in a batch of binds and unbinds to be
>>>> > completed in-order, user only needs to specify in-fence for the
>>>> > first bind/unbind call and the out-fence for the last bind/unbind
>>>> > call. Also, the VA released by an unbind call can be re-used by
>>>> > any subsequent bind call in that in-order batch.
>>>> >
>>>> > These things will break if binding/unbinding were to be allowed to
>>>> > go out of order (of submission) and user needs to be extra careful
>>>> > not to run into pre-mature triggering of out-fence and bind failing
>>>> > as VA is still in use etc.
>>>> >
>>>> > Also, VM_BIND binds the provided mapping on the specified address space
>>>> > (VM). So, the uapi is not engine/context specific.
>>>> >
>>>> > We can however add a 'queue' to the uapi which can be one from the
>>>> > pre-defined queues,
>>>> > I915_VM_BIND_QUEUE_0
>>>> > I915_VM_BIND_QUEUE_1
>>>> > ...
>>>> > I915_VM_BIND_QUEUE_(N-1)
>>>> >
>>>> > KMD will spawn an async work queue for each queue which will only
>>>> > bind the mappings on that queue in the order of submission.
>>>> > User can assign a queue per engine or anything like that.
>>>> >
>>>> > But again here, user needs to be careful not to deadlock these
>>>> > queues with circular dependency of fences.
>>>> >
>>>> > I prefer adding this later as an extension based on whether it
>>>> > is really helping with the implementation.
>>>> >
>>>> > I can tell you right now that having everything on a single in-order
>>>> > queue will not get us the perf we want. What vulkan really wants is one
>>>> > of two things:
>>>> > 1. No implicit ordering of VM_BIND ops. They just happen in whatever
>>>> > order their dependencies are resolved and we ensure ordering ourselves by
>>>> > having a syncobj in the VkQueue.
>>>> > 2. The ability to create multiple VM_BIND queues. We need at least 2
>>>> > but I don't see why there needs to be a limit besides the limits the
>>>> > i915 API already has on the number of engines. Vulkan could expose
>>>> > multiple sparse binding queues to the client if it's not arbitrarily
>>>> > limited.
>>>>
>>>> Thanks Jason, Lionel.
>>>>
>>>> Jason, what are you referring to when you say "limits the i915 API
>>>> already has on the number of engines"? I am not sure if there is such
>>>> an uapi today.
>>>>
>>>> There's a limit of something like 64 total engines today based on the
>>>> number of bits we can cram into the exec flags in execbuffer2. I think
>>>> someone had an extended version that allowed more but I ripped it out
>>>> because no one was using it. Of course, execbuffer3 might not have that
>>>> problem at all.
>>>>
>>>
>>>Thanks Jason.
>>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
>>>will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
>>>and somehow export it to user (I am thinking of embedding it in
>>>I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>>>queues).
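Purely as an illustration of the bit layout proposed above, a minimal userspace sketch of how such a getparam could be decoded. I915_PARAM_HAS_VM_BIND and its bit encoding are only this proposal, not an existing uapi, and the param value used here is a placeholder; only struct drm_i915_getparam and DRM_IOCTL_I915_GETPARAM are real.

#include <errno.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#ifndef I915_PARAM_HAS_VM_BIND
#define I915_PARAM_HAS_VM_BIND 57	/* placeholder value for this sketch */
#endif

static int query_vm_bind_queues(int fd, unsigned int *num_queues)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return -errno;

	if (!(value & 0x1))		/* bits[0]: VM_BIND supported at all */
		return -ENODEV;

	*num_queues = 1u << ((value >> 1) & 0x7);	/* bits[1-3]: 'n', 2^n queues */
	return 0;
}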
>>
>>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures,
>>
>>#define I915_VM_BIND_MAX_QUEUE 64
>>   __u32 queue;
>>
>>I think that will keep things simple.
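For illustration, a sketch of where that queue field could sit in the proposed vm_bind uapi. The struct name, field layout and comments below are assumptions drawn from this RFC discussion, not a final uapi.

#define I915_VM_BIND_MAX_QUEUE	64

struct drm_i915_gem_vm_bind {
	__u32 vm_id;		/* address space (VM) to bind into */
	__u32 queue;		/* in-order bind queue, < I915_VM_BIND_MAX_QUEUE */
	__u32 handle;		/* GEM object handle to map */
	__u32 pad;
	__u64 start;		/* GPU virtual address of the mapping */
	__u64 offset;		/* offset into the object */
	__u64 length;		/* length of the mapping */
	__u64 flags;
	__u64 extensions;	/* chain of extensions, e.g. in/out fences */
};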
>
>Hmmm? What does the execbuf2 limit have to do with how many engines
>hardware can have? I suggest not to do that.
>
>The change which added this:
>
> if (set.num_engines > I915_EXEC_RING_MASK + 1)
> return -EINVAL;
>
>to context creation needs to be undone, so let users create engine
>maps with all hardware engines, and let execbuf3 access them all.
>
Earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to execbuff3 also.
Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
make it N+1).
But, as discussed in another thread of this RFC series, we are planning
to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi
that limits the number of engines (and hence the number of vm_bind
queues that need to be supported).
If we leave the number of vm_bind queues to be arbitrarily large
(__u32 queue_idx), then we need to have a hashmap for queue (a wq,
work_item and a linked list) lookup from the user specified queue index.
The other option is to just put some hard limit (say 64 or 65) and use
an array of queues in the VM (each created upon first use). I prefer this.
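A rough kernel-side sketch of that preferred option (hard limit plus a per-VM array of queues created on first use). The structure and function names are illustrative assumptions, not the actual i915 implementation, and locking against concurrent ioctls is omitted; only the workqueue, list and allocation helpers are real kernel APIs.

#include <linux/err.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/workqueue.h>

#define I915_VM_BIND_MAX_QUEUE	64

struct i915_vm_bind_queue {
	struct workqueue_struct *wq;	/* ordered: binds on this queue complete in submission order */
	struct list_head work_items;	/* pending bind/unbind work */
};

/*
 * The VM would hold "struct i915_vm_bind_queue *bind_queues[I915_VM_BIND_MAX_QUEUE]",
 * all NULL until a queue index is first used.
 */
static struct i915_vm_bind_queue *
vm_bind_get_queue(struct i915_vm_bind_queue **bind_queues, u32 idx)
{
	struct i915_vm_bind_queue *q;

	if (idx >= I915_VM_BIND_MAX_QUEUE)
		return ERR_PTR(-EINVAL);

	if (bind_queues[idx])
		return bind_queues[idx];

	q = kzalloc(sizeof(*q), GFP_KERNEL);
	if (!q)
		return ERR_PTR(-ENOMEM);

	/* One ordered workqueue per bind queue keeps that queue in-order. */
	q->wq = alloc_ordered_workqueue("i915-vm-bind-%u", 0, idx);
	if (!q->wq) {
		kfree(q);
		return ERR_PTR(-ENOMEM);
	}
	INIT_LIST_HEAD(&q->work_items);

	bind_queues[idx] = q;
	return q;
}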
I don't get why a VM_BIND queue is any different from any other queue or
userspace-visible kernel object. But I'll leave those details up to
danvet or whoever else might be reviewing the implementation.
--Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created
like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when
submitting.
If there is ever a possibility to have this work on the GPU, it would be
all ready.
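Purely to illustrate that idea, a userspace sketch where the bind queue is just one more entry in the engine map and is then referenced by its index at submission time. I915_ENGINE_CLASS_VM_BIND and the notion of a queue field in the bind ioctl are assumptions for the example, not existing uapi; I915_DEFINE_CONTEXT_PARAM_ENGINES and the render/copy engine classes are real.

#include <drm/i915_drm.h>

#define I915_ENGINE_CLASS_VM_BIND 5	/* assumption, not a real engine class */

/* Engine map with two execution engines plus a bind queue at index 2. */
static I915_DEFINE_CONTEXT_PARAM_ENGINES(engine_map, 3) = {
	.engines = {
		[0] = { .engine_class = I915_ENGINE_CLASS_RENDER,  .engine_instance = 0 },
		[1] = { .engine_class = I915_ENGINE_CLASS_COPY,    .engine_instance = 0 },
		[2] = { .engine_class = I915_ENGINE_CLASS_VM_BIND, .engine_instance = 0 },
	},
};

/*
 * A VM_BIND submission would then just name index 2 from this map (e.g. a
 * queue field set to 2 in the bind ioctl), and binds would be ordered on
 * that queue the same way submissions are ordered on any other engine.
 */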