Thanks, Oak > -----Original Message----- > From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of > Zeng, Oak > Sent: June 14, 2022 5:13 PM > To: Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; > Landwerlin, Lionel G <lionel.g.landwerlin@xxxxxxxxx> > Cc: Intel GFX <intel-gfx@xxxxxxxxxxxxxxxxxxxxx>; Wilson, Chris P > <chris.p.wilson@xxxxxxxxx>; Hellstrom, Thomas > <thomas.hellstrom@xxxxxxxxx>; Maling list - DRI developers <dri- > devel@xxxxxxxxxxxxxxxxxxxxx>; Vetter, Daniel <daniel.vetter@xxxxxxxxx>; > Christian König <christian.koenig@xxxxxxx> > Subject: RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design > document > > > > Thanks, > Oak > > > -----Original Message----- > > From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx> > > Sent: June 14, 2022 1:02 PM > > To: Landwerlin, Lionel G <lionel.g.landwerlin@xxxxxxxxx> > > Cc: Zeng, Oak <oak.zeng@xxxxxxxxx>; Intel GFX <intel- > > gfx@xxxxxxxxxxxxxxxxxxxxx>; Maling list - DRI developers <dri- > > devel@xxxxxxxxxxxxxxxxxxxxx>; Hellstrom, Thomas > > <thomas.hellstrom@xxxxxxxxx>; Wilson, Chris P > <chris.p.wilson@xxxxxxxxx>; > > Vetter, Daniel <daniel.vetter@xxxxxxxxx>; Christian König > > <christian.koenig@xxxxxxx> > > Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design > > document > > > > On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote: > > >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote: > > >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote: > > >>> > > >>> > > >>>Regards, > > >>>Oak > > >>> > > >>>>-----Original Message----- > > >>>>From: Intel-gfx <intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On > > >>>>Behalf Of Niranjana > > >>>>Vishwanathapura > > >>>>Sent: June 10, 2022 1:43 PM > > >>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@xxxxxxxxx> > > >>>>Cc: Intel GFX <intel-gfx@xxxxxxxxxxxxxxxxxxxxx>; Maling list - > > >>>>DRI developers <dri- > > >>>>devel@xxxxxxxxxxxxxxxxxxxxx>; Hellstrom, Thomas > > >>>><thomas.hellstrom@xxxxxxxxx>; > > >>>>Wilson, Chris P <chris.p.wilson@xxxxxxxxx>; Vetter, Daniel > > >>>><daniel.vetter@xxxxxxxxx>; Christian König > > <christian.koenig@xxxxxxx> > > >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND > > >>>>feature design > > >>>>document > > >>>> > > >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote: > > >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote: > > >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote: > > >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote: > > >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin > wrote: > > >>>>>>>>> On 09/06/2022 00:55, Jason Ekstrand wrote: > > >>>>>>>>> > > >>>>>>>>> On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura > > >>>>>>>>> <niranjana.vishwanathapura@xxxxxxxxx> wrote: > > >>>>>>>>> > > >>>>>>>>> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko > > >>>>Ursulin wrote: > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: > > >>>>>>>>> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana > > >>>>>>>>>Vishwanathapura > > >>>>>>>>> wrote: > > >>>>>>>>> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason > > >>>>>>>>>Ekstrand wrote: > > >>>>>>>>> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana > > >>>>Vishwanathapura > > >>>>>>>>> >>>> <niranjana.vishwanathapura@xxxxxxxxx> wrote: > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel > > >>>>>>>>>Landwerlin > > >>>>>>>>> 
wrote: > > >>>>>>>>> >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana > > >>>>>>>>>Vishwanathapura > > >>>>>>>>> >>>> > <niranjana.vishwanathapura@xxxxxxxxx> wrote: > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > On Wed, Jun 01, 2022 at 01:28:36PM > > >>>>-0700, Matthew > > >>>>>>>>> >>>>Brost wrote: > > >>>>>>>>> >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM > > >>>>+0300, Lionel > > >>>>>>>>> Landwerlin > > >>>>>>>>> >>>> wrote: > > >>>>>>>>> >>>> > >> On 17/05/2022 21:32, Niranjana > Vishwanathapura > > >>>>>>>>> wrote: > > >>>>>>>>> >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start > > >>>>>>>>> >>>> binding/unbinding > > >>>>>>>>> >>>> > the mapping in an > > >>>>>>>>> >>>> > >> > +async worker. The binding and > > >>>>>>>>>unbinding will > > >>>>>>>>> >>>>work like a > > >>>>>>>>> >>>> special > > >>>>>>>>> >>>> > GPU engine. > > >>>>>>>>> >>>> > >> > +The binding and unbinding operations are > > >>>>>>>>> serialized and > > >>>>>>>>> >>>> will > > >>>>>>>>> >>>> > wait on specified > > >>>>>>>>> >>>> > >> > +input fences before the operation > > >>>>>>>>>and will signal > > >>>>>>>>> the > > >>>>>>>>> >>>> output > > >>>>>>>>> >>>> > fences upon the > > >>>>>>>>> >>>> > >> > +completion of the operation. Due to > > >>>>>>>>> serialization, > > >>>>>>>>> >>>> completion of > > >>>>>>>>> >>>> > an operation > > >>>>>>>>> >>>> > >> > +will also indicate that all > > >>>>>>>>>previous operations > > >>>>>>>>> >>>>are also > > >>>>>>>>> >>>> > complete. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> I guess we should avoid saying "will > > >>>>>>>>>immediately > > >>>>>>>>> start > > >>>>>>>>> >>>> > binding/unbinding" if > > >>>>>>>>> >>>> > >> there are fences involved. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> And the fact that it's happening in an async > > >>>>>>>>> >>>>worker seem to > > >>>>>>>>> >>>> imply > > >>>>>>>>> >>>> > it's not > > >>>>>>>>> >>>> > >> immediate. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > Ok, will fix. > > >>>>>>>>> >>>> > This was added because in earlier design > > >>>>>>>>>binding was > > >>>>>>>>> deferred > > >>>>>>>>> >>>> until > > >>>>>>>>> >>>> > next execbuff. > > >>>>>>>>> >>>> > But now it is non-deferred (immediate in > > >>>>>>>>>that sense). > > >>>>>>>>> >>>>But yah, > > >>>>>>>>> >>>> this is > > >>>>>>>>> >>>> > confusing > > >>>>>>>>> >>>> > and will fix it. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> I have a question on the behavior of the bind > > >>>>>>>>> >>>>operation when > > >>>>>>>>> >>>> no > > >>>>>>>>> >>>> > input fence > > >>>>>>>>> >>>> > >> is provided. Let say I do : > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> VM_BIND (out_fence=fence1) > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> VM_BIND (out_fence=fence2) > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> VM_BIND (out_fence=fence3) > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> In what order are the fences going to > > >>>>>>>>>be signaled? > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> In the order of VM_BIND ioctls? Or out > > >>>>>>>>>of order? > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> Because you wrote "serialized I assume > > >>>>>>>>>it's : in > > >>>>>>>>> order > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > Yes, in the order of VM_BIND/UNBIND > > >>>>>>>>>ioctls. 
Note that > > >>>>>>>>> >>>>bind and > > >>>>>>>>> >>>> unbind > > >>>>>>>>> >>>> > will use > > >>>>>>>>> >>>> > the same queue and hence are ordered. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> One thing I didn't realize is that > > >>>>>>>>>because we only > > >>>>>>>>> get one > > >>>>>>>>> >>>> > "VM_BIND" engine, > > >>>>>>>>> >>>> > >> there is a disconnect from the Vulkan > > >>>>>>>>>specification. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> In Vulkan VM_BIND operations are > > >>>>>>>>>serialized but > > >>>>>>>>> >>>>per engine. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> So you could have something like this : > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, > > >>>>>>>>> out_fence=fence2) > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, > > >>>>>>>>> out_fence=fence4) > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> fence1 is not signaled > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> fence3 is signaled > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> So the second VM_BIND will proceed before the > > >>>>>>>>> >>>>first VM_BIND. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> I guess we can deal with that scenario in > > >>>>>>>>> >>>>userspace by doing > > >>>>>>>>> >>>> the > > >>>>>>>>> >>>> > wait > > >>>>>>>>> >>>> > >> ourselves in one thread per engines. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> But then it makes the VM_BIND input > > >>>>>>>>>fences useless. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> Daniel : what do you think? Should be > > >>>>>>>>>rework this or > > >>>>>>>>> just > > >>>>>>>>> >>>> deal with > > >>>>>>>>> >>>> > wait > > >>>>>>>>> >>>> > >> fences in userspace? > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > > > > >>>>>>>>> >>>> > >My opinion is rework this but make the > > >>>>>>>>>ordering via > > >>>>>>>>> >>>>an engine > > >>>>>>>>> >>>> param > > >>>>>>>>> >>>> > optional. > > >>>>>>>>> >>>> > > > > >>>>>>>>> >>>> > >e.g. A VM can be configured so all binds > > >>>>>>>>>are ordered > > >>>>>>>>> >>>>within the > > >>>>>>>>> >>>> VM > > >>>>>>>>> >>>> > > > > >>>>>>>>> >>>> > >e.g. A VM can be configured so all binds > > >>>>>>>>>accept an > > >>>>>>>>> engine > > >>>>>>>>> >>>> argument > > >>>>>>>>> >>>> > (in > > >>>>>>>>> >>>> > >the case of the i915 likely this is a > > >>>>>>>>>gem context > > >>>>>>>>> >>>>handle) and > > >>>>>>>>> >>>> binds > > >>>>>>>>> >>>> > >ordered with respect to that engine. > > >>>>>>>>> >>>> > > > > >>>>>>>>> >>>> > >This gives UMDs options as the later > > >>>>>>>>>likely consumes > > >>>>>>>>> >>>>more KMD > > >>>>>>>>> >>>> > resources > > >>>>>>>>> >>>> > >so if a different UMD can live with > > >>>>binds being > > >>>>>>>>> >>>>ordered within > > >>>>>>>>> >>>> the VM > > >>>>>>>>> >>>> > >they can use a mode consuming less resources. > > >>>>>>>>> >>>> > > > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > I think we need to be careful here if we > > >>>>>>>>>are looking > > >>>>>>>>> for some > > >>>>>>>>> >>>> out of > > >>>>>>>>> >>>> > (submission) order completion of vm_bind/unbind. 
> > >>>>>>>>> >>>> > In-order completion means, in a batch of > > >>>>>>>>>binds and > > >>>>>>>>> >>>>unbinds to be > > >>>>>>>>> >>>> > completed in-order, user only needs to specify > > >>>>>>>>> >>>>in-fence for the > > >>>>>>>>> >>>> > first bind/unbind call and the our-fence > > >>>>>>>>>for the last > > >>>>>>>>> >>>> bind/unbind > > >>>>>>>>> >>>> > call. Also, the VA released by an unbind > > >>>>>>>>>call can be > > >>>>>>>>> >>>>re-used by > > >>>>>>>>> >>>> > any subsequent bind call in that > > >>>>in-order batch. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > These things will break if > > >>>>>>>>>binding/unbinding were to > > >>>>>>>>> >>>>be allowed > > >>>>>>>>> >>>> to > > >>>>>>>>> >>>> > go out of order (of submission) and user > > >>>>>>>>>need to be > > >>>>>>>>> extra > > >>>>>>>>> >>>> careful > > >>>>>>>>> >>>> > not to run into pre-mature triggereing of > > >>>>>>>>>out-fence and > > >>>>>>>>> bind > > >>>>>>>>> >>>> failing > > >>>>>>>>> >>>> > as VA is still in use etc. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > Also, VM_BIND binds the provided > > >>>>mapping on the > > >>>>>>>>> specified > > >>>>>>>>> >>>> address > > >>>>>>>>> >>>> > space > > >>>>>>>>> >>>> > (VM). So, the uapi is not engine/context > > >>>>>>>>>specific. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > We can however add a 'queue' to the uapi > > >>>>>>>>>which can be > > >>>>>>>>> >>>>one from > > >>>>>>>>> >>>> the > > >>>>>>>>> >>>> > pre-defined queues, > > >>>>>>>>> >>>> > I915_VM_BIND_QUEUE_0 > > >>>>>>>>> >>>> > I915_VM_BIND_QUEUE_1 > > >>>>>>>>> >>>> > ... > > >>>>>>>>> >>>> > I915_VM_BIND_QUEUE_(N-1) > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > KMD will spawn an async work queue for > > >>>>>>>>>each queue which > > >>>>>>>>> will > > >>>>>>>>> >>>> only > > >>>>>>>>> >>>> > bind the mappings on that queue in the > > >>>>order of > > >>>>>>>>> submission. > > >>>>>>>>> >>>> > User can assign the queue to per engine > > >>>>>>>>>or anything > > >>>>>>>>> >>>>like that. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > But again here, user need to be > > >>>>careful and not > > >>>>>>>>> >>>>deadlock these > > >>>>>>>>> >>>> > queues with circular dependency of fences. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > I prefer adding this later an as > > >>>>>>>>>extension based on > > >>>>>>>>> >>>>whether it > > >>>>>>>>> >>>> > is really helping with the implementation. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > I can tell you right now that having > > >>>>>>>>>everything on a > > >>>>>>>>> single > > >>>>>>>>> >>>> in-order > > >>>>>>>>> >>>> > queue will not get us the perf we want. > > >>>>>>>>>What vulkan > > >>>>>>>>> >>>>really wants > > >>>>>>>>> >>>> is one > > >>>>>>>>> >>>> > of two things: > > >>>>>>>>> >>>> > 1. No implicit ordering of VM_BIND > > >>>>ops. They just > > >>>>>>>>> happen in > > >>>>>>>>> >>>> whatever > > >>>>>>>>> >>>> > their dependencies are resolved and we > > >>>>>>>>>ensure ordering > > >>>>>>>>> >>>>ourselves > > >>>>>>>>> >>>> by > > >>>>>>>>> >>>> > having a syncobj in the VkQueue. > > >>>>>>>>> >>>> > 2. The ability to create multiple VM_BIND > > >>>>>>>>>queues. We > > >>>>>>>>> need at > > >>>>>>>>> >>>> least 2 > > >>>>>>>>> >>>> > but I don't see why there needs to be a > > >>>>>>>>>limit besides > > >>>>>>>>> >>>>the limits > > >>>>>>>>> >>>> the > > >>>>>>>>> >>>> > i915 API already has on the number of > > >>>>>>>>>engines. 
Vulkan > > >>>>>>>>> could > > >>>>>>>>> >>>> expose > > >>>>>>>>> >>>> > multiple sparse binding queues to the > > >>>>>>>>>client if it's not > > >>>>>>>>> >>>> arbitrarily > > >>>>>>>>> >>>> > limited. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> Thanks Jason, Lionel. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> Jason, what are you referring to when you say > > >>>>>>>>>"limits the i915 > > >>>>>>>>> API > > >>>>>>>>> >>>> already > > >>>>>>>>> >>>> has on the number of engines"? I am not sure if > > >>>>>>>>>there is such > > >>>>>>>>> an uapi > > >>>>>>>>> >>>> today. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> There's a limit of something like 64 total engines > > >>>>>>>>>today based on > > >>>>>>>>> the > > >>>>>>>>> >>>> number of bits we can cram into the exec flags in > > >>>>>>>>>execbuffer2. I > > >>>>>>>>> think > > >>>>>>>>> >>>> someone had an extended version that allowed more > > >>>>>>>>>but I ripped it > > >>>>>>>>> out > > >>>>>>>>> >>>> because no one was using it. Of course, > > >>>>>>>>>execbuffer3 might not > > >>>>>>>>> >>>>have that > > >>>>>>>>> >>>> problem at all. > > >>>>>>>>> >>>> > > >>>>>>>>> >>> > > >>>>>>>>> >>>Thanks Jason. > > >>>>>>>>> >>>Ok, I am not sure which exec flag is that, but yah, > > >>>>>>>>>execbuffer3 > > >>>>>>>>> probably > > >>>>>>>>> >>>will not have this limiation. So, we need to define a > > >>>>>>>>> VM_BIND_MAX_QUEUE > > >>>>>>>>> >>>and somehow export it to user (I am thinking of > > >>>>>>>>>embedding it in > > >>>>>>>>> >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, > > >>>>bits[1-3]->'n' > > >>>>>>>>> meaning 2^n > > >>>>>>>>> >>>queues. > > >>>>>>>>> >> > > >>>>>>>>> >>Ah, I think you are waking about I915_EXEC_RING_MASK > > >>>>>>>>>(0x3f) which > > >>>>>>>>> execbuf3 > > >>>>>>>>> > > >>>>>>>>> Yup! That's exactly the limit I was talking about. > > >>>>>>>>> > > >>>>>>>>> >>will also have. So, we can simply define in > vm_bind/unbind > > >>>>>>>>> structures, > > >>>>>>>>> >> > > >>>>>>>>> >>#define I915_VM_BIND_MAX_QUEUE 64 > > >>>>>>>>> >> __u32 queue; > > >>>>>>>>> >> > > >>>>>>>>> >>I think that will keep things simple. > > >>>>>>>>> > > > >>>>>>>>> >Hmmm? What does execbuf2 limit has to do with how > > >>>>many engines > > >>>>>>>>> >hardware can have? I suggest not to do that. > > >>>>>>>>> > > > >>>>>>>>> >Change with added this: > > >>>>>>>>> > > > >>>>>>>>> > if (set.num_engines > I915_EXEC_RING_MASK + 1) > > >>>>>>>>> > return -EINVAL; > > >>>>>>>>> > > > >>>>>>>>> >To context creation needs to be undone and so let users > > >>>>>>>>>create engine > > >>>>>>>>> >maps with all hardware engines, and let execbuf3 access > > >>>>>>>>>them all. > > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to > > >>>>>>>>>execbuff3 also. > > >>>>>>>>> Hence, I was using the same limit for VM_BIND queues > > >>>>>>>>>(64, or 65 if we > > >>>>>>>>> make it N+1). > > >>>>>>>>> But, as discussed in other thread of this RFC series, we > > >>>>>>>>>are planning > > >>>>>>>>> to drop this I915_EXEC_RING_MAP in execbuff3. So, > > >>>>there won't be > > >>>>>>>>> any uapi that limits the number of engines (and hence > > >>>>>>>>>the vm_bind > > >>>>>>>>> queues > > >>>>>>>>> need to be supported). > > >>>>>>>>> > > >>>>>>>>> If we leave the number of vm_bind queues to be > > >>>>arbitrarily large > > >>>>>>>>> (__u32 queue_idx) then, we need to have a hashmap for > > >>>>>>>>>queue (a wq, > > >>>>>>>>> work_item and a linked list) lookup from the user > > >>>>>>>>>specified queue > > >>>>>>>>> index. 
> > >>>>>>>>> Other option is to just put some hard limit (say 64 or > > >>>>>>>>>65) and use > > >>>>>>>>> an array of queues in VM (each created upon first use). > > >>>>>>>>>I prefer this. > > >>>>>>>>> > > >>>>>>>>> I don't get why a VM_BIND queue is any different from any > > >>>>>>>>>other queue or > > >>>>>>>>> userspace-visible kernel object. But I'll leave those > > >>>>>>>>>details up to > > >>>>>>>>> danvet or whoever else might be reviewing the > > implementation. > > >>>>>>>>> --Jason > > >>>>>>>>> > > >>>>>>>>> I kind of agree here. Wouldn't be simpler to have the bind > > >>>>>>>>>queue created > > >>>>>>>>> like the others when we build the engine map? > > >>>>>>>>> > > >>>>>>>>> For userspace it's then just matter of selecting the right > > >>>>>>>>>queue ID when > > >>>>>>>>> submitting. > > >>>>>>>>> > > >>>>>>>>> If there is ever a possibility to have this work on the GPU, > > >>>>>>>>>it would be > > >>>>>>>>> all ready. > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>>I did sync offline with Matt Brost on this. > > >>>>>>>>We can add a VM_BIND engine class and let user create > VM_BIND > > >>>>>>>>engines (queues). > > >>>>>>>>The problem is, in i915 engine creating interface is bound to > > >>>>>>>>gem_context. > > >>>>>>>>So, in vm_bind ioctl, we would need both context_id and > > >>>>>>>>queue_idx for proper > > >>>>>>>>lookup of the user created engine. This is bit ackward as > > >>>>vm_bind is an > > >>>>>>>>interface to VM (address space) and has nothing to do with > > >>>>gem_context. > > >>>>>>> > > >>>>>>> > > >>>>>>>A gem_context has a single vm object right? > > >>>>>>> > > >>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a > > default > > >>>>>>>one if not. > > >>>>>>> > > >>>>>>>So it's just like picking up the vm like it's done at execbuffer > > >>>>>>>time right now : eb->context->vm > > >>>>>>> > > >>>>>> > > >>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the > > >>>>>>VM_BIND/UNBIND > > >>>>>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can > > be > > >>>>>>obtained > > >>>>>>from the context? > > >>>>> > > >>>>> > > >>>>>Yes, because if we go for engines, they're associated with a context > > >>>>>and so also associated with the VM bound to the context. > > >>>>> > > >>>> > > >>>>Hmm...context doesn't sould like the right interface. It should be > > >>>>VM and engine (independent of context). Engine can be virtual or soft > > >>>>engine (kernel thread), each with its own queue. We can add an > > >>>>interface > > >>>>to create such engines (independent of context). But we are anway > > >>>>implicitly creating it when user uses a new queue_idx. If in future > > >>>>we have hardware engines for VM_BIND operation, we can have that > > >>>>explicit inteface to create engine instances and the queue_index > > >>>>in vm_bind/unbind will point to those engines. > > >>>>Anyone has any thoughts? Daniel? > > >>> > > >>>Exposing gem_context or intel_context to user space is a strange > > >>>concept to me. A context represent some hw resources that is used > > >>>to complete certain task. User space should care allocate some > > >>>resources (memory, queues) and submit tasks to queues. But user > > >>>space doesn't care how certain task is mapped to a HW context - > > >>>driver/guc should take care of this. > > >>> > > >>>So a cleaner interface to me is: user space create a vm, create > > >>>gem object, vm_bind it to a vm; allocate queues (internally > > >>>represent compute or blitter HW. 
Queue can be virtual to user) for > > >>>this vm; submit tasks to queues. User can create multiple queues > > >>>under one vm. One queue is only for one vm. > > >>> > > >>>I915 driver/guc manage the hw compute or blitter resources which > > >>>is transparent to user space. When i915 or guc decide to schedule > > >>>a queue (run tasks on that queue), a HW engine will be pick up and > > >>>set up properly for the vm of that queue (ie., switch to page > > >>>tables of that vm) - this is a context switch. > > >>> > > >>>From vm_bind perspective, it simply bind a gem_object to a vm. > > >>>Engine/queue is not a parameter to vm_bind, as any engine can be > > >>>pick up by i915/guc to execute a task using the vm bound va. > > >>> > > >>>I didn't completely follow the discussion here. Just share some > > >>>thoughts. > > >>> > > >> > > >>Yah, I agree. > > >> > > >>Lionel, > > >>How about we define the queue as > > >>union { > > >> __u32 queue_idx; > > >> __u64 rsvd; > > >>} > > >> > > >>If required, we can extend by expanding the 'rsvd' field to <ctx_id, > > >>queue_idx> later > > >>with a flag. > > >> > > >>Niranjana > > > > > > > > >I did not really understand Oak's comment nor what you're suggesting > > >here to be honest. > > > > > > > > >First the GEM context is already exposed to userspace. It's explicitly > > >created by userpace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE. > > > > > >We give the GEM context id in every execbuffer we do with > > >drm_i915_gem_execbuffer2::rsvd1. > > > > > >It's still in the new execbuffer3 proposal being discussed. > > > > > > > > >Second, the GEM context is also where we set the VM with > > >I915_CONTEXT_PARAM_VM. > > > > > > > > >Third, the GEM context also has the list of engines with > > >I915_CONTEXT_PARAM_ENGINES. > > > > > > > Yes, the execbuf and engine map creation are tied to gem_context. > > (which probably is not the best interface.) > > > > > > > >So it makes sense to me to dispatch the vm_bind operation to a GEM > > >context, to a given vm_bind queue, because it's got all the > > >information required : > > > > > > - the list of new vm_bind queues > > > > > > - the vm that is going to be modified > > > > > > > But the operation is performed here on the address space (VM) which > > can have multiple gem_contexts referring to it. So, VM is the right > > interface here. We need not 'gem_context'ify it. > > > > All we need is multiple queue support for the address space (VM). > > Going to gem_context for that just because we have engine creation > > support there seems unnecessay and not correct to me. > > > > > > > >Otherwise where do the vm_bind queues live? > > > > > >In the i915/drm fd object? > > > > > >That would mean that all the GEM contexts are sharing the same vm_bind > > >queues. > > > > > > > Not all, only the gem contexts that are using the same address space (VM). > > But to me the right way to describe would be that "VM will be using those > > queues". > > > I hope by "queue" here you mean a HW resource that will be later used to > execute the job, for example a ccs compute engine. Of course queue can be > virtual so user can create more queues than what hw physically has. > > To express the concept of "VM will be using those queues", I think it make > sense to have create_queue(vm) function taking a vm parameter. This > means this queue is created for the purpose of submit job under this VM. > Later on, we can submit job (referring to objects vm_bound to the same vm) > to the queue. 
The vm_bind ioctl doesn’t need to have queue parameter, just > vm_bind (object, va, vm). > > I hope the "queue" here is not the engine used to perform the vm_bind > operation itself. But if you meant a queue/engine to perform vm_bind itself > (vs a queue/engine for later job submission), then we can discuss more. I > know xe driver have similar concept and I think align the design early can > benefit the migration to xe driver. Oops, I read more on this thread and it turned out the vm_bind queue here is actually used to perform vm bind/unbind operations. XE driver has the similar concept (except it is called engine_id there). So having a queue_idx parameter is closer to xe design. That said, I still feel having a queue_idx parameter to vm_bind is a bit awkward. Vm_bind can be performed without any GPU engines, ie,. CPU itself can complete a vm bind as long as CPU have access to gpu's local memory. So the queue here have to be a virtual concept - it doesn't have a hard map to GPU blitter engine. Can someone summarize what is the benefit of the queue-idx parameter? For the purpose of ordering vm_bind and later gpu jobs? > > Regards, > Oak > > > > > Niranjana > > > > > > > >intel_context or GuC are internal details we're not concerned about. > > > > > >I don't really see the connection with the GEM context. > > > > > > > > >Maybe Oak has a different use case than Vulkan. > > > > > > > > >-Lionel > > > > > > > > >> > > >>>Regards, > > >>>Oak > > >>> > > >>>> > > >>>>Niranjana > > >>>> > > >>>>> > > >>>>>>I think the interface is clean as a interface to VM. It is > > >>>>only that we > > >>>>>>don't have a clean way to create a raw VM_BIND engine (not > > >>>>>>associated with > > >>>>>>any context) with i915 uapi. > > >>>>>>May be we can add such an interface, but I don't think that is > > >>>>worth it > > >>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl > as I > > >>>>>>mentioned > > >>>>>>above). > > >>>>>>Anyone has any thoughts? > > >>>>>> > > >>>>>>> > > >>>>>>>>Another problem is, if two VMs are binding with the same > defined > > >>>>>>>>engine, > > >>>>>>>>binding on VM1 can get unnecessary blocked by binding on VM2 > > >>>>>>>>(which may be > > >>>>>>>>waiting on its in_fence). > > >>>>>>> > > >>>>>>> > > >>>>>>>Maybe I'm missing something, but how can you have 2 vm objects > > >>>>>>>with a single gem_context right now? > > >>>>>>> > > >>>>>> > > >>>>>>No, we don't have 2 VMs for a gem_context. > > >>>>>>Say if ctx1 with vm1 and ctx2 with vm2. > > >>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map. > > >>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If > > >>>>>>those two queue indicies points to same underlying vm_bind > engine, > > >>>>>>then the second vm_bind call gets blocked until the first > > >>>>vm_bind call's > > >>>>>>'in' fence is triggered and bind completes. > > >>>>>> > > >>>>>>With per VM queues, this is not a problem as two VMs will not > endup > > >>>>>>sharing same queue. > > >>>>>> > > >>>>>>BTW, I just posted a updated PATCH series. > > >>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html > > >>>>>> > > >>>>>>Niranjana > > >>>>>> > > >>>>>>> > > >>>>>>>> > > >>>>>>>>So, my preference here is to just add a 'u32 queue' index in > > >>>>>>>>vm_bind/unbind > > >>>>>>>>ioctl, and the queues are per VM. 
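To make sure we are talking about the same thing, here is a rough sketch of what such a per-VM 'u32 queue' could look like in the bind/unbind structures. This is purely illustrative: apart from the __u32 queue index and the I915_VM_BIND_MAX_QUEUE limit already floated in this thread, the struct and field names below are made up and are not the proposed uapi.

#include <linux/types.h>

/* Illustrative sketch only -- not the proposed uapi. Only the __u32 queue
 * index and I915_VM_BIND_MAX_QUEUE come from the discussion above; the
 * other names are placeholders.
 */
#define I915_VM_BIND_MAX_QUEUE	64

struct i915_vm_bind_sketch {
	/* vm (address space) to modify; the bind queues are per VM */
	__u32 vm_id;

	/* bind queue: operations on one queue complete in submission
	 * order, different queues are independent of each other
	 */
	__u32 queue_idx;

	/* backing object and the mapping to create */
	__u32 handle;
	__u32 pad;
	__u64 start;	/* GPU virtual address */
	__u64 offset;	/* offset into the object */
	__u64 length;

	/* in/out fences (e.g. drm_syncobj handles) for the async bind */
	__u32 in_syncobj;
	__u32 out_syncobj;
};

The queue index is only about ordering: operations on the same queue complete in submission order (so an in-fence on the first and an out-fence on the last of a batch is enough), while separate queues never block each other.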
> > >>>>>>>> > > >>>>>>>>Niranjana > > >>>>>>>> > > >>>>>>>>> Thanks, > > >>>>>>>>> > > >>>>>>>>> -Lionel > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Niranjana > > >>>>>>>>> > > >>>>>>>>> >Regards, > > >>>>>>>>> > > > >>>>>>>>> >Tvrtko > > >>>>>>>>> > > > >>>>>>>>> >> > > >>>>>>>>> >>Niranjana > > >>>>>>>>> >> > > >>>>>>>>> >>> > > >>>>>>>>> >>>> I am trying to see how many queues we need and > > >>>>>>>>>don't want it to > > >>>>>>>>> be > > >>>>>>>>> >>>> arbitrarily > > >>>>>>>>> >>>> large and unduely blow up memory usage and > > >>>>>>>>>complexity in i915 > > >>>>>>>>> driver. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> I expect a Vulkan driver to use at most 2 in the > > >>>>>>>>>vast majority > > >>>>>>>>> >>>>of cases. I > > >>>>>>>>> >>>> could imagine a client wanting to create more > > >>>>than 1 sparse > > >>>>>>>>> >>>>queue in which > > >>>>>>>>> >>>> case, it'll be N+1 but that's unlikely. As far as > > >>>>>>>>>complexity > > >>>>>>>>> >>>>goes, once > > >>>>>>>>> >>>> you allow two, I don't think the complexity is > > >>>>going up by > > >>>>>>>>> >>>>allowing N. As > > >>>>>>>>> >>>> for memory usage, creating more queues means more > > >>>>>>>>>memory. That's > > >>>>>>>>> a > > >>>>>>>>> >>>> trade-off that userspace can make. Again, the > > >>>>>>>>>expected number > > >>>>>>>>> >>>>here is 1 > > >>>>>>>>> >>>> or 2 in the vast majority of cases so I don't think > > >>>>>>>>>you need to > > >>>>>>>>> worry. > > >>>>>>>>> >>> > > >>>>>>>>> >>>Ok, will start with n=3 meaning 8 queues. > > >>>>>>>>> >>>That would require us create 8 workqueues. > > >>>>>>>>> >>>We can change 'n' later if required. > > >>>>>>>>> >>> > > >>>>>>>>> >>>Niranjana > > >>>>>>>>> >>> > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> > Why? Because Vulkan has two basic kind of bind > > >>>>>>>>> >>>>operations and we > > >>>>>>>>> >>>> don't > > >>>>>>>>> >>>> > want any dependencies between them: > > >>>>>>>>> >>>> > 1. Immediate. These happen right after BO > > >>>>>>>>>creation or > > >>>>>>>>> >>>>maybe as > > >>>>>>>>> >>>> part of > > >>>>>>>>> >>>> > vkBindImageMemory() or > > VkBindBufferMemory(). These > > >>>>>>>>> >>>>don't happen > > >>>>>>>>> >>>> on a > > >>>>>>>>> >>>> > queue and we don't want them serialized > > >>>>>>>>>with anything. To > > >>>>>>>>> >>>> synchronize > > >>>>>>>>> >>>> > with submit, we'll have a syncobj in the > > >>>>>>>>>VkDevice which > > >>>>>>>>> is > > >>>>>>>>> >>>> signaled by > > >>>>>>>>> >>>> > all immediate bind operations and make > > >>>>>>>>>submits wait on > > >>>>>>>>> it. > > >>>>>>>>> >>>> > 2. Queued (sparse): These happen on a > > >>>>>>>>>VkQueue which may > > >>>>>>>>> be the > > >>>>>>>>> >>>> same as > > >>>>>>>>> >>>> > a render/compute queue or may be its own > > >>>>>>>>>queue. It's up > > >>>>>>>>> to us > > >>>>>>>>> >>>> what we > > >>>>>>>>> >>>> > want to advertise. From the Vulkan API > > >>>>>>>>>PoV, this is like > > >>>>>>>>> any > > >>>>>>>>> >>>> other > > >>>>>>>>> >>>> > queue. Operations on it wait on and signal > > >>>>>>>>>semaphores. If we > > >>>>>>>>> >>>> have a > > >>>>>>>>> >>>> > VM_BIND engine, we'd provide syncobjs to > > >>>>wait and > > >>>>>>>>> >>>>signal just like > > >>>>>>>>> >>>> we do > > >>>>>>>>> >>>> > in execbuf(). > > >>>>>>>>> >>>> > The important thing is that we don't want > > >>>>>>>>>one type of > > >>>>>>>>> >>>>operation to > > >>>>>>>>> >>>> block > > >>>>>>>>> >>>> > on the other. 
If immediate binds are > > >>>>>>>>>blocking on sparse > > >>>>>>>>> binds, > > >>>>>>>>> >>>> it's > > >>>>>>>>> >>>> > going to cause over-synchronization issues. > > >>>>>>>>> >>>> > In terms of the internal implementation, I > > >>>>>>>>>know that > > >>>>>>>>> >>>>there's going > > >>>>>>>>> >>>> to be > > >>>>>>>>> >>>> > a lock on the VM and that we can't actually > > >>>>>>>>>do these > > >>>>>>>>> things in > > >>>>>>>>> >>>> > parallel. That's fine. Once the dma_fences have > > >>>>>>>>> signaled and > > >>>>>>>>> >>>> we're > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> Thats correct. It is like a single VM_BIND > > >>>>engine with > > >>>>>>>>> >>>>multiple queues > > >>>>>>>>> >>>> feeding to it. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> Right. As long as the queues themselves are > > >>>>>>>>>independent and > > >>>>>>>>> >>>>can block on > > >>>>>>>>> >>>> dma_fences without holding up other queues, I think > > >>>>>>>>>we're fine. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> > unblocked to do the bind operation, I don't care if > > >>>>>>>>> >>>>there's a bit > > >>>>>>>>> >>>> of > > >>>>>>>>> >>>> > synchronization due to locking. That's > > >>>>>>>>>expected. What > > >>>>>>>>> >>>>we can't > > >>>>>>>>> >>>> afford > > >>>>>>>>> >>>> > to have is an immediate bind operation > > >>>>>>>>>suddenly blocking > > >>>>>>>>> on a > > >>>>>>>>> >>>> sparse > > >>>>>>>>> >>>> > operation which is blocked on a compute job > > >>>>>>>>>that's going > > >>>>>>>>> to run > > >>>>>>>>> >>>> for > > >>>>>>>>> >>>> > another 5ms. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> As the VM_BIND queue is per VM, VM_BIND on one > > VM > > >>>>>>>>>doesn't block > > >>>>>>>>> the > > >>>>>>>>> >>>> VM_BIND > > >>>>>>>>> >>>> on other VMs. I am not sure about usecases > > >>>>here, but just > > >>>>>>>>> wanted to > > >>>>>>>>> >>>> clarify. > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> Yes, that's what I would expect. > > >>>>>>>>> >>>> --Jason > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> Niranjana > > >>>>>>>>> >>>> > > >>>>>>>>> >>>> > For reference, Windows solves this by allowing > > >>>>>>>>> arbitrarily many > > >>>>>>>>> >>>> paging > > >>>>>>>>> >>>> > queues (what they call a VM_BIND > > >>>>>>>>>engine/queue). That > > >>>>>>>>> >>>>design works > > >>>>>>>>> >>>> > pretty well and solves the problems in > > >>>>>>>>>question. >>>>Again, we could > > >>>>>>>>> >>>> just > > >>>>>>>>> >>>> > make everything out-of-order and require > > >>>>>>>>>using syncobjs > > >>>>>>>>> >>>>to order > > >>>>>>>>> >>>> things > > >>>>>>>>> >>>> > as userspace wants. That'd be fine too. > > >>>>>>>>> >>>> > One more note while I'm here: danvet said > > >>>>>>>>>something on > > >>>>>>>>> >>>>IRC about > > >>>>>>>>> >>>> VM_BIND > > >>>>>>>>> >>>> > queues waiting for syncobjs to > > >>>>>>>>>materialize. We don't > > >>>>>>>>> really > > >>>>>>>>> >>>> want/need > > >>>>>>>>> >>>> > this. We already have all the machinery in > > >>>>>>>>>userspace to > > >>>>>>>>> handle > > >>>>>>>>> >>>> > wait-before-signal and waiting for syncobj > > >>>>>>>>>fences to > > >>>>>>>>> >>>>materialize > > >>>>>>>>> >>>> and > > >>>>>>>>> >>>> > that machinery is on by default. It > > >>>>would actually > > >>>>>>>>> >>>>take MORE work > > >>>>>>>>> >>>> in > > >>>>>>>>> >>>> > Mesa to turn it off and take advantage of > > >>>>>>>>>the kernel > > >>>>>>>>> >>>>being able to > > >>>>>>>>> >>>> wait > > >>>>>>>>> >>>> > for syncobjs to materialize. 
Also, getting > > >>>>>>>>>that right is > > >>>>>>>>> >>>> ridiculously > > >>>>>>>>> >>>> > hard and I really don't want to get it > > >>>>>>>>>wrong in kernel > > >>>>>>>>> >>>>space. �� When we > > >>>>>>>>> >>>> > do memory fences, wait-before-signal will > > >>>>>>>>>be a thing. We > > >>>>>>>>> don't > > >>>>>>>>> >>>> need to > > >>>>>>>>> >>>> > try and make it a thing for syncobj. > > >>>>>>>>> >>>> > --Jason > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > Thanks Jason, > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > I missed the bit in the Vulkan spec that > > >>>>>>>>>we're allowed to > > >>>>>>>>> have a > > >>>>>>>>> >>>> sparse > > >>>>>>>>> >>>> > queue that does not implement either graphics > > >>>>>>>>>or compute > > >>>>>>>>> >>>>operations > > >>>>>>>>> >>>> : > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > "While some implementations may include > > >>>>>>>>> >>>> VK_QUEUE_SPARSE_BINDING_BIT > > >>>>>>>>> >>>> > support in queue families that also include > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > graphics and compute support, other > > >>>>>>>>>implementations may > > >>>>>>>>> only > > >>>>>>>>> >>>> expose a > > >>>>>>>>> >>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > family." > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > So it can all be all a vm_bind engine that > > >>>>just does > > >>>>>>>>> bind/unbind > > >>>>>>>>> >>>> > operations. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > But yes we need another engine for the > > >>>>>>>>>immediate/non-sparse > > >>>>>>>>> >>>> operations. > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > -Lionel > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > > > > >>>>>>>>> >>>> > Daniel, any thoughts? > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > Niranjana > > >>>>>>>>> >>>> > > > >>>>>>>>> >>>> > >Matt > > >>>>>>>>> >>>> > > > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> Sorry I noticed this late. > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> -Lionel > > >>>>>>>>> >>>> > >> > > >>>>>>>>> >>>> > >> > > >>>>>>> > > >>>>>>> > > >>>>> > > > > > >