On Thu, 19 Jan 2023 16:38:06 +0000 Matthew Brost <matthew.brost@xxxxxxxxx> wrote: > On Thu, Jan 19, 2023 at 04:36:43PM +0100, Danilo Krummrich wrote: > > On 1/19/23 05:58, Matthew Brost wrote: > > > On Thu, Jan 19, 2023 at 04:44:23AM +0100, Danilo Krummrich wrote: > > > > On 1/18/23 21:37, Thomas Hellström (Intel) wrote: > > > > > > > > > > On 1/18/23 07:12, Danilo Krummrich wrote: > > > > > > This commit provides the implementation for the new uapi motivated by the > > > > > > Vulkan API. It allows user mode drivers (UMDs) to: > > > > > > > > > > > > 1) Initialize a GPU virtual address (VA) space via the new > > > > > > DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the portion of VA > > > > > > space managed by the kernel and userspace, respectively. > > > > > > > > > > > > 2) Allocate and free a VA space region as well as bind and unbind memory > > > > > > to the GPUs VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl. > > > > > > UMDs can request the named operations to be processed either > > > > > > synchronously or asynchronously. It supports DRM syncobjs > > > > > > (incl. timelines) as synchronization mechanism. The management of the > > > > > > GPU VA mappings is implemented with the DRM GPU VA manager. > > > > > > > > > > > > 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. The > > > > > > execution happens asynchronously. It supports DRM syncobj (incl. > > > > > > timelines) as synchronization mechanism. DRM GEM object locking is > > > > > > handled with drm_exec. > > > > > > > > > > > > Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC, use the DRM > > > > > > GPU scheduler for the asynchronous paths. > > > > > > > > > > > > Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx> > > > > > > --- > > > > > > Documentation/gpu/driver-uapi.rst | 3 + > > > > > > drivers/gpu/drm/nouveau/Kbuild | 2 + > > > > > > drivers/gpu/drm/nouveau/Kconfig | 2 + > > > > > > drivers/gpu/drm/nouveau/nouveau_abi16.c | 16 + > > > > > > drivers/gpu/drm/nouveau/nouveau_abi16.h | 1 + > > > > > > drivers/gpu/drm/nouveau/nouveau_drm.c | 23 +- > > > > > > drivers/gpu/drm/nouveau/nouveau_drv.h | 9 +- > > > > > > drivers/gpu/drm/nouveau/nouveau_exec.c | 310 ++++++++++ > > > > > > drivers/gpu/drm/nouveau/nouveau_exec.h | 55 ++ > > > > > > drivers/gpu/drm/nouveau/nouveau_sched.c | 780 ++++++++++++++++++++++++ > > > > > > drivers/gpu/drm/nouveau/nouveau_sched.h | 98 +++ > > > > > > 11 files changed, 1295 insertions(+), 4 deletions(-) > > > > > > create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c > > > > > > create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h > > > > > > create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c > > > > > > create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h > > > > > ... > > > > > > > > > > > > +static struct dma_fence * > > > > > > +nouveau_bind_job_run(struct nouveau_job *job) > > > > > > +{ > > > > > > + struct nouveau_bind_job *bind_job = to_nouveau_bind_job(job); > > > > > > + struct nouveau_uvmm *uvmm = nouveau_cli_uvmm(job->cli); > > > > > > + struct bind_job_op *op; > > > > > > + int ret = 0; > > > > > > + > > > > > > > > > > I was looking at how nouveau does the async binding compared to how xe > > > > > does it. > > > > > It looks to me that this function being a scheduler run_job callback is > > > > > the main part of the VM_BIND dma-fence signalling critical section for > > > > > the job's done_fence and if so, needs to be annotated as such? > > > > > > > > Yes, that's the case. > > > > > > > > > > > > > > For example nouveau_uvma_region_new allocates memory, which is not > > > > > allowed if in a dma_fence signalling critical section and the locking > > > > > also looks suspicious? > > > > > > > > Thanks for pointing this out, I missed that somehow. > > > > > > > > I will change it to pre-allocate new regions, mappings and page tables > > > > within the job's submit() function. > > > > > > > > > > Yea that what we basically do in Xe, in the IOCTL step allocate all the > > > backing store for new page tables, populate new page tables (these are > > > not yet visible in the page table structure), and in last step which is > > > executed after all the dependencies are satified program all the leaf > > > entires making the new binding visible. > > > > > > We screwed have this up by defering most of the IOCTL to a worker but > > > will fix this fix this one way or another soon - get rid of worker or > > > introduce a type of sync that is signaled after the worker + publish the > > > dma-fence in the worker. I'd like to close on this one soon. > > > > For the ops structures the drm_gpuva_manager allocates for reporting the > > > > split/merge steps back to the driver I have ideas to entirely avoid > > > > allocations, which also is a good thing in respect of Christians feedback > > > > regarding the huge amount of mapping requests some applications seem to > > > > generate. > > > > > > > > > > It should be fine to have allocations to report the split/merge step as > > > this step should be before a dma-fence is published, but yea if possible > > > to avoid extra allocs as that is always better. > > > > I think we can't really ask for the split/merge steps before actually > > running the job, since it requires the particular VA space not to change > > while performing those operations. > > > > E.g. if we'd run the split/merge steps at job submit() time the underlying > > VA space could be changed by other bind jobs executing before this one, > > which would make the calculated split/merge steps obsolete and wrong. > > > > Hmm, maybe I'm not understanding this implementation, admittedly I > haven't studied the gpuva manager code in detail. > > Let me explain what we are doing in Xe. > > Map 0x0000 - 0x3000 -> this resolves into 1 bind operation and 1 VMA > Unmap 0x1000-0x2000 -> this resolves into 1 unbind and 2 rebind operations > > 1. unbind 0x0000-0x3000 -> destroy old VMA > 2. rebind 0x0000-0x1000 -> new VMA > 3. rebind 0x2000-0x3000 -> new VMA > > All of the above steps resolving the operations can be done in the IOCTL > phase and VM's VMA structure is also updated. When the dependencies > are resolved the actual bindings are done on the GPU. We use the BO's > dma-resv slots to ensure there is never a window 0x0000-0x1000 and > 0x2000-0x3000 are never mapped with respect to execs (I forget the exact > details of how we do this but if you want to know I'll explain further). > Ok, so I've been contemplating the idea of pre-reserving memory for any future page-table updates, so I can guarantee the bind/unbind op in ->run_job() never fails (that's actually made more complicated in my case, because we don't directly control the page table updates, but defer that to the iommu/iopagetbl framework which does the allocation, so I didn't really go as far as you did). But with bind ops happening in a queue with dependencies to wait on, guessing what the page-table will look like is a bit challenging. Sure, we can pre-allocate pages for all levels needed to reach the leaf node(s) we're trying to insert or plan for the worst case scenario in case of 2MB -> 4K block splits for partial unmaps, but it sounds like a lot of memory reservation, especially if we get lot of bind requests queued. Just curious to hear how you solved that.