Re: [PATCH drm-next 13/14] drm/nouveau: implement new VM_BIND UAPI

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Fri, 20 Jan 2023 11:22:45 +0100



On Thu, 19 Jan 2023 16:38:06 +0000
Matthew Brost <matthew.brost@xxxxxxxxx> wrote:

> On Thu, Jan 19, 2023 at 04:36:43PM +0100, Danilo Krummrich wrote:
> > On 1/19/23 05:58, Matthew Brost wrote:  
> > > On Thu, Jan 19, 2023 at 04:44:23AM +0100, Danilo Krummrich wrote:  
> > > > On 1/18/23 21:37, Thomas Hellström (Intel) wrote:  
> > > > > 
> > > > > On 1/18/23 07:12, Danilo Krummrich wrote:  
> > > > > > This commit provides the implementation for the new uapi motivated by the
> > > > > > Vulkan API. It allows user mode drivers (UMDs) to:
> > > > > > 
> > > > > > 1) Initialize a GPU virtual address (VA) space via the new
> > > > > >      DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the portion of VA
> > > > > >      space managed by the kernel and userspace, respectively.
> > > > > > 
> > > > > > 2) Allocate and free a VA space region as well as bind and unbind memory
> > > > > >      to the GPUs VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl.
> > > > > >      UMDs can request the named operations to be processed either
> > > > > >      synchronously or asynchronously. It supports DRM syncobjs
> > > > > >      (incl. timelines) as synchronization mechanism. The management of the
> > > > > >      GPU VA mappings is implemented with the DRM GPU VA manager.
> > > > > > 
> > > > > > 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. The
> > > > > >      execution happens asynchronously. It supports DRM syncobj (incl.
> > > > > >      timelines) as synchronization mechanism. DRM GEM object locking is
> > > > > >      handled with drm_exec.
> > > > > > 
> > > > > > Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC, use the DRM
> > > > > > GPU scheduler for the asynchronous paths.
> > > > > > 
> > > > > > Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx>
> > > > > > ---
> > > > > >    Documentation/gpu/driver-uapi.rst       |   3 +
> > > > > >    drivers/gpu/drm/nouveau/Kbuild          |   2 +
> > > > > >    drivers/gpu/drm/nouveau/Kconfig         |   2 +
> > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.c |  16 +
> > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.h |   1 +
> > > > > >    drivers/gpu/drm/nouveau/nouveau_drm.c   |  23 +-
> > > > > >    drivers/gpu/drm/nouveau/nouveau_drv.h   |   9 +-
> > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.c  | 310 ++++++++++
> > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.h  |  55 ++
> > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.c | 780 ++++++++++++++++++++++++
> > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.h |  98 +++
> > > > > >    11 files changed, 1295 insertions(+), 4 deletions(-)
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h  
> > > > > ...  
> > > > > > 
> > > > > > +static struct dma_fence *
> > > > > > +nouveau_bind_job_run(struct nouveau_job *job)
> > > > > > +{
> > > > > > +    struct nouveau_bind_job *bind_job = to_nouveau_bind_job(job);
> > > > > > +    struct nouveau_uvmm *uvmm = nouveau_cli_uvmm(job->cli);
> > > > > > +    struct bind_job_op *op;
> > > > > > +    int ret = 0;
> > > > > > +  
> > > > > 
> > > > > I was looking at how nouveau does the async binding compared to how xe
> > > > > does it.
> > > > > It looks to me that this function being a scheduler run_job callback is
> > > > > the main part of the VM_BIND dma-fence signalling critical section for
> > > > > the job's done_fence and if so, needs to be annotated as such?  
> > > > 
> > > > Yes, that's the case.
> > > >   
> > > > > 
> > > > > For example nouveau_uvma_region_new allocates memory, which is not
> > > > > allowed if in a dma_fence signalling critical section and the locking
> > > > > also looks suspicious?  
> > > > 
> > > > Thanks for pointing this out, I missed that somehow.
> > > > 
> > > > I will change it to pre-allocate new regions, mappings and page tables
> > > > within the job's submit() function.
> > > >   
> > > 
> > > Yea that what we basically do in Xe, in the IOCTL step allocate all the
> > > backing store for new page tables, populate new page tables (these are
> > > not yet visible in the page table structure), and in last step which is
> > > executed after all the dependencies are satified program all the leaf
> > > entires making the new binding visible.
> > > 
> > > We screwed have this up by defering most of the IOCTL to a worker but
> > > will fix this fix this one way or another soon - get rid of worker or
> > > introduce a type of sync that is signaled after the worker + publish the
> > > dma-fence in the worker. I'd like to close on this one soon.  
> > > > For the ops structures the drm_gpuva_manager allocates for reporting the
> > > > split/merge steps back to the driver I have ideas to entirely avoid
> > > > allocations, which also is a good thing in respect of Christians feedback
> > > > regarding the huge amount of mapping requests some applications seem to
> > > > generate.
> > > >   
> > > 
> > > It should be fine to have allocations to report the split/merge step as
> > > this step should be before a dma-fence is published, but yea if possible
> > > to avoid extra allocs as that is always better.  
> > 
> > I think we can't really ask for the split/merge steps before actually
> > running the job, since it requires the particular VA space not to change
> > while performing those operations.
> > 
> > E.g. if we'd run the split/merge steps at job submit() time the underlying
> > VA space could be changed by other bind jobs executing before this one,
> > which would make the calculated split/merge steps obsolete and wrong.
> >   
> 
> Hmm, maybe I'm not understanding this implementation, admittedly I
> haven't studied the gpuva manager code in detail.
> 
> Let me explain what we are doing in Xe.
> 
> Map 0x0000 - 0x3000 -> this resolves into 1 bind operation and 1 VMA
> Unmap 0x1000-0x2000 -> this resolves into 1 unbind and 2 rebind operations
> 
> 1. unbind 0x0000-0x3000 -> destroy old VMA
> 2. rebind 0x0000-0x1000 -> new VMA
> 3. rebind 0x2000-0x3000 -> new VMA
> 
> All of the above steps resolving the operations can be done in the IOCTL
> phase and VM's VMA structure is also updated. When the dependencies
> are resolved the actual bindings are done on the GPU. We use the BO's
> dma-resv slots to ensure there is never a window 0x0000-0x1000 and
> 0x2000-0x3000 are never mapped with respect to execs (I forget the exact
> details of how we do this but if you want to know I'll explain further).
> 

Ok, so I've been contemplating the idea of pre-reserving memory for any
future page-table updates, so I can guarantee the bind/unbind op in
->run_job() never fails (that's actually made more complicated in my
case, because we don't directly control the page table updates, but
defer that to the iommu/iopagetbl framework which does the allocation,
so I didn't really go as far as you did). But with bind ops happening in
a queue with dependencies to wait on, guessing what the page-table will
look like is a bit challenging. Sure, we can pre-allocate pages for all
levels needed to reach the leaf node(s) we're trying to insert or plan
for the worst case scenario in case of 2MB -> 4K block splits for
partial unmaps, but it sounds like a lot of memory reservation,
especially if we get lot of bind requests queued. Just curious to hear
how you solved that.