Re: [PATCH drm-next 13/14] drm/nouveau: implement new VM_BIND UAPI

Matthew Brost <matthew.brost@xxxxxxxxx> · Sun, 22 Jan 2023 17:48:37 +0000

On Fri, Jan 20, 2023 at 11:22:45AM +0100, Boris Brezillon wrote:
> On Thu, 19 Jan 2023 16:38:06 +0000
> Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
> 
> > On Thu, Jan 19, 2023 at 04:36:43PM +0100, Danilo Krummrich wrote:
> > > On 1/19/23 05:58, Matthew Brost wrote:  
> > > > On Thu, Jan 19, 2023 at 04:44:23AM +0100, Danilo Krummrich wrote:  
> > > > > On 1/18/23 21:37, Thomas Hellström (Intel) wrote:  
> > > > > > 
> > > > > > On 1/18/23 07:12, Danilo Krummrich wrote:  
> > > > > > > This commit provides the implementation for the new uapi motivated by the
> > > > > > > Vulkan API. It allows user mode drivers (UMDs) to:
> > > > > > > 
> > > > > > > 1) Initialize a GPU virtual address (VA) space via the new
> > > > > > >      DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the portion of VA
> > > > > > >      space managed by the kernel and userspace, respectively.
> > > > > > > 
> > > > > > > 2) Allocate and free a VA space region as well as bind and unbind memory
> > > > > > >      to the GPUs VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl.
> > > > > > >      UMDs can request the named operations to be processed either
> > > > > > >      synchronously or asynchronously. It supports DRM syncobjs
> > > > > > >      (incl. timelines) as synchronization mechanism. The management of the
> > > > > > >      GPU VA mappings is implemented with the DRM GPU VA manager.
> > > > > > > 
> > > > > > > 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. The
> > > > > > >      execution happens asynchronously. It supports DRM syncobj (incl.
> > > > > > >      timelines) as synchronization mechanism. DRM GEM object locking is
> > > > > > >      handled with drm_exec.
> > > > > > > 
> > > > > > > Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC, use the DRM
> > > > > > > GPU scheduler for the asynchronous paths.
> > > > > > > 
> > > > > > > Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx>
> > > > > > > ---
> > > > > > >    Documentation/gpu/driver-uapi.rst       |   3 +
> > > > > > >    drivers/gpu/drm/nouveau/Kbuild          |   2 +
> > > > > > >    drivers/gpu/drm/nouveau/Kconfig         |   2 +
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.c |  16 +
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.h |   1 +
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_drm.c   |  23 +-
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_drv.h   |   9 +-
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.c  | 310 ++++++++++
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.h  |  55 ++
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.c | 780 ++++++++++++++++++++++++
> > > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.h |  98 +++
> > > > > > >    11 files changed, 1295 insertions(+), 4 deletions(-)
> > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c
> > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h
> > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h  
> > > > > > ...  
> > > > > > > 
> > > > > > > +static struct dma_fence *
> > > > > > > +nouveau_bind_job_run(struct nouveau_job *job)
> > > > > > > +{
> > > > > > > +    struct nouveau_bind_job *bind_job = to_nouveau_bind_job(job);
> > > > > > > +    struct nouveau_uvmm *uvmm = nouveau_cli_uvmm(job->cli);
> > > > > > > +    struct bind_job_op *op;
> > > > > > > +    int ret = 0;
> > > > > > > +  
> > > > > > 
> > > > > > I was looking at how nouveau does the async binding compared to how xe
> > > > > > does it.
> > > > > > It looks to me that this function being a scheduler run_job callback is
> > > > > > the main part of the VM_BIND dma-fence signalling critical section for
> > > > > > the job's done_fence and if so, needs to be annotated as such?  
> > > > > 
> > > > > Yes, that's the case.
> > > > >   
> > > > > > 
> > > > > > For example nouveau_uvma_region_new allocates memory, which is not
> > > > > > allowed if in a dma_fence signalling critical section and the locking
> > > > > > also looks suspicious?  
> > > > > 
> > > > > Thanks for pointing this out, I missed that somehow.
> > > > > 
> > > > > I will change it to pre-allocate new regions, mappings and page tables
> > > > > within the job's submit() function.
> > > > >   
> > > > 
> > > > Yea that what we basically do in Xe, in the IOCTL step allocate all the
> > > > backing store for new page tables, populate new page tables (these are
> > > > not yet visible in the page table structure), and in last step which is
> > > > executed after all the dependencies are satified program all the leaf
> > > > entires making the new binding visible.
> > > > 
> > > > We screwed have this up by defering most of the IOCTL to a worker but
> > > > will fix this fix this one way or another soon - get rid of worker or
> > > > introduce a type of sync that is signaled after the worker + publish the
> > > > dma-fence in the worker. I'd like to close on this one soon.  
> > > > > For the ops structures the drm_gpuva_manager allocates for reporting the
> > > > > split/merge steps back to the driver I have ideas to entirely avoid
> > > > > allocations, which also is a good thing in respect of Christians feedback
> > > > > regarding the huge amount of mapping requests some applications seem to
> > > > > generate.
> > > > >   
> > > > 
> > > > It should be fine to have allocations to report the split/merge step as
> > > > this step should be before a dma-fence is published, but yea if possible
> > > > to avoid extra allocs as that is always better.  
> > > 
> > > I think we can't really ask for the split/merge steps before actually
> > > running the job, since it requires the particular VA space not to change
> > > while performing those operations.
> > > 
> > > E.g. if we'd run the split/merge steps at job submit() time the underlying
> > > VA space could be changed by other bind jobs executing before this one,
> > > which would make the calculated split/merge steps obsolete and wrong.
> > >   
> > 
> > Hmm, maybe I'm not understanding this implementation, admittedly I
> > haven't studied the gpuva manager code in detail.
> > 
> > Let me explain what we are doing in Xe.
> > 
> > Map 0x0000 - 0x3000 -> this resolves into 1 bind operation and 1 VMA
> > Unmap 0x1000-0x2000 -> this resolves into 1 unbind and 2 rebind operations
> > 
> > 1. unbind 0x0000-0x3000 -> destroy old VMA
> > 2. rebind 0x0000-0x1000 -> new VMA
> > 3. rebind 0x2000-0x3000 -> new VMA
> > 
> > All of the above steps resolving the operations can be done in the IOCTL
> > phase and VM's VMA structure is also updated. When the dependencies
> > are resolved the actual bindings are done on the GPU. We use the BO's
> > dma-resv slots to ensure there is never a window 0x0000-0x1000 and
> > 0x2000-0x3000 are never mapped with respect to execs (I forget the exact
> > details of how we do this but if you want to know I'll explain further).
> > 
> 
> Ok, so I've been contemplating the idea of pre-reserving memory for any
> future page-table updates, so I can guarantee the bind/unbind op in
> ->run_job() never fails (that's actually made more complicated in my
> case, because we don't directly control the page table updates, but
> defer that to the iommu/iopagetbl framework which does the allocation,
> so I didn't really go as far as you did). But with bind ops happening in
> a queue with dependencies to wait on, guessing what the page-table will
> look like is a bit challenging. Sure, we can pre-allocate pages for all
> levels needed to reach the leaf node(s) we're trying to insert or plan
> for the worst case scenario in case of 2MB -> 4K block splits for
> partial unmaps, but it sounds like a lot of memory reservation,
> especially if we get lot of bind requests queued. Just curious to hear
> how you solved that.

A few thing here.

First the reason we always do an unbind and possibly 2 rebinds is
because we might crossing the 1GB / 2MB page boundary size. In the
example I gave we could just unbind 0x1000-0x2000 and call it day but
we'd have to build smarts into to our code to know that is ok which we
don't have at this time.

e.g. This wouldn't ok just to do an unbind as the page size goes from
2MB to 4k:

Map   0x000000-0x200000 -> 2MB page
Unmap 0x100000-0x200000 -> need 256 4k pages now

For simplity we always do a unbind + 2 possible rebinds. Maybe we will
be a little smarter in the future if needed.

WRT to the memory allocations, we haven't really thought about
consequences of reserving memory for pending binds. My initial reaction
is I don't think this an issue. Page table memory should be relatively
small compared to user memory so having a bunch queue'd up shouldn't
really be an issue, right? Going from 2MB -> 4k splits on either side is
just another 2 4k allocs on either side. Maybe I'm wrong, I just don't
see this being a major issue. If the user abusing the VM bind interface,
e.g. Allocatiing misaligned huge chunks of data, that is kinda on them.

Matt