Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region

Daniel Vetter <daniel@xxxxxxxx> · Tue, 6 Oct 2020 21:12:24 +0200

On Tue, Oct 6, 2020 at 8:38 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Tue, Oct 06, 2020 at 08:17:05PM +0200, Daniel Vetter wrote:
>
> > So on the gpu we pipeline this all. So step 4 doesn't happen on the
> > cpu, but instead we queue up a bunch of command buffers so that the
> > gpu writes these pagetables (and then flushes tlbs and then does the
> > actual stuff userspace wants it to do).
>
> mlx5 HW does basically this as well.
>
> We just apply scheduling for this work on the device, not in the CPU.
>
> > just queue it all up and let the gpu scheduler sort out the mess. End
> > result is that you get a sgt that points at stuff which very well
> > might have nothing even remotely resembling your buffer in there at
> > the moment. But all the copy operations are queued up, so rsn the data
> > will also be there.
>
> The explanation makes sense, thanks
>
> > But rdma doesn't work like that, so it looks all a bit funny.
>
> Well, I guess it could, but how would it make anything better? I can
> overlap building the SGL and the device PTEs with something else doing
> 'move', but is that a workload that needs such aggressive optimization?

The compounding issue with gpus is that we need entire lists of
buffers, atomically, for our dma operations. Which means that the
cliff you jump over with a working set that's slightly too big is very
steep, so that you have to pipeline your buffer moves interleaved with
dma operations to keep the hw busy. Having per page fault handling and
hw that can continue in other places while that fault is repaired
should smooth that cliff out enough that you don't need to bother.

I think at worst we might worry about unfairness. With the "entire
list of buffers" workload model gpus might starve out rdma badly by
constantly moving all the buffers around. Installing a dma_fence in
the rdma page fault handler, to keep the dma-buf busy for a small
amount of time to make sure at least the next rdma transfer goes
through without more faults should be able to fix that though. Such a
keepalive fence should be in the shared slots for dma_resv, to not
block other access. This wouldn't even need any other changes in
rdma (although delaying the pte zapping when we get a move_notify
would be better), since an active fence alone makes that buffer a much
less likely target for eviction.
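
To make the keepalive idea a bit more concrete, here is a rough
user-space toy model, not kernel code. Fence, Resv, Buffer,
rdma_fault() and pick_eviction_victim() are all made-up names for
illustration, the 5ms keepalive is an arbitrary assumption, and real
eviction heuristics are considerably more involved than "prefer idle
buffers":

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fence:
    """Toy stand-in for a dma_fence: signaled once `deadline` has passed."""
    deadline: float

    def signaled(self, now: float) -> bool:
        return now >= self.deadline


@dataclass
class Resv:
    """Toy stand-in for dma_resv: one exclusive slot plus shared slots."""
    exclusive: Optional[Fence] = None
    shared: List[Fence] = field(default_factory=list)

    def busy(self, now: float) -> bool:
        # Any unsignaled fence, shared or exclusive, marks the buffer busy.
        fences = list(self.shared)
        if self.exclusive is not None:
            fences.append(self.exclusive)
        return any(not f.signaled(now) for f in fences)

    def blocks_shared_access(self, now: float) -> bool:
        # In this model only an unsignaled exclusive fence makes other
        # shared users wait; shared fences do not block each other.
        return self.exclusive is not None and not self.exclusive.signaled(now)


@dataclass
class Buffer:
    name: str
    resv: Resv = field(default_factory=Resv)


def rdma_fault(buf: Buffer, now: float, keepalive: float = 0.005) -> None:
    """Hypothetical fault handler: after repairing the fault, install a
    short-lived keepalive fence in a shared slot."""
    buf.resv.shared.append(Fence(deadline=now + keepalive))


def pick_eviction_victim(buffers: List[Buffer], now: float) -> Buffer:
    """Assumed policy: evict an idle buffer if there is one, otherwise
    fall back to the first buffer in the list."""
    idle = [b for b in buffers if not b.resv.busy(now)]
    return (idle or buffers)[0]
```

In this model a buffer that just took an rdma fault is skipped by the
eviction picker until the keepalive expires, and because the fence sits
in a shared slot it never makes other shared users wait.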

> > This is also why the precise semantics of move_notify for gpu<->gpu
> > sharing took forever to discuss and are still a bit wip, because you
> > have the inverse problem: The dma api mapping might still be there
>
> Seems like this all makes a graph of operations, can't start the next
> one until all deps are finished. Actually sounds a lot like futures.
>
> Would be clearer if this attach API provided some indication that the
> SGL is actually a future valid SGL..

Yeah I think one of the things we've discussed is whether dma_buf
should pass around the fences more explicitly, or whether we should
continue to smash on the more implicit dma_resv tracking. Inertia won
out, at least for now, because gpu drivers do all the book-keeping
directly in the shared dma_resv structure anyway, so this wouldn't
have helped to get cleaner code.
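
For illustration only, the "future valid SGL" variant could look
roughly like the sketch below. This is a toy model in Python, not a
proposal for the dma-buf interface: Mapping, map_attachment() and
start_dma() are hypothetical names, and a concurrent.futures.Future
stands in for the fence that would otherwise be tracked implicitly in
dma_resv.

```python
from concurrent.futures import Future
from typing import List, Optional


class Mapping:
    """An SGL plus an explicit future saying when its contents are valid."""

    def __init__(self, sgl: List[int], ready: Future):
        self.sgl = sgl
        self.ready = ready  # resolved once the pending buffer move is done


def map_attachment(pages: List[int]) -> Mapping:
    """Hypothetical attach/map step: hand out the SGL immediately, even
    though the data may not have been copied into place yet."""
    return Mapping(sgl=list(pages), ready=Future())


def start_dma(mapping: Mapping, timeout: Optional[float] = None) -> List[str]:
    """The importer must wait for the future before touching the SGL."""
    mapping.ready.result(timeout=timeout)  # blocks until the move completes
    return [f"dma:{page}" for page in mapping.sgl]
```

The exporter would call mapping.ready.set_result(None) once the queued
copies have landed. The point of the sketch is only that the dependency
becomes visible in the return type instead of living in shared state.
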
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch