On Tue, Oct 06, 2020 at 08:17:05PM +0200, Daniel Vetter wrote:

> So on the gpu we pipeline this all. So step 4 doesn't happen on the
> cpu, but instead we queue up a bunch of command buffers so that the
> gpu writes these pagetables (and the flushes tlbs and then does the
> actual stuff userspace wants it to do).

mlx5 HW does basically this as well. We just apply scheduling for this
work on the device, not in the CPU.

> just queue it all up and let the gpu scheduler sort out the mess. End
> result is that you get a sgt that points at stuff which very well
> might have nothing even remotely resembling your buffer in there at
> the moment. But all the copy operations are queued up, so rsn the data
> will also be there.

The explanation makes sense, thanks.

> But rdma doesn't work like that, so it looks all a bit funny.

Well, I guess it could, but how would it make anything better? I can
overlap building the SGL and the device PTEs with something else doing
'move', but is that a workload that needs such aggressive optimization?

> This is also why the precise semantics of move_notify for gpu<->gpu
> sharing took forever to discuss and are still a bit wip, because you
> have the inverse problem: The dma api mapping might still be there

Seems like this all makes a graph of operations where the next one
can't start until all its deps are finished. Actually sounds a lot
like futures.

It would be clearer if this attach API provided some indication that
the SGL it returns is actually a future valid SGL, ie one that only
becomes usable once the queued work completes..

Jason
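
Roughly the shape I have in mind on the importer side, just as a
sketch (the function name is made up; using the reservation object's
fences as the "resolve" step is my assumption of how the future would
be waited on, and a real importer would presumably do this
asynchronously and honour move_notify rather than blocking here):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/sched.h>

/* Sketch: map a dynamic attachment, then wait for the exporter's
 * queued work before treating the SGL as holding real data.
 */
static struct sg_table *
rdma_map_and_resolve(struct dma_buf_attachment *attach)
{
        struct dma_resv *resv = attach->dmabuf->resv;
        struct sg_table *sgt;
        long ret;

        dma_resv_lock(resv, NULL);

        /* For a dynamic attachment this SGL may point at pages the
         * exporter is still filling from queued copy work.
         */
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt))
                goto out_unlock;

        /* "Resolve the future": block until every fence on the
         * buffer has signalled, ie the data is actually there.
         */
        ret = dma_resv_wait_timeout_rcu(resv, true, true,
                                        MAX_SCHEDULE_TIMEOUT);
        if (ret < 0) {
                dma_buf_unmap_attachment(attach, sgt,
                                         DMA_BIDIRECTIONAL);
                sgt = ERR_PTR(ret);
        }

out_unlock:
        dma_resv_unlock(resv);
        return sgt;
}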