On Thu, Jun 27, 2019 at 10:49:43AM -0600, Logan Gunthorpe wrote:
> > I don't think a GPU/FPGA driver will be involved, this would enter
> > the block layer through the O_DIRECT path or something generic..
> > This is the general flow I was suggesting to Dan earlier
>
> I would say the O_DIRECT path has to somehow call into the driver
> backing the VMA to get an address to appropriate memory (in some way
> vaguely similar to how we were discussing at LSF/MM)

Maybe, maybe not. For something like VFIO the PTE already has the
correct phys_addr_t and we don't need to do anything.

For DEVICE_PRIVATE we need to get the phys_addr_t out - presumably
through a new pagemap op?

> If P2P can't be done at that point, then the provider driver would
> do the copy to system memory, in the most appropriate way, and
> return regular pages for O_DIRECT to submit to the block device.

That only makes sense for the migratable DEVICE_PRIVATE case; it
doesn't help the VFIO-like case, where you'd need a bounce buffer.

> >> I think it would be a larger layering violation to have the NVMe
> >> driver (for example) memcpy data off a GPU's bar during a dma_map
> >> step to support this bouncing. And it's even crazier to expect a
> >> DMA transfer to be setup in the map step.
> >
> > Why? Don't we already expect the DMA mapper to handle bouncing for
> > lots of cases, how is this case different? This is the best place
> > to place it to make it shared.
>
> This is different because it's special memory where the DMA mapper
> can't possibly know the best way to transfer the data.

Why not? If we have a 'bar info' structure, it could have data
transfer op callbacks. In fact, I think we might already have similar
callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.

> One could argue that the hook to the GPU/FPGA driver could be in the
> mapping step but then we'd have to do lookups based on an address --
> whereas the VMA could more easily have a hook back to whatever
> driver exported it.

The trouble with a VMA hook is that it is only really available when
working with the VA. It is not actually available during GUP; you
have to have a GUP-like thing, such as hmm_range_snapshot, that is
specifically VMA based. And it is certainly not available during
dma_map.

When working with VMAs/etc it seems there are some good reasons to
drive things off of the PTE content (either via struct page & pgmap,
or via phys_addr_t & barmap).

I think the best reason to prefer a uniform phys_addr_t is that it
does give us the option to copy the data to/from CPU memory. That
option goes away as soon as the bio sometimes provides a dma_addr_t.

At least for RDMA, we do have some cases (like siw/rxe, hfi) where
the drivers sometimes need to do that copy. I suspect the block stack
is similar in the general case.
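To make the 'bar info' idea a bit more concrete, something like the
below is roughly what I have in mind. This is a hand-wavey sketch
only - struct bar_info, bar_copy_ops and all the other names here are
made up for illustration, nothing like this exists today:

  #include <linux/types.h>
  #include <linux/device.h>
  #include <linux/mm_types.h>

  struct bar_info;

  struct bar_copy_ops {
  	/*
  	 * Copy 'len' bytes between the device BAR at 'bar_off' and a
  	 * regular system memory page, using whatever engine (CPU MMIO
  	 * access, a device DMA engine, ...) the exporting driver
  	 * thinks is best for its memory.
  	 */
  	int (*copy_from_bar)(struct bar_info *bar, unsigned long bar_off,
  			     struct page *dst, unsigned int dst_off,
  			     size_t len);
  	int (*copy_to_bar)(struct bar_info *bar, unsigned long bar_off,
  			   struct page *src, unsigned int src_off,
  			   size_t len);
  };

  struct bar_info {
  	phys_addr_t start;		/* physical base of the BAR */
  	resource_size_t size;
  	struct device *owner;		/* exporting device */
  	const struct bar_copy_ops *ops;
  };

Then the dma_map path could look up the bar_info from the phys_addr_t
(a pgmap-style lookup) and, when the fabric can't do P2P between the
two devices, call copy_from_bar()/copy_to_bar() to bounce through
system memory - rather than having the NVMe driver open-code a memcpy
off a GPU's BAR.

Jason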