On Thu, Oct 19, 2023 at 05:43:11PM +0100, Robin Murphy wrote:
> On 19/10/2023 4:25 pm, Chuck Lever wrote:
> > The SunRPC stack manages pages (and eventually, folios) via an
> > array of struct biovec items within struct xdr_buf. We have not
> > fully committed to replacing the struct page array in xdr_buf
> > because, although the socket API supports biovec arrays, the RDMA
> > stack uses struct scatterlist rather than struct biovec.
> >
> > This (incomplete) series explores what it might look like if the
> > RDMA core API could support struct biovec array arguments. The
> > series compiles on x86, but I haven't tested it further. I'm posting
> > early in hopes of starting further discussion.
> >
> > Are there other upper layer API consumers, besides SunRPC, who might
> > prefer the use of biovec over scatterlist?
> >
> > Besides handling folios as well as single pages in bv_page, what
> > other work might be needed in the DMA layer?
>
> Eww, please no. It's already well established that the scatterlist design is
> horrible and we want to move to something sane and actually suitable for
> modern DMA scenarios. Something where callers can pass a set of
> pages/physical address ranges in, and get a (separate) set of DMA ranges
> out. Without any bonkers packing of different-length lists into the same
> list structure. IIRC Jason did a bit of prototyping a while back, but it may
> be looking for someone else to pick up the idea and give it some more
> attention.

I put it aside for the moment as the direction changed somewhat after
the conference.

> What we definitely don't want at this point is a copy-paste of the same bad
> design with all the same problems. I would have to NAK patch 8 on principle,
> because the existing iommu_dma_map_sg() stuff has always been utterly mad,
> but it had to be to work around the limitations of the existing scatterlist
> design while bridging between two other established APIs; there's no good
> excuse for having *two* copies of all that to maintain if one doesn't have
> an existing precedent to fit into.

The idea from HCH I've been going toward was to allow each subsystem
to do what made sense for it. The DMA API would provide some more
generic interfaces that could be used to implement a map_sg without
having to be tightly coupled to the DMA subsystem itself.

The concept would be to allow something like NVMe to go directly from
the current BIO into its native HW format, without having to do a round
trip into an intermediate storage array.

How this translates to RDMA work requests I haven't thought about.
This is a large enough thing that I need some mlx5 driver support to
do the first step, and that was supposed to be this month, but a war
has caused some delay :(

RDMA has a complicated historical relationship to the DMA API, sadly.

This plan also wants the significant archs to all use the common
dma-iommu - now that s390 is migrated only power remains...

Jason
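
For illustration only, here is a rough userspace sketch of the "ranges in,
separate ranges out" shape Robin describes above. Nothing here is a kernel
API or taken from any posted patch: struct phys_range, struct dma_range,
map_ranges() and the fixed dma_offset are all hypothetical, and the
"mapping" is just an address offset standing in for whatever a real IOMMU
or direct-mapping path would do. The point is only that the input list and
the output list are separate arrays that can have different lengths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input: one physically contiguous CPU-side range. */
struct phys_range {
	uint64_t paddr;
	uint64_t len;
};

/* Hypothetical output: one contiguous device-visible range. */
struct dma_range {
	uint64_t daddr;
	uint64_t len;
};

/*
 * Toy mapping: pretend the device sees memory at paddr + dma_offset,
 * and merge inputs whose device addresses turn out to be adjacent.
 * Returns the number of output ranges written, or -1 if out[] is
 * too small to hold the result.
 */
static int map_ranges(const struct phys_range *in, size_t n_in,
		      struct dma_range *out, size_t n_out,
		      uint64_t dma_offset)
{
	size_t used = 0;

	assert(in != NULL && out != NULL);

	for (size_t i = 0; i < n_in; i++) {
		uint64_t daddr = in[i].paddr + dma_offset;

		if (in[i].len == 0)
			continue;	/* nothing to map */

		/* Extend the previous output range if this one follows it. */
		if (used > 0 &&
		    out[used - 1].daddr + out[used - 1].len == daddr) {
			out[used - 1].len += in[i].len;
			continue;
		}

		if (used == n_out)
			return -1;	/* caller's output array is full */

		out[used].daddr = daddr;
		out[used].len = in[i].len;
		used++;
	}

	return (int)used;
}
```

With three inputs where the first two are adjacent, the caller gets two
output ranges back. The input array is never modified, so there is no
packing of two different-length lists into one structure.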