On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote: > > > Finally, it may be possible to stop using scatterlist to describe the > > input to the DMA-mapping operation. We may be able to get struct > > scatterlist down to just dma_address and dma_length, with chaining > > handled through an enclosing struct. > > Can you talk about this some more? IMHO one of the key properties of > the scatterlist is that it can hold huge amounts of pages without > having to do any kind of special allocation due to the chaining. > > The same will be true of the phyr idea right? My thinking is that we'd pass a relatively small array of phyr (maybe 16 entries) to get_user_phyr(). If that turned out not to be big enough, then we have two options; one is to map those 16 ranges with sg and use the sg chaining functionality before throwing away the phyr and calling get_user_phyr() again. The other is to stash those 16 ranges somewhere (eg a resizing array of some kind) and keep calling get_user_phyr() to get the next batch of 16; once we've got the entire range, call sg_map_phyr() passing all of the phyrs. > > I would like to see phyr replace bio_vec everywhere it's currently used. > > I don't have time to do that work now because I'm busy with folios. > > If someone else wants to take that on, I shall cheer from the sidelines. > > What I do intend to do is: > > I wonder if we mixed things though.. > > IMHO there is a lot of optimization to be had by having a > datastructure that is expressly 'the physical pages underlying a > contiguous chunk of va' > > If you limit to that scenario then we can be more optimal because > things like byte granular offsets and size in the interior pages don't > need to exist. Every interior chunk is always aligned to its order and > we only need to record the order. > > An overall starting offset and total length allow computing the slice > of the first/last entry. > > If the physical address is always aligned then we get 12 free bits > from the min 4k alignment and also only need to store order, not an > arbitary byte granular length. > > The win is I think we can meaningfully cover most common cases using > only 8 bytes per physical chunk. The 12 bits can be used to encode the > common orders (4k, 2M, 1G, etc) and some smart mechanism to get > another 16 bits to cover 'everything'. > > IMHO storage density here is quite important, we end up having to keep > this stuff around for a long time. > > I say this here, because I've always though bio_vec/etc are more > general than we actually need, being byte granular at every chunk. Oh, I can do you one better on the bit-packing scheme. There's a representation of every power-of-two that is naturally aligned, using just one extra bit. Let's say we have 3 bits of address space and 4 bits to represent any power of two allocation within that address space: 0000 index-0, order-0 0010 index-1, order-0 ... 1110 index-7, order-0 0001 index-0, order-1 0101 index-2, order-1 1001 index-4, order-1 1101 index-6, order-1 0011 index-0, order-2 1011 index-4, order-2 0111 index-0, order-3 1111 has no meaning and can be used to represent an invalid range, if that's useful. The lowest clear bit decodes to the order, and (x & (x+1))/2 gets you the index. That leaves you with another 11 bits to represent something smart about partial pages. The question is whether this is the right kind of optimisation to be doing. I hear you that we want a dense format, but it's questionable whether the kind of thing you're suggesting is actually denser than this scheme. For example, if we have 1GB pages and userspace happens to have allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented as a single phyr. A power-of-two scheme would have us use four entries (3, 4-7, 8-9, 10). Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very cheap. If I have to walk PTEs looking for pages which can be combined together, I end up with interesting behaviour where the length of the list shrinks and expands. Using the example above, as I walk successive PUDs, the data struct looks like this: (3) (3, 4) (3, 4-5) (3, 4-5, 6) (3, 4-7) (3, 4-7, 8) (3, 4-7, 8-9) (3, 4-7, 8-9, 10) We could end up with a situation where we stop because the array is full, even though if we kept going, it'd shrink back down below the length of the array (in this example, an array of length 2 would stop when it saw page 6, even though page 7 shrinks it back down again). > What is needed is a full scatterlist replacement, including the IOMMU > part. > > For the IOMMU I would expect the datastructure to be re-used, we start > with a list of physical pages and then 'dma map' gives us a list of > IOVA physical pages, in another allocation, but exactly the same > datastructure. > > This 'dma map' could return a pointer to the first datastructure if > there is no iommu, allocate a single entry list if the whole thing can > be linearly mapped with the iommu, and other baroque cases (like pci > offset/etc) will need to allocate full array. ie good HW runs fast and > is memory efficient. > > It would be nice to see a patch sketching showing what this > datastructure could look like. > > VFIO would like this structure as well as it currently is a very > inefficient page at a time loop when it iommu maps things. I agree that you need these things. I think I'll run into trouble if I try to do them for you ... so I'm going to stop after doing the top end (pinning pages and getting them into an sg list) and let people who know that area better than I do work on that.