Re: Phyr Starter

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 11 Jan 2022 16:21:59 -0400

On Tue, Jan 11, 2022 at 06:33:57PM +0000, Matthew Wilcox wrote:

> > Then we are we using get_user_phyr() at all if we are just storing it
> > in a sg?
> 
> I did consider just implementing get_user_sg() (actually 4 years ago),
> but that cements the use of sg as both an input and output data structure
> for DMA mapping, which I am under the impression we're trying to get
> away from.

I know every time I talked about a get_user_sg() Christoph is against
it and we need to stop using scatter list...

> > Also 16 entries is way to small, it should be at least a whole PMD
> > worth so we don't have to relock the PMD level each iteration.
> > 
> > I would like to see a flow more like:
> > 
> >   cpu_phyr_list = get_user_phyr(uptr, 1G);
> >   dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
> >   [..]
> >   dma_unmap_phyr(device, dma_phyr_list);
> >   unpin_drity_free(cpu_phy_list);
> > 
> > Where dma_map_phyr() can build a temporary SGL for old iommu drivers
> > compatability. iommu drivers would want to implement natively, of
> > course.
> > 
> > ie no loops in drivers.
> 
> Let me just rewrite that for you ...
> 
> 	umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len);
> 	umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len,
> 			DMA_BIDIRECTIONAL, dma_attr);
> 	...
> 	dma_unmap_phyr(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl,
> 			umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr);
> 	sg_free_table(umem->sgt);
> 	free_user_phyrs(umem->phyrs, umem->phyr_len);

Why? As above we want to get rid of the sgl, so you are telling me to
adopt phyrs I need to increase the memory consumption by a hefty
amount to store the phyrs and still keep the sgt now? Why?

I don't need the sgt at all. I just need another list of physical
addresses for DMA. I see no issue with a phsr_list storing either CPU
Physical Address or DMA Physical Addresses, same data structure.

In the fairly important passthrough DMA case the CPU list and DMA list
are identical, so we don't even need to do anything.

In the typical iommu case my dma map's phyrs is only one entry.

Other cases require a larger allocation. This is the advantage against
today's scatterlist - it forces 24 bytes/page for *everyone* to
support niche architectures even if 8 bytes would have been fine for a
server platform.

> > > The question is whether this is the right kind of optimisation to be
> > > doing.  I hear you that we want a dense format, but it's questionable
> > > whether the kind of thing you're suggesting is actually denser than this
> > > scheme.  For example, if we have 1GB pages and userspace happens to have
> > > allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
> > > as a single phyr.  A power-of-two scheme would have us use four entries
> > > (3, 4-7, 8-9, 10).
> > 
> > That is not quite what I had in mind..
> > 
> > struct phyr_list {
> >    unsigned int first_page_offset_bytes;
> >    size_t total_length_bytes;
> >    phys_addr_t min_alignment;
> >    struct packed_phyr *list_of_pages;
> > };
> > 
> > Where each 'packed_phyr' is an aligned page of some kind. The packing
> > has to be able to represent any number of pfns, so we have four major
> > cases:
> >  - 4k pfns (use 8 bytes)
> >  - Natural order pfn (use 8 bytes)
> >  - 4k aligned pfns, arbitary number (use 12 bytes)
> >  - <4k aligned, arbitary length (use 16 bytes?)
> > 
> > In all cases the interior pages are fully used, only the first and
> > last page is sliced based on the two parameters in the phyr_list.
> 
> This kind of representation works for a virtually contiguous range.
> Unfortunately, that's not sufficient for some bio users (as I discovered
> after getting a representation like this enshrined in the NVMe spec as
> the PRP List).

This is what I was trying to convay with the 4th bullet, I'm not
suggesting a PRP list.

As an example coding - Use the first 8 bytes to encode this:

 51:0 - Physical address / 4k (ie pfn)
 56:52 - Order (simple, your order encoding can do better)
 61:57 - Unused
 63:62 - Mode, one of:
         00 = natural order pfn (8 bytes)
         01 = order aligned with length (12 bytes)
         10 = arbitary (12 bytes)

Then the optional 4 bytes are used as:

Mode 01 (Up to 2^48 bytes of memory on a 4k alignment)
  31:0 - # of order pages

Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment)
  11:0 - starting byte offset in the 4k
  31:12 - 20 bits, plus the 5 bit order from the first 8 bytes:
          length in bytes

I think this covers everything? I assume BIO cannot be doing
non-aligned contiguous transfers beyond 2M? The above can represent
33M of arbitary contiguous memory at 12 bytes/page.

If BIO really needs > 33M then we can use the extra mode to define a
16 byte entry that will cover everything.

> > The last case is, perhaps, a possible route to completely replace
> > scatterlist. Few places need true byte granularity for interior pages,
> > so we can invent some coding to say 'this is 8 byte aligned, and n
> > bytes long' that only fits < 4k or something. Exceptional cases can
> > then still work. I'm not sure what block needs here - is it just 512?
> 
> Replacing scatterlist is not my goal.  That seems like a lot more work
> for little gain.  

Well, I'm not comfortable with the idea above where RDMA would have to
take a memory penalty to use the new interface. To avoid that memory
penalty we need to get rid of scatterlist entirely.

If we do the 16 byte struct from the first email then a umem for MRs
will increase in memory consumption by 160% compared today's 24
bytes/page. I think the HPC workloads will veto this.

This is exactly why everything has been stuck here for so long. Nobody
wants to build on scatterlist and we don't have anything that can
feasibly replace it.

IMHO scatterlist has the wrong tradeoffs, the way it uses the 24 bytes
per page isn't a win in today's world.

And on the flip side, I don't see the iommu driver people being
enthused to implement something that isn't sufficiently general.

> I just want to delete page_link, offset and length from struct
> scatterlist.  Given the above sequence of calls, we're going to get
> sg lists that aren't chained.  They may have to be vmalloced, but
> they should be contiguous.

I don't understand that? Why would the SGL out of the iommu suddenly
not be chained?

>From what I've heard I'm also not keen on a physr list using vmalloc
either, that is said to be quite slow?

> > I would imagine a few steps to this process:
> >  1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
> >  2) Wrapper around existing gup to get a phyr_list for user VA
> >  3) Compat 'dma_map_phyr()' that coverts a phyr_list to a sgl and back
> >     (However, with full performance for iommu passthrough)
> >  4) Patches changing RDMA/VFIO/DRM to this API
> >  5) Patches optimizing get_user_phyr()
> >  6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
> 
> I was thinking ...
> 
> 1. get_user_phyrs() & free_user_phyrs()
> 2. dma_map_phyrs() and dma_unmap_phyrs() wrappers that create a
>    scatterlist from phyrs and call dma_map_sg() / dma_unmap_sg() to work
>    with current IOMMU drivers

IMHO, the scatterlist has to go away. The interface should be physr
list in, physr list out.

In the two cases I care most about in RDMA, not scatter list is alot
less memory than today.

For iommu pass through the DMA address and CPU address are the same,
so we can re-use the original physr list. So 8 byes/page.

For the iommu case where the iommu linearizes the whole map I end up
with 1 entry for the DMA list. Also 8 bytes/page for enough pages.

Even the degenerate case where I need to have unique DMA addresses for
each page (eg bus offset, no iommu) is still only 16 bytes per page
instead of 24. (I don't have a use case for RDMA)

The rarer case of non-page aligned transfer becomes 24 bytes and is
just as bad as SGL. (RDMA doesn't ever do this)

So in typical cases for RDMA HPC workloads I go from 24 -> 8 bytes of
long term storage. This is a huge win, IMHO.

I can live with the temporary compat performance overhead in IOMMU
mode of building a temporary SGL and then copying it to a DMA physr
list while we agitate to get the server IOMMU drivers updated, so long
as special passthrough mode is optimized to re-use the CPU list out of
the gate.

I don't realistically expect that replacing every scatterlist with
physr will ever happen. It just need to be a sufficiently general API
that it could be done to be worth all the work to implement it. IMHO.

Jason