Dropping some thoughts from the gpu driver perspective, feel free to tell
me it's nonsense from the mm/block view :-)

Generally I think we really, really need something like this that's
across all subsystems and consistent.

On Tue, Jan 11, 2022 at 1:41 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> > Finally, it may be possible to stop using scatterlist to describe the
> > input to the DMA-mapping operation. We may be able to get struct
> > scatterlist down to just dma_address and dma_length, with chaining
> > handled through an enclosing struct.
>
> Can you talk about this some more? IMHO one of the key properties of
> the scatterlist is that it can hold huge amounts of pages without
> having to do any kind of special allocation due to the chaining.
>
> The same will be true of the phyr idea, right?
>
> > I would like to see phyr replace bio_vec everywhere it's currently used.
> > I don't have time to do that work now because I'm busy with folios.
> > If someone else wants to take that on, I shall cheer from the sidelines.
> > What I do intend to do is:
>
> I wonder if we mixed things though..
>
> IMHO there is a lot of optimization to be had by having a
> datastructure that is expressly 'the physical pages underlying a
> contiguous chunk of va'
>
> If you limit to that scenario then we can be more optimal because
> things like byte granular offsets and size in the interior pages don't
> need to exist. Every interior chunk is always aligned to its order and
> we only need to record the order.

At least from the gfx side of things only allowing page sized chunks
makes a lot of sense. sg chains kinda feel wrong because they allow
byte-granular chunks (which in practice isn't actually allowed anyway),
so there's quite a mismatch.

If we go with page size I think hardcoding a PHYS_PAGE_SIZE of KB(4)
would make sense, because thanks to x86 that's pretty much the lowest
common denominator that all hw (that I know of, at least) supports. Not
having to fiddle with "which page size do we have" in driver code would
be neat; it makes writing portable gup code in drivers just needlessly
silly.

> An overall starting offset and total length allow computing the slice
> of the first/last entry.
>
> If the physical address is always aligned then we get 12 free bits
> from the min 4k alignment and also only need to store the order, not an
> arbitrary byte granular length.
>
> The win is I think we can meaningfully cover most common cases using
> only 8 bytes per physical chunk. The 12 bits can be used to encode the
> common orders (4k, 2M, 1G, etc) and some smart mechanism to get
> another 16 bits to cover 'everything'.
>
> IMHO storage density here is quite important, we end up having to keep
> this stuff around for a long time.
>
> I say this here, because I've always thought bio_vec/etc are more
> general than we actually need, being byte granular at every chunk.
>
> > - Add an interface to gup.c to pin/unpin N phyrs
> > - Add a sg_map_phyrs()
> >   This will take an array of phyrs and allocate an sg for them
> > - Whatever else I need to do to make one RDMA driver happy with
> >   this scheme
>
> I spent a lot of time already cleaning all the DMA code in RDMA - it is
> now nicely uniform and ready to do this sort of change. I was
> expecting it to be a bio_vec, but this is fine too.
>
> What is needed is a full scatterlist replacement, including the IOMMU
> part.
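
Just to check I'm parsing the 8-bytes-per-chunk idea the same way,
something like the below? Only a rough sketch to anchor the discussion -
all the names are made up, and it punts on the smarter encoding for the
extra 16 bits you mention, it just stores the order directly in the low
bits.

	/*
	 * Rough sketch only, all names made up, just so we're talking
	 * about the same shape of thing.
	 */
	#include <linux/types.h>

	/* hardcoded 4k chunks as the lowest common denominator, per the above */
	#define PHYS_PAGE_SHIFT	12

	/*
	 * One 8-byte entry per physical chunk: the address is always
	 * aligned to its order, so the low PHYS_PAGE_SHIFT bits are free
	 * to carry the order (0 = 4k, 9 = 2M, 18 = 1G).
	 */
	struct phyr {
		u64 addr_order;
	};

	#define PHYR_ORDER_MASK	((1ULL << PHYS_PAGE_SHIFT) - 1)

	static inline u64 phyr_addr(struct phyr p)
	{
		return p.addr_order & ~PHYR_ORDER_MASK;
	}

	static inline unsigned int phyr_order(struct phyr p)
	{
		return p.addr_order & PHYR_ORDER_MASK;
	}

	static inline u64 phyr_len(struct phyr p)
	{
		return 1ULL << (PHYS_PAGE_SHIFT + phyr_order(p));
	}

	/*
	 * Chained container so a huge pin never needs one huge
	 * allocation; byte-granular offset/length exist only once, in
	 * the head chunk, to slice the first and last entry.
	 */
	struct phyr_list {
		u64 offset;		/* into the first entry, head only */
		u64 length;		/* of the whole range, head only */
		unsigned int nr;	/* entries used in this chunk */
		struct phyr_list *next;
		struct phyr entries[];
	};

If that's roughly the shape, then I'd expect the dma-mapped side to be
the exact same struct just holding IOVAs instead of physical addresses,
which I think is what you're getting at below for the IOMMU case.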

> For the IOMMU I would expect the datastructure to be re-used, we start
> with a list of physical pages and then 'dma map' gives us a list of
> IOVA physical pages, in another allocation, but exactly the same
> datastructure.
>
> This 'dma map' could return a pointer to the first datastructure if
> there is no iommu, allocate a single entry list if the whole thing can
> be linearly mapped with the iommu, and the other baroque cases (like pci
> offset/etc) will need to allocate the full array. ie good HW runs fast
> and is memory efficient.
>
> It would be nice to see a patch sketch showing what this
> datastructure could look like.
>
> VFIO would like this structure as well, as it currently does a very
> inefficient page-at-a-time loop when it iommu maps things.

I also think it would be nice if we can have this as a consistent set of
datastructures, both for dma_addr_t and phys_addr_t. Roughly my wishlist:

- chained by default, because of the allocation headaches, so I'm with
  Jason here

- comes compressed out of gup; I think we all agree on this, and I don't
  really care how it's compressed as long as I don't get 512 entries for
  2M pages :-)

- ideally the dma_ and the phys_ parts are split, since for dma-buf and
  some gpu driver internal use-cases there are some where really only
  the dma addr is valid, and nothing else. But we can also continue the
  current hack of pretending the other side isn't there (atm we scramble
  those pointers to make sure dma-buf users don't cheat)

- I think minimally an sg list form of dma-mapped stuff which does not
  have a struct page; iirc last time we discussed this we agreed that it
  really needs to be part of such a rework or it's not really improving
  things much

- a few per-entry driver bits would be nice in both the phys/dma chains,
  if we can have them. gpus have funny gpu interconnects, and this would
  allow us to put all the gpu addresses into dma_addr_t if we can have
  some bits indicating whether it's on the pci bus, gpu local memory or
  the gpu<->gpu interconnect

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch