Re: How to convert I/O iterators to iterators, sglists and RDMA lists

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Oct 24, 2022 at 08:53:28PM +0100, Al Viro wrote:
> 1) iter-to-scatterlist use is much wider than RDMA.  Other places like that
> include e.g. vhost_scsi_map_to_sgl(), p9_get_mapped_pages(),
> rds_message_zcopy_from_user(), tls_setup_from_iter()...

RDS is RDMA.  vhost_scsi_map_to_sgl and p9_get_mapped_pages do some
odd virtio thing.  But point taken, it is spread further than it should
be at the moment.  It is however a rather bad data structure that really
should not spead much further.

> 2) there's a limit to how far we can propagate an arbitrary iov_iter -
> ITER_IOVEC/ITER_UBUF ones are absolutely tied to mm_struct of the
> originating process.  We can't use them for anything async - not
> without the horrors a-la use_mm().

But why would you pass them on?  It is much better to just convert
them to a bio_vec and pass that on.  We could still feed that to n
iter later, and in fact there are a bunch of good reasons to do so.
But in pretty much all those cases you really do not want to keep
the whole iov_iter state.

> 	We can do separate sendmsg() for kvec and bvec parts,
> but that doesn't come for free either.  *AND* bvec part is very
> likely not the original iterator we got those pages from.

sendmsg model seems to be very much built around that model with
MSG_MORE.  But even with a 'converter' how do you plan to build
such a mixed iter anyay?

> My problem with all that stuff is that we ought to sort out the
> lifetime and pin_user issues around the iov_iter.  What I really
> want to avoid is "no worries, we'd extracted stuff into ITER_BVEC, it's
> stable and can be passed around in arbitrary way" kind of primitive.
> Because *that* has no chance to work.

Yes.  I think the first thing we need in this whole area is to sort
the pinning out.  After that we can talk about all kinds of convenience
helpers.

> As far as I can see, we have the following constraints:
> 
> 	* page references put into ITER_BVEC (and ITER_XARRAY) must not
> go away while the iov_iter is being used.  That's on the creator of
> iov_iter.

*nod*

> 	* pages found in iterator might be used past the lifetime of
> iterator.  We need the underlying pages to survive until the last
> use.  "Grab a page reference" is *NOT* a solution in general case.
> 	* pages found in data-destination iterator may have their
> contents modified, both during the iterator lifetime and asynchronously.

This is where the trouble start.  If you want to be able to feed
kmalloced data into throgh ITER_KVEC (or ITER_BVEC for the matter),
you can't just grab any kind of hold to it.  The only way to do that
is by telling the caller you're done with it.  I.e. how aio/io_ring/etc
use ki_complete - the callee owns the data until it declares it is done
by calling ->ki_complete.  But no 'borrowing' of refeferences as the
only sane way to do that would be page refcounts, but those do not
work for everything.

> If it has a chance to be a user-mapped page, we must either
> 	a) have it locked by caller and have no modifications after
> it gets unlocked or
> 	b) have it pinned (sensu pin_user_pages()) by the caller and
> have no modifications until the unpin_user_page().

Yes.  And I think we need a good counter part to iov_iter_pin_pages
that undoes any required pinning, so that users of iov_iter_pin_pages
and iov_iter_unpin_pages can use these helpers without even thinking
about the rules.  That requires passing some amount of state to the
unpin side.  It could just be an unsigned long with flags probably,
or we keep the iov_iter alive and look at that.

> Another issue with iov_iter_get_pages...() is that compound page turns
> into a bunch of references to individual subpages; io-uring folks have
> noticed the problem, but their solution is... inelegant.  I wonder if
> we would be better off with a variant of the primitive that would give
> out compound pages; it would need different calling conventions,
> obviously (current ones assume that all pages except the first and
> the last one have PAGE_SIZE worth of data in them).

The new name for compound pages is folios, and yes the whole get/pin
user pages machinery needs to switch to that.



[Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux