Re: How to convert I/O iterators to iterators, sglists and RDMA lists

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Mon, 17 Oct 2022 06:15:56 -0700

On Fri, Oct 14, 2022 at 04:26:57PM +0100, David Howells wrote:
>  (1) Async direct I/O.
> 
>      In the async case direct I/O, we cannot hold on to the iterator when we
>      return, even if the operation is still in progress (ie. we return
>      EIOCBQUEUED), as it is likely to be on the caller's stack.
> 
>      Also, simply copying the iterator isn't sufficient as virtual userspace
>      addresses cannot be trusted and we may have to pin the pages that
>      comprise the buffer.

This is closely related to the discussion we are having with Ira and Al
about pinning for O_DIRECT.  What block file systems do is to take
the pages from the iter and some flags on what is pinned.  We can
generalize this to store all extra state in a flags word, or bite the
bullet and allow cloning of the iter in one form or another.
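
To make the "pages plus a flags word" idea concrete, here is a rough
userspace model, not kernel code.  Every name in it (model_iter,
model_req, REQ_PAGES_PINNED, req_capture, req_complete) is made up for
illustration, and the pin is just a counter.  It only shows the request
keeping its own page pointers and pin state so the caller's on-stack
iterator is not needed once submission returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REQ_PAGES_PINNED 0x1u	/* hypothetical: request holds a pin on each page */
#define REQ_MAX_PAGES    16

struct model_page {		/* stand-in for a page; pin is a plain counter */
	int pin_count;
};

struct model_iter {		/* stand-in for the caller's on-stack iterator */
	struct model_page **pages;
	size_t nr_pages;
	bool user_backed;	/* assumption: only user memory needs pinning */
};

struct model_req {		/* state that outlives the submitting call */
	struct model_page *pages[REQ_MAX_PAGES];
	size_t nr_pages;
	unsigned int flags;
};

/* Copy the pages out of the iterator and record in flags what was pinned. */
static int req_capture(struct model_req *req, const struct model_iter *it)
{
	assert(req && it);
	if (it->nr_pages > REQ_MAX_PAGES)
		return -1;

	req->nr_pages = it->nr_pages;
	req->flags = it->user_backed ? REQ_PAGES_PINNED : 0;
	for (size_t i = 0; i < it->nr_pages; i++) {
		req->pages[i] = it->pages[i];
		if (req->flags & REQ_PAGES_PINNED)
			req->pages[i]->pin_count++;
	}
	return 0;
}

/* On completion, drop only what the flags word says we took. */
static void req_complete(struct model_req *req)
{
	assert(req);
	if (req->flags & REQ_PAGES_PINNED)
		for (size_t i = 0; i < req->nr_pages; i++)
			req->pages[i]->pin_count--;
	req->nr_pages = 0;
	req->flags = 0;
}
```

The alternative mentioned above, cloning the iter itself, would replace
the pages array with a private copy of the iterator; this model does not
attempt that.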

>  (2) Crypto.
> 
>      The crypto interface takes scatterlists, not iterators, so we need to be
>      able to convert an iterator into a scatterlist in order to do content
>      encryption within netfslib.  Doing this in netfslib makes it easier to
>      store content-encrypted files encrypted in fscache.

Note that the scatterlist is generally a pretty bad interface.  We've
been talking for a while about having an interface that takes a page
array as an input and returns an array of { dma_addr, len } tuples.
Thinking about it, taking in an iter might actually be an even better
idea.
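
As a sketch of the shape such an interface could have, here is a
userspace model, not an existing kernel API.  The names dma_range,
map_pages_to_ranges and MODEL_PAGE_SIZE are invented, pages are
represented by frame numbers, and the "DMA address" is simply
pfn * page size (i.e. no IOMMU).  It takes a page array plus an offset
and length and emits { dma_addr, len } tuples, merging pages that happen
to be adjacent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: fixed 4 KiB pages */

struct dma_range {		/* hypothetical output tuple */
	uint64_t dma_addr;
	uint32_t len;
};

/*
 * Translate nr_pages page frame numbers, starting at 'offset' into the
 * first page and covering 'len' bytes, into address ranges.
 * Returns the number of ranges written, or -1 if 'out' is too small or
 * the pages do not cover 'len' bytes.
 */
static int map_pages_to_ranges(const uint64_t *pfns, size_t nr_pages,
			       uint32_t offset, size_t len,
			       struct dma_range *out, size_t max_out)
{
	size_t n = 0;

	assert(pfns && out);
	assert(offset < MODEL_PAGE_SIZE);

	for (size_t i = 0; i < nr_pages && len > 0; i++) {
		uint64_t addr = pfns[i] * MODEL_PAGE_SIZE + offset;
		uint32_t chunk = MODEL_PAGE_SIZE - offset;

		if (chunk > len)
			chunk = (uint32_t)len;

		/* Extend the previous range if this page follows it directly. */
		if (n > 0 &&
		    out[n - 1].dma_addr + out[n - 1].len == addr &&
		    out[n - 1].len <= UINT32_MAX - chunk) {
			out[n - 1].len += chunk;
		} else {
			if (n == max_out)
				return -1;
			out[n].dma_addr = addr;
			out[n].len = chunk;
			n++;
		}

		len -= chunk;
		offset = 0;	/* only the first page starts mid-page */
	}

	return len ? -1 : (int)n;
}
```

An iter-based variant would take the iterator in place of the pfns,
offset and len arguments and walk its segments instead.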

>  (3) RDMA.
> 
>      To perform RDMA, a buffer list needs to be presented as a QPE array.
>      Currently, cifs converts the iterator it is given to lists of pages, then
>      each list to a scatterlist and thence to a QPE array.  I have code to
>      pass the iterator down to the bottom, using an intermediate BVEC iterator
>      instead of a page list if I can't pass down the original directly (eg. an
>      XARRAY iterator on the pagecache), but I still end up converting it to a
>      scatterlist, which is then converted to a QPE.  I'm trying to go directly
>      from an iterator to a QPE array, thus avoiding the need to allocate an
>      sglist.

I'm not sure what you mean by QPE.  The fundamental low-level
interface in RDMA is the ib_sge.  If you feed it to RDMA READ/WRITE
requests, the interface for that is the RDMA R/W API in
drivers/infiniband/core/rw.c, which currently takes a scatterlist but
to which all of the above remarks on DMA interfaces apply.  For RDMA
SEND the ULP has to do a dma_map_single/page to fill it, which is
quite a horrible layering violation and should move into the driver,
but that is going to be a massive change to the whole RDMA subsystem,
so it is unlikely to happen anytime soon.

Neither case has anything to do with what should be in common iov_iter
code; all of this needs to live in the RDMA subsystem as a consumer.