On Thu, Oct 20, 2022 at 03:03:56PM +0100, David Howells wrote:
> Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> > >  (1) Async direct I/O.
> > >
> > >      In the async direct I/O case, we cannot hold on to the iterator
> > >      when we return, even if the operation is still in progress (ie. we
> > >      return EIOCBQUEUED), as it is likely to be on the caller's stack.
> > >
> > >      Also, simply copying the iterator isn't sufficient as virtual userspace
> > >      addresses cannot be trusted and we may have to pin the pages that
> > >      comprise the buffer.
> >
> > This is very related to the discussion we are having related to pinning
> > for O_DIRECT with Ira and Al.
>
> Do you have a link to that discussion? I don't see anything obvious on
> fsdevel including Ira.

I think Christoph meant to say John Hubbard.

>
> I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
> involved in that.

This one?

https://lore.kernel.org/all/20220831041843.973026-5-jhubbard@xxxxxxxxxx/

I've been casually reading it but not directly involved.

Ira

>
> > What block file systems do is to take the pages from the iter and some flags
> > on what is pinned. We can generalize this to store all extra state in a
> > flags word, or bite the bullet and allow cloning of the iter in one form or
> > another.
>
> Yeah, I know. A list of pages is not an ideal solution. It can only handle
> contiguous runs of pages, possibly with a partial page at either end. A bvec
> iterator would be of more use as it can handle a series of partial pages.
>
> Note also that I would need to turn the pages *back* into an iterator in order
> to commune with sendmsg() in the nether reaches of some network filesystems.
>
> > >  (2) Crypto.
> > >
> > >      The crypto interface takes scatterlists, not iterators, so we need to
> > >      be able to convert an iterator into a scatterlist in order to do
> > >      content encryption within netfslib.
> > >      Doing this in netfslib makes it
> > >      easier to store content-encrypted files encrypted in fscache.
> >
> > Note that the scatterlist is generally a pretty bad interface. We've
> > been talking for a while to have an interface that takes a page array
> > as an input and returns an array of { dma_addr, len } tuples. Thinking
> > about it taking in an iter might actually be an even better idea.
>
> It would be nice to be able to pass an iterator to the crypto layer. I'm not
> sure what the crypto people think of that.
>
> > >  (3) RDMA.
> > >
> > >      To perform RDMA, a buffer list needs to be presented as a QPE array.
> > >      Currently, cifs converts the iterator it is given to lists of pages,
> > >      then each list to a scatterlist and thence to a QPE array. I have
> > >      code to pass the iterator down to the bottom, using an intermediate
> > >      BVEC iterator instead of a page list if I can't pass down the
> > >      original directly (eg. an XARRAY iterator on the pagecache), but I
> > >      still end up converting it to a scatterlist, which is then converted
> > >      to a QPE. I'm trying to go directly from an iterator to a QPE array,
> > >      thus avoiding the need to allocate an sglist.
> >
> > I'm not sure what you mean with QPE. The fundamental low-level
> > interface in RDMA is the ib_sge.
>
> Sorry, yes. ib_sge array. I think it appears as QPs on the wire.
>
> > If you feed it to RDMA READ/WRITE requests the interface for that is the
> > RDMA R/W API in drivers/infiniband/core/rw.c, which currently takes a
> > scatterlist but to which all of the above remarks on DMA interface apply.
> > For RDMA SEND that ULP has to do a dma_map_single/page to fill it, which is
> > a quite horrible layering violation and should move into the driver, but
> > that is going to be a massive change to the whole RDMA subsystem, so
> > unlikely to happen anytime soon.
>
> In cifs, as it is upstream, in RDMA transmission, the iterator is converted
> into a clutch of pages in the top, which is converted back into iterators
> (smbd_send()) and those into scatterlists (smbd_post_send_data()), thence into
> sge lists (see smbd_post_send_sgl()).
>
> I have patches that pass an iterator (which it decants to a bvec if async) all
> the way down to the bottom layer. Snippets are then converted to scatterlists
> and those to sge lists. I would like to skip the scatterlist intermediate and
> convert directly to sge lists.
>
> On the other hand, if you think the RDMA API should be taking scatterlists
> rather than sge lists, that would be fine. Even better if I can just pass an
> iterator in directly - though neither scatterlist nor iterator has a place to
> put the RDMA local_dma_key - though I wonder if that's actually necessary for
> each sge element, or whether it could be handed through as part of the request
> as a whole.
>
> > Neither case has anything to do with what should be in common iov_iter
> > code, all this needs to live in the RDMA subsystem as a consumer.
>
> That's fine in principle. However, I have some extraction code that can
> convert an iterator to another iterator, an sglist or an rdma sge list, using
> a common core of code to do all three.
>
> I can split it up if that is preferable.
>
> Do you have code that's ready to be used? I can make immediate use of it.
>
> David
>
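
To make the "common core" idea above concrete, here is a rough userspace
sketch, not kernel code and not David's patches. It assumes a bvec-like list
of fragments and shows one walker feeding two different sinks: one fills an
ib_sge-like array (with the key supplied once for the whole request), the
other builds a second fragment list. Every name here (struct seg, struct
sge_like, extract_segs(), sge_sink(), seg_sink()) is hypothetical and chosen
only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one bvec-style fragment (a partial page). */
struct seg {
	uintptr_t addr;		/* start of the fragment */
	size_t    len;		/* length in bytes */
};

/* Hypothetical stand-in for an ib_sge-like entry; not the kernel struct. */
struct sge_like {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* A sink consumes one fragment; returns 0 on success, -1 if it is full. */
typedef int (*sink_fn)(void *ctx, uintptr_t addr, size_t len);

/*
 * Common core: skip 'offset' bytes, then hand up to 'count' bytes to the
 * sink one fragment at a time.  Returns the number of bytes handed over.
 */
static size_t extract_segs(const struct seg *segs, size_t nsegs,
			   size_t offset, size_t count,
			   sink_fn sink, void *ctx)
{
	size_t done = 0;

	assert(sink != NULL);
	for (size_t i = 0; i < nsegs && done < count; i++) {
		uintptr_t addr = segs[i].addr;
		size_t len = segs[i].len;

		if (offset >= len) {	/* fragment lies wholly before offset */
			offset -= len;
			continue;
		}
		addr += offset;
		len -= offset;
		offset = 0;
		if (len > count - done)	/* clamp the final fragment */
			len = count - done;
		if (sink(ctx, addr, len) < 0)
			break;		/* destination array is full */
		done += len;
	}
	return done;
}

/* Sink 1: fill an sge-like array; the key comes from the request context. */
struct sge_ctx { struct sge_like *sge; size_t max, n; uint32_t lkey; };

static int sge_sink(void *ctx, uintptr_t addr, size_t len)
{
	struct sge_ctx *c = ctx;

	if (c->n == c->max)
		return -1;
	c->sge[c->n++] = (struct sge_like){
		.addr = addr, .length = (uint32_t)len, .lkey = c->lkey,
	};
	return 0;
}

/* Sink 2: build another fragment list (i.e. "another iterator"). */
struct seg_ctx { struct seg *out; size_t max, n; };

static int seg_sink(void *ctx, uintptr_t addr, size_t len)
{
	struct seg_ctx *c = ctx;

	if (c->n == c->max)
		return -1;
	c->out[c->n++] = (struct seg){ .addr = addr, .len = len };
	return 0;
}
```

A caller would fill in a struct sge_ctx with its destination array, capacity
and key, then call extract_segs(segs, nsegs, offset, count, sge_sink, &ctx).
A scatterlist sink would slot in the same way. Whether the real thing belongs
in iov_iter code or in the RDMA subsystem is exactly the open question above;
the sketch takes no position on that.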