On Wed, Jun 23, 2021 at 11:28:15AM -0700, Linus Torvalds wrote: > On Wed, Jun 23, 2021 at 10:49 AM Omar Sandoval <osandov@xxxxxxxxxxx> wrote: > > > > Al, Linus, what do you think? Is there a path forward for this series as > > is? > > So the "read from user space in order to write" is a no-go for me. It > completely violates what a "read()" system call should do. It also > entirely violates what an iovec can and should do. > > And honestly, if Al hates the "first iov entry" model, I'm not sure I > want to merge that version - I personally find it fine, but Al is > effectively the iov-iter maintainer. > > I do worry a bit about the "first iov entry" simply because it might > work for "writev2()" when given virtual user space addresses - but I > think it's conceptually broken for things like direct-IO which might > do things by physical address, and what is a contiguous user space > virtual address is not necessarily a contiguous physical address. > > Yes, the filesystem can - and does - hide that path by basically not > doing direct-IO on the first entry at all, and just treat is very > specially in the front end of the IO access, but that only reinforces > the whole "this is not at all like read/write". > > Similar issues might crop up in other situations, ie splice etc, where > it's not at all obvious that the iov_iter boundaries would be > maintained as it moves through the system. > > So while I personally find the "first iov entry" model fairly > reasonable, I think Dave is being disingenuous when he says that it > looks like a normal read/write. It very much does not. The above is > quite fundamental. > > > I'd be happy to have this functionality merged in any form, but I do > > think that this approach with preadv2/pwritev2 using iov_len is decent > > relative to the alternatives. > > As mentioned, I find it acceptable. I'm completely unimpressed with > Dave's argument, but ioctl's aren't perfect either, so weak or not, > that argument being bogus doesn't necessarily mean that the iovec > entry model is wrong. > > That said, thinking about exactly the fact that I don't think a > translation from iovec to anything else can be truly valid, I find the > iter_is_iovec() case to be the only obviously valid one. > > Which gets me back to: how can any of the non-iovec alternatives ever > be valid? You did mention having missed ITER_XARRAY, but my question > is more fundamental than that. How could a non-iter_is_iovec ever be > valid? There are no possible interfaces that can generate such a thing > sanely. I only implemented the bvec and kvec cases for completeness, since copy_struct_from_iter() would appear to be a generic helper. At least for RWF_ENCODED, a bvec seems pretty bogus, but it doesn't seem too far-flung to imagine an in-kernel user of RWF_ENCODED that uses a kvec. One other option that we haven't considered is ditching the copy_struct_from_user() semantics and going the simpler route of adding some reserved space to the end of struct encoded_iov: struct encoded_iov { __aligned_u64 len; __aligned_u64 unencoded_len; __aligned_u64 unencoded_offset; __u32 compression; __u32 encryption; __u8 reserved[32]; }; Then we can do an unconditional copy_from_user_full(sizeof(struct encoded_iov)) and check the reserved space in the typical fashion. (And in the unlikely case that we use up all of that space with extensions, I suppose we could have an RWF_ENCODED2 with a matching struct encoded_iov2.)