On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
> > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
> > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > like to add an interface to write pre-compressed data directly to the
> > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > pwritev2().
> > > > >
> > > > > A new RWF_ENCODED flag indicates that a write is "encoded". If this
> > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > extent.
> > > > >
> > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > in the future.
> > > > >
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > [...]
> > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > +			 struct iov_iter *from)
> > > > > +{
> > > > > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > +		return -EINVAL;
> > > > > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > +		return -EFAULT;
> > > > > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > +		iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > +		return 0;
> > > > > +	}
> > > > > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > +		return -EINVAL;
> > > > > +	if (!capable(CAP_SYS_ADMIN))
> > > > > +		return -EPERM;
> > > >
> > > > How does this capable() check interact with io_uring? Without having
> > > > looked at this in detail, I suspect that when an encoded write is
> > > > requested through io_uring, the capable() check might be executed on
> > > > something like a workqueue worker thread, which is probably running
> > > > with a full capability set.
> > >
> > > I discussed this more with Jens. You're right, per-IO permission checks
> > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > to check capabilities in the right context. So, this will probably
> > > require a new open flag.
> >
> > Actually, file_ns_capable() accomplishes the same thing without a new
> > open flag. Changing the capable() check to file_ns_capable() in
> > init_user_ns should be enough.

> +Aleksa for openat2() and open() space
>
> Mmh... but if the file descriptor has been passed through a privilege
> boundary, it isn't really clear whether the original opener of the
> file intended for this to be possible.
> For example, if (as a hypothetical example) the init process opens a
> service's logfile with root privileges, then passes the file descriptor
> to that logfile to the service on execve(), that doesn't mean that the
> service should be able to perform compressed writes into that file, I
> think.

Ahh, you're right.

> I think that an open flag (as you already suggested) or an fcntl()
> operation would do the job; but AFAIK the open() flag space has run
> out, so if you hook it up that way, I think you might have to wait for
> Aleksa Sarai to get something like his sys_openat2() suggestion
> (https://lore.kernel.org/lkml/20190904201933.10736-12-cyphar@xxxxxxxxxx/)
> merged?

If I counted correctly, there's still space for a new O_ flag. One of
the problems that Aleksa is solving is that unknown O_ flags are
silently ignored, which isn't an issue for an O_ENCODED flag. If the
kernel doesn't support it, it won't support RWF_ENCODED, either, so
you'll get EOPNOTSUPP from pwritev2(). So, open flag it is...
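
For reference, here is a rough userspace sketch of the iovec layout the
patch description asks for: iov[0] carries the metadata header and
iov[1] the pre-compressed payload. The struct layout and field widths
below are assumptions made only for illustration (the thread names the
compression and encryption fields and the unencoded length, nothing
more), and build_encoded_iov() is a hypothetical helper, not part of the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical stand-in for struct encoded_iov. The real layout is
 * defined by the patch series; only the field meanings come from the
 * thread, the types and ordering here are assumed.
 */
struct encoded_iov_sketch {
	unsigned long long unencoded_len;	/* decompressed length of the extent */
	unsigned int compression;		/* 0 assumed to mean "none" */
	unsigned int encryption;		/* 0 assumed to mean "none" */
};

/*
 * Fill a two-element iovec array for an encoded write:
 * iov[0] points at the metadata header, iov[1] at the encoded data.
 */
static void build_encoded_iov(struct iovec iov[2],
			      struct encoded_iov_sketch *hdr,
			      unsigned int compression,
			      unsigned long long unencoded_len,
			      void *payload, size_t payload_len)
{
	assert(iov != NULL && hdr != NULL);

	memset(hdr, 0, sizeof(*hdr));
	hdr->unencoded_len = unencoded_len;
	hdr->compression = compression;

	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);	/* must match the header size exactly */
	iov[1].iov_base = payload;
	iov[1].iov_len = payload_len;
}
```

With the interface as proposed, the array would then go to pwritev2()
with RWF_ENCODED set, on a file descriptor opened with the new open
flag. Neither flag exists in released headers at this point, so the
sketch stops short of making the call.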