On Fri, Mar 19, 2021 at 01:08:05PM -0700, Linus Torvalds wrote: > On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote: > > > > Can we get some movement on this? Omar is sort of spinning his wheels here > > trying to get this stuff merged, no major changes have been done in a few > > postings. > > I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want > more explanations of why this should be done that way, and pollute our > iov_iter handling EVEN MORE. > > Our iov_iter stuff isn't the most legible, and I don't understand why > anybody would ever think it's a good idea to spread what is clearly a > "struct" inside multiple different iov extents. > > Honestly, this sounds way more like an ioctl interface than > read/write. We've done that before. As Josef just mentioned, I started with an ioctl, and Dave Chinner suggested doing it this way: https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@xxxxxxxxxxxxxxxxxxx/ The nice thing about it is that it sidesteps all of the issues we had with the dedupe/reflink ioctls over the years where permissions checks and the like were slightly different from read/write. > But if it has to be done with an iov_iter, is there *any* reason to > not make it be a hard rule that iov[0] should not be the entirely of > the struct, and the code shouldn't ever need to iterate? For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I made the helper more generic to support other use cases, but if that's the main objection I can easily make it specifically use iov[0]. > Also I see references to the man-page, but honestly, that's not how > the kernel UAPI should be defined ("just read the man-page"), plus I > refuse to live in the 70's, and consider troff to be an atrocious > format. No disagreement here, troff is horrible to read. > So make the UAPI explanation for this horror be in a legible format > that is actually part of the kernel so that I can read what the intent > is, instead of having to decode hieroglypics. I didn't want to document the UAPI in two places that would need to be kept in sync, but this is fair enough. I'll add UAPI documentation to the kernel. In the meantime, for anyone else following along, here's the formatted man page: ENCODED_IO(7) Linux Programmer's Manual ENCODED_IO(7) NAME encoded_io - overview of encoded I/O DESCRIPTION Several filesystems (e.g., Btrfs) support transparent encoding (e.g., compression, encryption) of data on disk: written data is encoded by the kernel before it is written to disk, and read data is decoded before being returned to the user. In some cases, it is useful to skip this encoding step. For example, the user may want to read the compressed contents of a file or write pre-compressed data directly to a file. This is referred to as "encoded I/O". Encoded I/O API Encoded I/O is specified with the RWF_ENCODED flag to preadv2(2) and pwritev2(2). If RWF_ENCODED is specified, then iov[0].iov_base points to an encoded_iov structure, defined in <linux/fs.h> as: struct encoded_iov { __aligned_u64 len; __aligned_u64 unencoded_len; __aligned_u64 unencoded_offset; __u32 compression; __u32 encryption; }; This may be extended in the future, so iov[0].iov_len must be set to sizeof(struct encoded_iov) for forward/backward compati‐ bility. The remaining buffers contain the encoded data. compression and encryption are the encoding fields. compres‐ sion is ENCODED_IOV_COMPRESSION_NONE (zero) or a filesystem- specific ENCODED_IOV_COMPRESSION_* constant; see Filesystem support below. encryption is currently always ENCODED_IOV_EN‐ CRYPTION_NONE (zero). unencoded_len is the length of the unencoded (i.e., decrypted and decompressed) data. unencoded_offset is the offset from the first byte of the unencoded data to the first byte of logi‐ cal data in the file (less than or equal to unencoded_len). len is the length of the data in the file (less than or equal to unencoded_len - unencoded_offset). See Extent layout below for some examples. If the unencoded data is actually longer than unencoded_len, then it is truncated; if it is shorter, then it is extended with zeroes. pwritev2(2) uses the metadata specified in iov[0], writes the encoded data from the remaining buffers, and returns the number of encoded bytes written (that is, the sum of iov[n].iov_len for 1 <= n < iovcnt; partial writes will not occur). At least one encoding field must be non-zero. Note that the encoded data is not validated when it is written; if it is not valid (e.g., it cannot be decompressed), then a subsequent read may return an error. If the offset argument to pwritev2(2) is -1, then the file offset is incremented by len. If iov[0].iov_len is less than sizeof(struct encoded_iov) in the kernel, then any fields unknown to user space are treated as if they were zero; if it is greater and any fields unknown to the kernel are non- zero, then this returns -1 and sets errno to E2BIG. preadv2(2) populates the metadata in iov[0], the encoded data in the remaining buffers, and returns the number of encoded bytes read. This will only return one extent per call. This can also read data which is not encoded; all encoding fields will be zero in that case. If the offset argument to preadv2(2) is -1, then the file offset is incremented by len. If iov[0].iov_len is less than sizeof(struct encoded_iov) in the kernel and any fields unknown to user space are non-zero, then preadv2(2) returns -1 and sets errno to E2BIG; if it is greater, then any fields unknown to the kernel are returned as zero. If the provided buffers are not large enough to return an entire encoded extent, then preadv2(2) returns -1 and sets errno to ENOBUFS. As the filesystem page cache typically contains decoded data, encoded I/O bypasses the page cache. Extent layout By using len, unencoded_len, and unencoded_offset, it is possi‐ ble to refer to a subset of an unencoded extent. In the simplest case, len is equal to unencoded_len and unen‐ coded_offset is zero. This means that the entire unencoded ex‐ tent is used. However, suppose we read 50 bytes into a file which contains a single compressed extent. The filesystem must still return the entire compressed extent for us to be able to decompress it, so unencoded_len would be the length of the entire decompressed extent. However, because the read was at offset 50, the first 50 bytes should be ignored. Therefore, unencoded_offset would be 50, and len would accordingly be unencoded_len - 50. Additionally, suppose we want to create an encrypted file with length 500, but the file is encrypted with a block cipher using a block size of 4096. The unencoded data would therefore in‐ clude the appropriate padding, and unencoded_len would be 4096. However, to represent the logical size of the file, len would be 500 (and unencoded_offset would be 0). Similar situations can arise in other cases: * If the filesystem pads data to the filesystem block size be‐ fore compressing, then compressed files with a size un‐ aligned to the filesystem block size will end with an extent with len < unencoded_len. * Extents cloned from the middle of a larger encoded extent with FICLONERANGE may have a non-zero unencoded_offset and/or len < unencoded_len. * If the middle of an encoded extent is overwritten, the filesystem may create extents with a non-zero unencoded_off‐ set and/or len < unencoded_len for the parts that were not overwritten. Security Encoded I/O creates the potential for some security issues: * Encoded writes allow writing arbitrary data which the kernel will decode on a subsequent read. Decompression algorithms are complex and may have bugs which can be exploited by ma‐ liciously crafted data. * Encoded reads may return data which is not logically present in the file (see the discussion of len vs unencoded_len above). It may not be intended for this data to be read‐ able. Therefore, encoded I/O requires privilege. Namely, the RWF_EN‐ CODED flag may only be used if the file description has the O_ALLOW_ENCODED file status flag set, and the O_ALLOW_ENCODED flag may only be set by a thread with the CAP_SYS_ADMIN capa‐ bility. The O_ALLOW_ENCODED flag can be set by open(2) or fc‐ ntl(2). It can also be cleared by fcntl(2); clearing it does not require CAP_SYS_ADMIN. Note that it is not cleared on fork(2) or execve(2). One may wish to use O_CLOEXEC with O_AL‐ LOW_ENCODED. Filesystem support Encoded I/O is supported on the following filesystems: Btrfs (since Linux 5.13) Btrfs supports encoded reads and writes of compressed data. The data is encoded as follows: * If compression is ENCODED_IOV_COMPRESSION_BTRFS_ZLIB, then the encoded data is a single zlib stream. * If compression is ENCODED_IOV_COMPRESSION_BTRFS_ZSTD, then the encoded data is a single zstd frame com‐ pressed with the windowLog compression parameter set to no more than 17. * If compression is one of ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_4K, ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_8K, ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_16K, ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_32K, or ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_64K, then the encoded data is com‐ pressed page by page (using the page size indicated by the name of the constant) with LZO1X and wrapped in the format documented in the Linux kernel source file fs/btrfs/lzo.c. Additionally, there are some restrictions on pwritev2(2): * offset (or the current file offset if offset is -1) must be aligned to the sector size of the filesystem. * len must be aligned to the sector size of the filesystem unless the data ends at or beyond the cur‐ rent end of the file. * unencoded_len and the length of the encoded data must each be no more than 128 KiB. This limit may in‐ crease in the future. * The length of the encoded data must be less than or equal to unencoded_len. * If using LZO, the filesystem's page size must match the compression page size. SEE ALSO open(2), preadv2(2) Linux 2020-11-11 ENCODED_IO(7)