On Thu, Sep 05, 2019 at 12:10:12PM +1000, Dave Chinner wrote: > On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote: > > From: Omar Sandoval <osandov@xxxxxx> > > > > This adds an API for writing compressed data directly to the filesystem. > > The use case that I have in mind is send/receive: currently, when > > sending data from one compressed filesystem to another, the sending side > > decompresses the data and the receiving side recompresses it before > > writing it out. This is wasteful and can be avoided if we can just send > > and write compressed extents. The send part will be implemented in a > > separate series, as this ioctl can stand alone. > > > > The interface is essentially pwrite(2) with some extra information: > > > > - The input buffer contains the compressed data. > > - Both the compressed and decompressed sizes of the data are given. > > - The compression type (zlib, lzo, or zstd) is given. Hi, Dave, > So why can't you do this with pwritev2()? Heaps of flags, and > use a second iovec to hold the decompressed size of the previous > iovec. i.e. > > iov[0].iov_base = compressed_data; > iov[0].iov_len = compressed_size; > iov[1].iov_base = NULL; > iov[1].iov_len = uncompressed_size; > pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB); > > And you don't need to reinvent pwritev() with some whacky ioctl that > is bound to be completely screwed up is ways not noticed until > someone else tries to use it... This is a good suggestion, thanks. I hadn't considered (ab?)using iovecs in this way. One modification I'd make would be to put the encoding into the second iovec and use a single RWF_ENCODED flag so that we don't have to keep stealing from RWF_* every time we add a new compression algorithm/encryption type/whatever: iov[0].iov_base = compressed_data; iov[0].iov_len = compressed_size; iov[1].iov_base = (void *)IOV_ENCODING_ZLIB; iov[1].iov_len = uncompressed_size; pwritev2(fd, iov, 2, offset, RWF_ENCODED); Making every other iovec a metadata iovec in this way would be a major pain to plumb through the iov_iter and VFS code, though. Instead, we could put the metadata in iov[0] and the encoded data in iov[1..iovcnt - 1]: iov[0].iov_base = (void *)IOV_ENCODING_ZLIB; iov[0].iov_len = unencoded_len; iov[1].iov_base = encoded_data1; iov[1].iov_len = encoded_size1; iov[2].iov_base = encoded_data2; iov[2].iov_len = encoded_size2; pwritev2(fd, iov, 3, offset, RWF_ENCODED); In my opinion, these are both reasonable interfaces. The former allows the user to write multiple encoded "extents" at once, while the latter allows writing a single encoded extent from scattered buffers. The latter is much simpler to implement ;) Thoughts? > I'd also suggest atht if we are going to be able to write compressed > data directly, then we should be able to read them as well directly > via preadv2().... > > > The interface is general enough that it can be extended to encrypted or > > otherwise encoded extents in the future. A more detailed description, > > including restrictions and edge cases, is included in > > include/uapi/linux/btrfs.h. > > No thanks, that bit us on the arse -hard- with the clone interfaces > we lifted to the VFS from btrfs. Let's do it through the existing IO > paths and write a bunch of fstests to exercise it and verify the > interface's utility and the filesystem implementation correctness > before anything is merged. > > > The implementation is similar to direct I/O: we have to flush any > > ordered extents, invalidate the page cache, and do the io > > tree/delalloc/extent map/ordered extent dance. > > Which, to me, says that this should be a small bit of extra code > in the direct IO path that skips the compression/decompression code > and sets a few extra flags in the iocb that is passed down to the > direct IO code. > > We don't need a whole new IO path just to skip a data transformation > step in the direct IO path.... Eh, at least for Btrfs, it's much hairier to retrofit this onto the mess of callbacks that is __blockdev_direct_IO() than it is to have a separate path. But that doesn't affect the interface, and other filesystems may be able to share more code with the direct IO path.