Re: [PATCH man-pages v5] Document encoded I/O

Omar Sandoval <osandov@xxxxxxxxxxx> · Mon, 24 Aug 2020 11:15:03 -0700

On Fri, Aug 21, 2020 at 12:24:48PM +0300, Amir Goldstein wrote:
> On Fri, Aug 21, 2020 at 10:38 AM Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
> >
> > From: Omar Sandoval <osandov@xxxxxx>
> >
> > This adds a new page, encoded_io(7), providing an overview of encoded
> > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > reference it.
> >
> > Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> > Cc: linux-man <linux-man@xxxxxxxxxxxxxxx>
> > Signed-off-by: Omar Sandoval <osandov@xxxxxx>
> > ---
> 
> Omar,
> 
> Thanks for making the clarifications. Some questions below.
> 
> [...]
> 
> > +.PP
> > +As the filesystem page cache typically contains decoded data,
> > +encoded I/O bypasses the page cache.
> > +.SS Extent layout
> > +By using
> > +.IR len ,
> > +.IR unencoded_len ,
> > +and
> > +.IR unencoded_offset ,
> > +it is possible to refer to a subset of an unencoded extent.
> > +.PP
> > +In the simplest case,
> > +.I len
> > +is equal to
> > +.I unencoded_len
> > +and
> > +.I unencoded_offset
> > +is zero.
> > +This means that the entire unencoded extent is used.
> > +.PP
> > +However, suppose we read 50 bytes into a file
> > +which contains a single compressed extent.
> > +The filesystem must still return the entire compressed extent
> > +for us to be able to decompress it,
> > +so
> > +.I unencoded_len
> > +would be the length of the entire decompressed extent.
> > +However, because the read was at offset 50,
> > +the first 50 bytes should be ignored.
> > +Therefore,
> > +.I unencoded_offset
> > +would be 50,
> > +and
> > +.I len
> > +would accordingly be
> > +.IR unencoded_len\ -\ 50 .
> > +.PP
> > +Additionally, suppose we want to create an encrypted file with length 500,
> > +but the file is encrypted with a block cipher using a block size of 4096.
> > +The unencoded data would therefore include the appropriate padding,
> > +and
> > +.I unencoded_len
> > +would be 4096.
> > +However, to represent the logical size of the file,
> > +.I len
> > +would be 500
> > +(and
> > +.I unencoded_offset
> > +would be 0).
> > +.PP
> > +Similar situations can arise in other cases:
> > +.IP * 3
> > +If the filesystem pads data to the filesystem block size before compressing,
> > +then compressed files with a size unaligned to the filesystem block size will
> > +end with an extent with
> > +.I len
> > +<
> > +.IR unencoded_len .
> > +.IP *
> > +Extents cloned from the middle of a larger encoded extent with
> > +.B FICLONERANGE
> > +may have a non-zero
> > +.I unencoded_offset
> > +and/or
> > +.I len
> > +<
> > +.IR unencoded_len .
> > +.IP *
> > +If the middle of an encoded extent is overwritten,
> > +the filesystem may create extents with a non-zero
> > +.I unencoded_offset
> > +and/or
> > +.I len
> > +<
> > +.I unencoded_len
> > +for the parts that were not overwritten.
> 
> So in this case, would the reader be getting extents "out of unencoded order"?
> e.g. unencoded range 0..4096 and then unencoded range 10..20?
> Or would reader be reading the encoded full block twice, once for
> ragne 0..10 and once for range 20..4096?

The latter. If the file refers to the same encoded data twice, reading
the file sequentially with RWF_ENCODED will return it twice (with
different offsets each time). This is obviously not perfect, but it
keeps the interface simpler: the abstraction is not "what exactly is the
extent layout of the file" but rather "I want to read this logical range
of data", even if that involves pulling in some details from the extent
metadata.

> > +.SS Security
> > +Encoded I/O creates the potential for some security issues:
> > +.IP * 3
> > +Encoded writes allow writing arbitrary data which the kernel will decode on
> > +a subsequent read. Decompression algorithms are complex and may have bugs
> > +which can be exploited by maliciously crafted data.
> > +.IP *
> > +Encoded reads may return data which is not logically present in the file
> > +(see the discussion of
> > +.I len
> > +vs.
> > +.I unencoded_len
> > +above).
> > +It may not be intended for this data to be readable.
> > +.PP
> > +Therefore, encoded I/O requires privilege.
> > +Namely, the
> > +.B RWF_ENCODED
> > +flag may only be used when the file was opened with the
> > +.B O_ALLOW_ENCODED
> > +flag to
> > +.BR open (2),
> > +which requires the
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.B O_ALLOW_ENCODED
> > +may be set and cleared with
> > +.BR fcntl (2).
> > +Note that it is not cleared on
> > +.BR fork (2)
> > +or
> > +.BR execve (2);
> > +one may wish to use
> > +.B O_CLOEXEC
> > +with
> > +.BR O_ALLOW_ENCODED .
> > +.SS Filesystem support
> > +Encoded I/O is supported on the following filesystems:
> > +.TP
> > +Btrfs (since Linux 5.10)
> > +.IP
> > +Btrfs supports encoded reads and writes of compressed data.
> > +The data is encoded as follows:
> > +.RS
> > +.IP * 3
> > +If
> > +.I compression
> > +is
> > +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> > +then the encoded data is a single zlib stream.
> > +.IP *
> > +If
> > +.I compression
> > +is
> > +.BR ENCODED_IOV_COMPRESSION_LZO ,
> > +then the encoded data is compressed page by page with LZO1X
> > +and wrapped in the format documented in the Linux kernel source file
> > +.IR fs/btrfs/lzo.c .
> 
> :-/ So maybe call it ENCODED_IOV_COMPRESSION_BTRFS_LZO?
> 
> I understand why you want the encoding format not to be opaque, but
> I imagine the encoded data is not going to be migrated as is between
> different filesystems. So just call it for what it is - a private
> filesystem encoding
> format. If you have a format that is standard and other filesystems are likely
> to use, fine, but let's not make an API that discourages using
> "private" encoding, just for the sake of it and make life harder for no good
> reason.
> 
> All the reader of this man page may be interested to know is which
> filesystems are expected to support which encoding types and a general
> description of what they mean (as you did).
> Making this page wrongly appear as a standard for encoding formats is not
> going to play out well...
> 
> > +.IP *
> > +If
> > +.I compression
> > +is
> > +.BR ENCODED_IOV_COMPRESSION_ZSTD ,
> > +then the encoded data is a single zstd frame compressed with the
> > +.I windowLog
> > +compression parameter set to no more than 17.
> 
> Even that small detail is a bit limiting to filesystems and should
> therefore be tagged as a private btrfs encoding IMO.

Agreed, I'll make the LZO and ZSTD encodings Btrfs-specific. My
assumption was that decoders would look at the filesystem type from,
say, statfs(2), but making it explicit in the encoding is much better.
On the other hand, I think ENCODED_IOV_COMPRESSION_ZLIB is generic
enough to be reused.

Thanks for taking a look!