On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote: > On 10:47 07/10, Darrick J. Wong wrote: > > On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote: > > > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote: > > > > Introduce read_inline() function hook for reading inline extents. This > > > > is performed for filesystems such as btrfs which may compress the data > > > > in the inline extents. > > > > Why don't you set iomap->inline_data to the uncompressed buffer, let > > readahead copy it to the pagecache, and free it in ->iomap_end? > > This will increase the number of copies. BTRFS uncompresses directly > into pagecache. Yes, this is an option but at the cost of efficiency. Is that such a problem? The process of decompression means the data is already hot in cache, so there isn't a huge extra penalty for moving this inline data a second time... > > > This feels like an attempt to work around "iomap doesn't support > > > compressed extents" by keeping the decompression in the filesystem, > > > instead of extending iomap to support compressed extents itself. > > > I'd certainly prefer iomap to support compressed extents, but maybe I'm > > > in a minority here. > > > > I'm not an expert on fs compression, but I get the impression that most > > filesystems handle reads by allocating some folios, reading off the disk > > into those folios, and decompressing into the pagecache folios. It > > might be kind of amusing to try to hoist that into the vfs/iomap at some > > point, but I think the next problem you'd run into is that fscrypt has > > similar requirements, since it's also a data transformation step. > > fsverity I think is less complicated because it only needs to read the > > pagecache contents at the very end to check it against the merkle tree. > > > > That, I think, is why this btrfs iomap port want to override submit_bio, > > right? So that it can allocate a separate set of folios, create a > > second bio around that, submit the second bio, decode what it read off > > the disk into the folios of the first bio, then "complete" the first bio > > so that iomap only has to update the pagecache state and doesn't have to > > know about the encoding magic? > > Yes, but that is not the only reason. BTRFS also calculates and checks > block checksums for data read during bio completions. This is no different to doing fsverity checks at read BIO io completion. We should be using the same mechanism in iomap for fsverity IO completion verification as filesystems do for data checksum verification because, conceptually speaking, they are the same operation. > > And then, having established that beachhead, porting btrfs fscrypt is > > a simple matter of adding more transformation steps to the ioend > > processing of the second bio (aka the one that actually calls the disk), > > right? And I think all that processing stuff is more or less already in > > place for the existing read path, so it should be trivial (ha!) to > > call it in an iomap context instead of straight from btrfs. > > iomap_folio_state notwithstanding, of course. > > > > Hmm. I'll have to give some thought to what would the ideal iomap data > > transformation pipeline look like? > > The order of transformation would make all the difference, and I am not > sure if everyone involved can come to a conclusion that all > transformations should be done in a particular decided order. I think there is only one viable order of data transformations to/from disk. That's because.... > FWIW, checksums are performed on what is read/written on disk. So > for writes, compression happens before checksums. .... there is specific ordering needed. For writes, the ordering is: 1. pre-write data compression - requires data copy 2. pre-write data encryption - requires data copy 3. pre-write data checksums - data read only 4. write the data 5. post-write metadata updates We cannot usefully perform compression after encryption - random data doesn't compress - and the checksum must match what is written to disk, so it has to come after all other transformations have been done. For reads, the order is: 1. read the data 2. verify the data checksum 3. decrypt the data - requires data copy 4. decompress the data - requires data copy 5. place the plain text data in the page cache To do 1) we need memory buffers allocated, 2) runs directly out of them. If no other transformations are required, then we can read the data directly into the folios in the page cache. However, if we have to decrypt and/or decompress the data, then we need multiple sets of bounce buffers - one of the data being read and one of the decrypted data. The data can be decompressed directly into the page cache. If there is no compression, the decryption should be done directly into the page cache. If there is nor decryption or compression, then the read IO should be done directly into the page cache and the checksum verification done by reading out of the page cache. IOWs, each step of the data read path needs to have "buffer" that is a set of folios that may or may not be the final page cache location. The other consideration here is we may want to be able to cache these data transformation layers when we are doing rapid RMW operations. e.g. on encrypted data, a RMW will need to allocate two sets of bounce buffers - one for the read, another for the write. If the encrypted data is also hosted in the page cache (at some high offset beyond EOF) then we don't need to allocate the bounce buffer during write.... > > Though I already have a sneaking suspicion that will morph into "If I > > wanted to add {crypt,verity,compression} to xfs how would I do that?" -> > > "How would I design a pipeine to handle all three to avoid bouncing > > pages between workqueue threads like ext4 does?" -> "Oh gosh now I have > > a totally different design than any of the existing implementations." -> > > "Well, crumbs. :(" I've actually been thinking about how to do data CRCs, encryption and compression with XFS through iomap. I've even started writing a design doc, based on feedback from the first fsverity implementation attempt. Andrey and I are essentially working towards a "direct mapped remote xattr" model for fsverity merkle tree data. fsverity reads and writes are marshalled through the page cache at a filesystem determined offset beyond EOF (same model as ext4, et al), but the data is mapped into remote xattr blocks. This requires fixing the mistake of putting metadata headers into xattr remote blocks by moving the CRC to the remote xattr name structure such that remote xattrs only contain data again. The read/write ops that are passed to iomap for such IO are fsverity implementation specific that understand that the high offset is to be mapped directly to a filesystem block sized remote xattr extent and the IO is done directly on that block. The remote xattr block CRC can be retreived when it is mapped and validated on read IO completion by the iomap code before passing the data to fsverity. The reason for doing using xattrs in this way is that it provides a generic mechanism for storing multiple sets of optional data related and/or size-transformed data within a file and within the single page caceh address space the inode provides. For example, it allows XFS to implement native data checksums. For this, we need to store 2 x 32bit CRCs per filesystem block. i.e. we need the old CRC and the new CRC so we can write/log the CRC update before we write the data, and if we crash before the data hits the disk we still know what the original CRC was and so can validate that the filesystem block contains entirely either the old or the new data when it is next read. i.e. one of the two CRC values should always be valid. By using direct mapping into the page cache for CRCs, we don't need to continually refetch the checksum data. It gets read in once, and a large read will continue to pull CRC data from the cached folio as we walk through the data we read from disk. We can fetch the CRC data concurrently with the data itself, and IO completion processing doesn't signalled until both have been brought into cache. For writes, we can calculate the new CRCs sequentially and flush the cached CRC page but to the xattr when we reach the end of it. The data write bio that was built as we CRC'd the data can then be flushed once the xattr data has reached stable storage. There's a bit more latency to writeback operations here, but nothing is ever free... Compression is where using xattrs gets interesting - the xattrs can have a fixed "offset" they blong to, but can store variable sized data records for that offset. If we say we have a 64kB compression block size, we can store the compressed data for a 64k block entirely in a remote xattr even if compression fails (i.e. we can store the raw data, not the expanded "compressed" data). The remote xattr can store any amount of smaller data, and we map the compressed data directly into the page cache at a high offset. Then decompression can run on the high offset pages with the destination being some other page cache offset.... On the write side, compression can be done directly into the high offset page cache range for that 64kb offset range, then we can map that to a remote xattr block and write the xattr. The xattr naturally handles variable size blocks. If we overwrite the compressed data (e.g. a 4kB RMW cycle in a 64kB block), then we essentially overwrite the compressed data in the page cache at writeback, extending the folio set for that record if it grows. Then on flush of the compressed record, we atomically replace the remote xattr we already have so we are guaranteed that we'll see either the old or the new compressed data at that point in time. The new remote xattr block is CRC'd by the xattr code, so if we are using compression-only, we don't actually need separate on-disk data CRC xattrs... Encryption/decryption doesn't require alternate data storage, so that just requires reading the encrypted data into a high page cache offset. However, if we are compressing then encrypting, then after the encryption step we'd need to store it via the compression method, not the uncompressed method. Hence there are some implementation details needed to be worked through here. ----- Overall, I'm looking an iomap mechanism that stages every part of the transformation stack in the page cache itself. We have address space to burn because we don't support file sizes larger than 8EiB. That leaves a full 8EiB of address space available for hosting non-user-visible ephemeral IO path data states. How to break up that address space for different data transformation uses is an open question. Files with transformed data would be limited to the size of the address space reserved by iomap for transformed data, so we want it to be large. XFS can't use more than 2^32 -extents- for xattrs on an inode at this point in time. That means >4 billion alternate data records can be stored, not that 256TB is the maximum offset that can be supported. Hence I'm tending towards at least 1PB per alternate address range, which would limit encrypted and/or compressed files to this size. That leaves over 8000 alternate address ranges iomap can support, but I suspect that we'll want fewer, larger ranges in preference to more, smaller ranges.... > > I'll start that by asking: Hey btrfs developers, what do you like and > > hate about the current way that btrfs handles fscrypt, compression, and > > fsverity? Assuming that you can set all three on a file, right? I'm definitely assuming that all 3 can be set at once, as well as low level filesystem data CRCs underneath all the layered "VFS" stuff. However, what layers we actually need at any point in time can be malleable. For XFS, compression would naturally be CRC'd by the xattr storage so it wouldn't need that layer at all. If fsverity is enabled and fs admin tooling understands fsverity, then we probably don't need fs checksums for fsverity files. so there definitely needs to be some flexibility in the stack to allow filesystems to elide transformation layers that are redundant... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx