On Tue, Dec 13, 2022 at 12:39:39PM -0800, Eric Biggers wrote:
> On Wed, Dec 14, 2022 at 07:33:19AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2022 at 11:08:45AM -0800, Eric Biggers wrote:
> > > On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> > > >
> > > > Also add check that block size == PAGE_SIZE as fs-verity doesn't
> > > > support different sizes yet.
> > >
> > > That's coming with
> > > https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@xxxxxxxxxx/T/#u,
> > > which I'll be resending soon and I hope to apply for 6.3.
> > > Review and testing of that patchset, along with its associated
> > > xfstests update
> > > (https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@xxxxxxxxxx/T/#u),
> > > would be greatly appreciated.
> > >
> > > Note, as proposed there will still be a limit of:
> > >
> > > 	merkle_tree_block_size <= fs_block_size <= page_size
> > >
> > > Hopefully you don't need fs_block_size > page_size or
> >
> > Yes, we will.
> >
> > This is back on my radar now that folios have settled down. It's
> > pretty trivial for XFS to do because we already support metadata
> > block sizes > filesystem block size. Here is an old prototype:
> >
> > https://lore.kernel.org/linux-xfs/20181107063127.3902-1-david@xxxxxxxxxxxxx/
>
> As per my follow-up response
> (https://lore.kernel.org/r/Y5jc7P1ZeWHiTKRF@sol.localdomain),
> I now think that wouldn't actually be a problem.

Good to hear.

> > > merkle_tree_block_size > fs_block_size?
> >
> > That's also a desirable addition.
> >
> > XFS is using xattrs to hold merkle tree blocks, so the merkle tree
> > storage is already independent of the filesystem block size and
> > page cache limitations. Being able to use 64kB merkle tree blocks
> > would be really handy for reducing the search depth and overall IO
> > footprint of really large files.
>
> Well, the main problem is that using a Merkle tree block of 64K would
> mean that you can never read less than 64K at a time.
Sure, but why does that matter? The typical cost of a 64kB IO is only
about 5% more than a 4kB IO, even on slow spinning storage. However, we
bring an order of magnitude more data into the cache with that IO, so
we can then process more data before we have to go to disk again and
take another latency hit.

FYI, we already have this large 64kB block size option for directories
in XFS - you can have a 4kB block size filesystem with a 64kB directory
block size. The larger block size is a little slower for small
directories because they have higher per-leaf block CPU processing
overhead, but once you get to millions of records in a single directory
or really high sustained IO load, the larger block size is *much*
faster because the reduced IO latency and improved search efficiency
more than make up for the higher per-block CPU processing overhead...

The merkle tree is little different - once we get into TB scale files,
the merkle tree is indexing millions of individual records. At that
point overall record lookup and IO efficiency dominates the data access
time, not the amount of data each individual IO retrieves from disk.

Keep in mind that the block size used for the merkle tree would be a
filesystem choice. If we have the capability to support 64kB merkle
tree blocks, then XFS can make the choice of what block size to use at
the point where we are measuring the file, because we know how large
the file is at that point. And because we're storing the merkle tree
blocks in xattrs, we know exactly what block size the merkle tree data
was stored in from the xattr metadata...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
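To put rough numbers on the search-depth and IO-footprint argument above: this is a back-of-the-envelope sketch, not anything from the patches under discussion. It assumes SHA-256 (fs-verity's default, 32-byte digests) and that data is hashed in merkle-tree-block-sized units, as fs-verity does; the helper name is made up for illustration.

```python
import math

HASH_SIZE = 32  # SHA-256 digest size, fs-verity's default hash

def merkle_stats(file_size, block_size, hash_size=HASH_SIZE):
    """Return (levels, tree_bytes) for a Merkle tree where both the
    data and the tree itself are chunked in block_size units."""
    fanout = block_size // hash_size   # hashes per tree block
    blocks = math.ceil(file_size / block_size)  # leaf hashes needed
    levels = 0
    tree_bytes = 0
    while blocks > 1:
        # pack this level's hashes into tree blocks; they become the
        # inputs to the next level up
        blocks = math.ceil(blocks / fanout)
        tree_bytes += blocks * block_size
        levels += 1
    return levels, tree_bytes

TiB = 1 << 40
for bs in (4096, 65536):
    levels, tree_bytes = merkle_stats(TiB, bs)
    print(f"{bs // 1024}kB blocks: {levels} levels, "
          f"{tree_bytes >> 20} MiB of tree")
```

For a 1TiB file this works out to a 4-level tree of roughly 8GiB with 4kB blocks versus a 3-level tree of roughly 512MiB with 64kB blocks - one less IO per lookup and an order of magnitude less tree metadata to read and cache.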