On Tue, Dec 13, 2022 at 12:39:39PM -0800, Eric Biggers wrote:
> On Wed, Dec 14, 2022 at 07:33:19AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2022 at 11:08:45AM -0800, Eric Biggers wrote:
> > > On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> > > >
> > > > Also add check that block size == PAGE_SIZE as fs-verity doesn't
> > > > support different sizes yet.
> > >
> > > That's coming with
> > > https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@xxxxxxxxxx/T/#u,
> > > which I'll be resending soon and I hope to apply for 6.3.
> > > Review and testing of that patchset, along with its associated
> > > xfstests update
> > > (https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@xxxxxxxxxx/T/#u),
> > > would be greatly appreciated.
> > >
> > > Note, as proposed there will still be a limit of:
> > >
> > > 	merkle_tree_block_size <= fs_block_size <= page_size
> > >
> > > Hopefully you don't need fs_block_size > page_size or
> >
> > Yes, we will.
> >
> > This is back on my radar now that folios have settled down. It's
> > pretty trivial for XFS to do because we already support metadata
> > block sizes > filesystem block size. Here is an old prototype:
> >
> > https://lore.kernel.org/linux-xfs/20181107063127.3902-1-david@xxxxxxxxxxxxx/
>
> As per my follow-up response
> (https://lore.kernel.org/r/Y5jc7P1ZeWHiTKRF@sol.localdomain),
> I now think that wouldn't actually be a problem.

Good to hear.

> > > merkle_tree_block_size > fs_block_size?
> >
> > That's also a desirable addition.
> >
> > XFS is using xattrs to hold merkle tree blocks, so the merkle tree
> > storage is already independent of the filesystem block size and
> > page cache limitations. Being able to use 64kB merkle tree blocks
> > would be really handy for reducing the search depth and overall IO
> > footprint of really large files.
>
> Well, the main problem is that using a Merkle tree block of 64K would
> mean that you can never read less than 64K at a time.
Sure, but why does that matter? The typical cost of a 64kB IO is only
about 5% more than a 4kB IO, even on slow spinning storage. However, we
bring an order of magnitude more data into the cache with that IO, so
we can then process more data before we have to go to disk again and
take another latency hit.

FYI, we already have this large 64kB block size option for directories
in XFS - you can have a 4kB block size filesystem with a 64kB directory
block size. The larger block size is a little slower for small
directories because they have higher per-leaf block CPU processing
overhead, but once you get to millions of records in a single directory
or really high sustained IO load, the larger block size is *much*
faster because the reduced IO latency and improved search efficiency
more than make up for the higher per-block CPU processing overhead...

The merkle tree is little different - once we get into TB scale files,
the merkle tree is indexing millions of individual records. At that
point overall record lookup and IO efficiency dominates the data access
time, not the amount of data each individual IO retrieves from disk.

Keep in mind that the block size used for the merkle tree would be a
filesystem choice. If we have the capability to support 64kB merkle
tree blocks, then XFS can make the choice of what block size to use at
the point where we are measuring the file, because we know how large
the file is at that point. And because we're storing the merkle tree
blocks in xattrs, we know exactly what block size the merkle tree data
was stored in from the xattr metadata...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
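To put rough numbers on the search-depth and IO-footprint argument above: this is a back-of-the-envelope sketch, not anything from the patches under discussion. It assumes SHA-256 (fs-verity's default, 32-byte digests) and that data is hashed in merkle-tree-block-sized units, as fs-verity does; the helper name is made up for illustration.

```python
import math

HASH_SIZE = 32  # SHA-256 digest size, fs-verity's default hash

def merkle_stats(file_size, block_size, hash_size=HASH_SIZE):
    """Return (levels, tree_bytes) for a Merkle tree where both the
    data and the tree itself are chunked in block_size units."""
    fanout = block_size // hash_size   # hashes per tree block
    blocks = math.ceil(file_size / block_size)  # leaf hashes needed
    levels = 0
    tree_bytes = 0
    while blocks > 1:
        # pack this level's hashes into tree blocks; they become the
        # inputs to the next level up
        blocks = math.ceil(blocks / fanout)
        tree_bytes += blocks * block_size
        levels += 1
    return levels, tree_bytes

TiB = 1 << 40
for bs in (4096, 65536):
    levels, tree_bytes = merkle_stats(TiB, bs)
    print(f"{bs // 1024}kB blocks: {levels} levels, "
          f"{tree_bytes >> 20} MiB of tree")
```

For a 1TiB file this works out to a 4-level tree of roughly 8GiB with 4kB blocks versus a 3-level tree of roughly 512MiB with 64kB blocks - one less IO per lookup and an order of magnitude less tree metadata to read and cache.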