Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Sat, 9 Mar 2024 08:24:41 -0800

On Thu, Mar 07, 2024 at 07:50:36PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 02:49:03PM -0800, Eric Biggers wrote:
> > On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote:
> > > On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> > > > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > > > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > > > > index b3506f56e180..dad33e6ff0d6 100644
> > > > > --- a/fs/verity/fsverity_private.h
> > > > > +++ b/fs/verity/fsverity_private.h
> > > > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> > > > >  
> > > > >  void __init fsverity_init_workqueue(void);
> > > > >  
> > > > > +/*
> > > > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > > > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > > > > + * block->context.
> > > > > + */
> > > > > +void fsverity_drop_block(struct inode *inode,
> > > > > +			 struct fsverity_blockbuf *block);
> > > > > +
> > > > >  #endif /* _FSVERITY_PRIVATE_H */
> > > > 
> > > > This should be paired with a helper function that reads a Merkle tree block by
> > > > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> > > > being consistent with having a helper function for drop, this would prevent code
> > > > duplication between verify_data_block() and fsverity_read_merkle_tree().
> > > > 
> > > > I recommend that it look like this:
> > > > 
> > > > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> > > > 				    unsigned long ra_bytes,
> > > > 				    struct fsverity_blockbuf *block);
> > > > 
> > > > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> > > > would be the number of bytes for the filesystem to (optionally) readahead if the
> > > > block is not yet cached.  I think that things work out simpler if these values
> > > > are measured in bytes, not blocks.  'block' would be at the end because it's an
> > > > output, and it can be confusing to interleave inputs and outputs in parameters.
> > > 
> > > FWIW I don't really like 'pos' here because that's usually short for
> > > "file position", which is a byte, and this looks a lot more like a
> > > merkle tree block number.
> > > 
> > > u64 blkno?
> > > 
> > > Or better yet use a typedef ("merkle_blkno_t") to make it really clear
> > > when we're dealing with a tree block number.  Ignore checkpatch
> > > complaining about typeedefs. :)
> > 
> > My suggestion is for 'pos' to be a byte position, in alignment with the
> > pagecache naming convention as well as ->write_merkle_tree_block which currently
> > uses a 'u64 pos' byte position too.
> > 
> > It would also work for it to be a block index, in which case it should be named
> > 'index' or 'blkno' and have type unsigned long.  I *think* that things work out
> > a bit cleaner if it's a byte position, but I could be convinced that it's
> > actually better for it to be a block index.
> 
> It's probably cleaner (or at least willy says so) to measure everything
> in bytes.  The only place that gets messy is if we want to cache merkle
> tree blocks in an xarray or something, in which case we'd want to >> by
> the block_shift to avoid leaving gaps.

...and now having just done that, yes, I would like to retain passing
log_blocksize to the ->read_merkle_tree_block function.  I cleaned it up
a bit for my own purposes:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fsverity-cleanups-6.9&id=4d36c90be4ac143f3e31989317bde7cd6c109c6e

Though I saw you had other feedback to Andrey about that. :)

--D

> > > > How about changing the prototype to:
> > > > 
> > > > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
> > > >
> > > > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> > > > can it happen in cases like filesystem corruption?  If it can only happen due to
> > > > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.
> > > 
> > > I think XFS only passes to _invalidate_* the same pos that was passed to
> > > ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
> > > problem.
> > > 
> > > Perhaps this function ought to note that @pos is supposed to be the same
> > > value that was given to ->read_merkle_tree_block?
> > > 
> > > Or: make the implementations return 1 for "reloaded from disk", 0 for
> > > "still in cache", or a negative error code.  Then fsverity can call
> > > the invalidation routine itself and XFS doesn't have to worry about this
> > > part.
> > > 
> > > (I think?  I have questions about the xfs_invalidate_blocks function.)
> > 
> > It looks like XFS can invalidate blocks other than the one being read by
> > ->read_merkle_tree_block.
> > 
> > If it really was only a matter of the single block being read, then it would
> > indeed be simpler to just make it a piece of information returned from
> > ->read_merkle_tree_block.
> > 
> > If the generic invalidation function is needed, it needs to be clearly
> > documented when filesystems are expected to invalidate blocks.
> 
> I /think/ the generic invalidation is only necessary with the weird
> DOUBLE_ALLOC thing.  If the other two implementations (ext4/f2fs)
> haven't needed a "nuke from orbit" function then xfs shouldn't require
> one too.
> 
> > > > > +
> > > > > +	/**
> > > > > +	 * Release the reference to a Merkle tree block
> > > > > +	 *
> > > > > +	 * @block: the block to release
> > > > > +	 *
> > > > > +	 * This is called when fs-verity is done with a block obtained with
> > > > > +	 * ->read_merkle_tree_block().
> > > > > +	 */
> > > > > +	void (*drop_block)(struct fsverity_blockbuf *block);
> > > > 
> > > > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block
> > > 
> > > Yep.  I noticed that xfs_verity.c doesn't put them together, which made
> > > me wonder if the write_merkle_tree_block path made use of that.  It
> > > doesn't, AFAICT.
> > > 
> > > And I think the reason is that when we're setting up the merkle tree,
> > > we want to stream the contents straight to disk instead of ending up
> > > with a huge cache that might not all be necessary?
> > 
> > In the current patchset, fsverity_blockbuf isn't used for writes (despite the
> > comment saying it is).  I think that's fine and that it keeps things simpler.
> > Filesystems can still cache the blocks that are passed to
> > ->write_merkle_tree_block if they want to.  That already happens on ext4 and
> > f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache.
> 
> Oh!  Maybe it should do that then.  At least the root block?
> 
> --D
> 
> > - Eric
> > 
>