Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 7 Mar 2024 13:54:01 -0800

On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > index b3506f56e180..dad33e6ff0d6 100644
> > --- a/fs/verity/fsverity_private.h
> > +++ b/fs/verity/fsverity_private.h
> > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> >  
> >  void __init fsverity_init_workqueue(void);
> >  
> > +/*
> > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > + * block->context.
> > + */
> > +void fsverity_drop_block(struct inode *inode,
> > +			 struct fsverity_blockbuf *block);
> > +
> >  #endif /* _FSVERITY_PRIVATE_H */
> 
> This should be paired with a helper function that reads a Merkle tree block by
> calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> being consistent with having a helper function for drop, this would prevent code
> duplication between verify_data_block() and fsverity_read_merkle_tree().
> 
> I recommend that it look like this:
> 
> int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> 				    unsigned long ra_bytes,
> 				    struct fsverity_blockbuf *block);
> 
> 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> would be the number of bytes for the filesystem to (optionally) readahead if the
> block is not yet cached.  I think that things work out simpler if these values
> are measured in bytes, not blocks.  'block' would be at the end because it's an
> output, and it can be confusing to interleave inputs and outputs in parameters.

FWIW I don't really like 'pos' here because that's usually short for
"file position", which is a byte, and this looks a lot more like a
merkle tree block number.

u64 blkno?

Or better yet use a typedef ("merkle_blkno_t") to make it really clear
when we're dealing with a tree block number.  Ignore checkpatch
complaining about typeedefs. :)

> Also, the drop function should be named consistently:
> 
> void fsverity_drop_merkle_tree_block(struct inode *inode,
>                                      struct fsverity_blockbuf *block);
> 
> It would be a bit long, but I think it's helpful for consistency and also
> because there are other types of "blocks" that it could be confused with.
> (Data blocks, filesystem blocks, internal blocks of the hash function.)

I agree.

> > diff --git a/fs/verity/open.c b/fs/verity/open.c
> > index fdeb95eca3af..6e6922b4b014 100644
> > --- a/fs/verity/open.c
> > +++ b/fs/verity/open.c
> > @@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
> >  	if (err)
> >  		goto fail;
> >  
> > -	if (vi->tree_params.block_size != PAGE_SIZE) {
> > +	/*
> > +	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
> > +	 * fs-verity should use hash_block_verified bitmap as there's no page
> > +	 * to mark it with PG_checked.
> > +	 */
> > +	if (vi->tree_params.block_size != PAGE_SIZE ||
> > +			inode->i_sb->s_vop->read_merkle_tree_block) {
> 
> Technically, all filesystems "pass Merkle tree blocks to fs-verity".  Maybe
> stick to calling it "block-based Merkle tree caching", as you have in the commit
> title, as that seems clearer.
> 
> > diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> > index f58432772d9e..5da40b5a81af 100644
> > --- a/fs/verity/read_metadata.c
> > +++ b/fs/verity/read_metadata.c
> > @@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
> >  {
> >  	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> >  	u64 end_offset;
> > -	unsigned int offs_in_page;
> > +	unsigned int offs_in_block;
> >  	pgoff_t index, last_index;
> >  	int retval = 0;
> >  	int err = 0;
> > +	const unsigned int block_size = vi->tree_params.block_size;
> > +	const u8 log_blocksize = vi->tree_params.log_blocksize;
> >  
> >  	end_offset = min(offset + length, vi->tree_params.tree_size);
> >  	if (offset >= end_offset)
> >  		return 0;
> > -	offs_in_page = offset_in_page(offset);
> > -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> > +	offs_in_block = offset & (block_size - 1);
> > +	last_index = (end_offset - 1) >> log_blocksize;
> >  
> >  	/*
> > -	 * Iterate through each Merkle tree page in the requested range and copy
> > -	 * the requested portion to userspace.  Note that the Merkle tree block
> > -	 * size isn't important here, as we are returning a byte stream; i.e.,
> > -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> > +	 * Iterate through each Merkle tree block in the requested range and
> > +	 * copy the requested portion to userspace. Note that we are returning
> > +	 * a byte stream.
> >  	 */
> > -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> > +	for (index = offset >> log_blocksize; index <= last_index; index++) {
> >  		unsigned long num_ra_pages =
> >  			min_t(unsigned long, last_index - index + 1,
> >  			      inode->i_sb->s_bdi->io_pages);
> >  		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> > -						   PAGE_SIZE - offs_in_page);
> > -		struct page *page;
> > -		const void *virt;
> > +						   block_size - offs_in_block);
> > +		struct fsverity_blockbuf block = {
> > +			.size = block_size,
> > +		};
> >  
> > -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> > -		if (IS_ERR(page)) {
> > -			err = PTR_ERR(page);
> > +		if (!vops->read_merkle_tree_block) {
> > +			unsigned int blocks_per_page =
> > +				vi->tree_params.blocks_per_page;
> > +			unsigned long page_idx =
> > +				round_down(index, blocks_per_page);
> > +			struct page *page = vops->read_merkle_tree_page(inode,
> > +					page_idx, num_ra_pages);
> > +
> > +			if (IS_ERR(page)) {
> > +				err = PTR_ERR(page);
> > +			} else {
> > +				block.kaddr = kmap_local_page(page) +
> > +					((index - page_idx) << log_blocksize);
> > +				block.context = page;
> > +			}
> > +		} else {
> > +			err = vops->read_merkle_tree_block(inode,
> > +					index << log_blocksize,
> > +					&block, log_blocksize, num_ra_pages);
> > +		}
> > +
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %d reading Merkle tree page %lu",
> > -				     err, index);
> > +				     "Error %d reading Merkle tree block %lu",
> > +				     err, index << log_blocksize);
> >  			break;
> >  		}
> >  
> > -		virt = kmap_local_page(page);
> > -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> > -			kunmap_local(virt);
> > -			put_page(page);
> > +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> > +			fsverity_drop_block(inode, &block);
> >  			err = -EFAULT;
> >  			break;
> >  		}
> > -		kunmap_local(virt);
> > -		put_page(page);
> > +		fsverity_drop_block(inode, &block);
> >  
> >  		retval += bytes_to_copy;
> >  		buf += bytes_to_copy;
> > @@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
> >  			break;
> >  		}
> >  		cond_resched();
> > -		offs_in_page = 0;
> > +		offs_in_block = 0;
> >  	}
> >  	return retval ? retval : err;
> >  }
> 
> This is one of the two places that would be simplified quite a bit by adding an
> fsverity_read_merkle_tree_block() helper function, especially if the loop
> variable was changed to measure bytes instead of blocks so that the Merkle tree
> block position and readahead amount can more easily be computed in bytes.  E.g.:

Yes!  I really agree here -- I downloaded the git repo, tried to read
this function, and felt that it would be a lot easier to understand if
the two access paths were separate functions instead of a lot of
switched logic within one function.

> static int fsverity_read_merkle_tree(struct inode *inode,
> 				     const struct fsverity_info *vi,
> 				     void __user *buf, u64 pos, int length)
> {
> 	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> 	[...]
> 
> 	while (pos < end_pos) {
> 		// bytes_to_copy = remaining length in current block
> 		[...]
> 		err = fsverity_read_merkle_tree_block(inode,
> 						      pos - offs_in_block,
> 						      ra_bytes,
> 						      &block);
> 		[...]
> 		pos += bytes_to_copy;
> 		[...]
> 	}
> 
> > diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> > index 4fcad0825a12..de71911d400c 100644
> > --- a/fs/verity/verify.c
> > +++ b/fs/verity/verify.c
> > @@ -13,14 +13,17 @@
> >  static struct workqueue_struct *fsverity_read_workqueue;
> >  
> >  /*
> > - * Returns true if the hash block with index @hblock_idx in the tree, located in
> > - * @hpage, has already been verified.
> > + * Returns true if the hash block with index @hblock_idx in the tree has
> > + * already been verified.
> >   */
> > -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> > +static bool is_hash_block_verified(struct inode *inode,
> > +				   struct fsverity_blockbuf *block,
> >  				   unsigned long hblock_idx)
> >  {
> >  	unsigned int blocks_per_page;
> >  	unsigned int i;
> > +	struct fsverity_info *vi = inode->i_verity_info;
> > +	struct page *hpage = (struct page *)block->context;
> >  
> >  	/*
> >  	 * When the Merkle tree block size and page size are the same, then the
> > @@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> >  	if (!vi->hash_block_verified)
> >  		return PageChecked(hpage);
> >  
> > +	/*
> > +	 * Filesystems which use block based caching (e.g. XFS) always use
> > +	 * bitmap.
> > +	 */
> > +	if (inode->i_sb->s_vop->read_merkle_tree_block)
> > +		return test_bit(hblock_idx, vi->hash_block_verified);
> >  	/*
> >  	 * When the Merkle tree block size and page size differ, we use a bitmap
> >  	 * to indicate whether each hash block has been verified.
> 
> This is hard to follow because the code goes from handling page-based caching to
> handling block-based caching to handling page-based caching again.  Also, the
> comment says that block-based caching uses the bitmap, but it doesn't give any
> hint at why the bitmap is being handled differently in that case from the
> page-based caching case that also uses the bitmap.
> 
> How about doing something like the following where the block-based caching case
> is handled first, before either block-based caching case:
> 
> 	/*
> 	 * If the filesystem uses block-based caching, then
> 	 * ->hash_block_verified is always used and the filesystem pushes
> 	 * invalidations to it as needed.
> 	 */
> 	if (inode->i_sb->s_vop->read_merkle_tree_block)
> 		return test_bit(hblock_idx, vi->hash_block_verified);
> 
> 	/* Otherwise, the filesystem uses page-based caching. */
> 	hpage = (struct page *)block->context;
> 
> Note, to avoid confusion, the 'hpage' variable shouldn't be set until it's been
> verified that the filesystem is using page-based caching.

(Yep.)

> > @@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> > verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		  const void *data, u64 data_pos, unsigned long max_ra_pages)
> 
> max_ra_pages should become max_ra_bytes.
> 
> > +	int num_ra_pages;
> 
> unsigned long ra_bytes.  Also, this should be declared in the scope later on
> where it's actually needed.

Agreed.

> > @@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >  		unsigned long next_hidx;
> >  		unsigned long hblock_idx;
> >  		pgoff_t hpage_idx;
> > +		u64 hblock_pos;
> >  		unsigned int hblock_offset_in_page;
> 
> hpage_idx and hblock_offset_in_page should both go away.
> fsverity_read_merkle_tree_block() should do the conversion to/from pages.
> 
> >  		unsigned int hoffset;
> >  		struct page *hpage;
> > -		const void *haddr;
> > +		struct fsverity_blockbuf *block = &hblocks[level].block;
> >  
> >  		/*
> >  		 * The index of the block in the current level; also the index
> > @@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		/* Index of the hash block in the tree overall */
> >		hblock_idx = params->level_start[level] + next_hidx;
> >
> >		/* Byte offset of the hash block within the page */
> >  		hblock_offset_in_page =
> >  			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
> >  
> > +		/* Offset of the Merkle tree block into the tree */
> > +		hblock_pos = hblock_idx << params->log_blocksize;
> 
> The comment for hblock_pos should say:
> 
> 	/* Byte offset of the hash block in the tree overall */
> 
> ... so that it's consistent with the one for hblock_idx which says:
> 
> 	/* Index of the hash block in the tree overall */
> 
> > +		num_ra_pages = level == 0 ?
> > +			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
> > +
> > +		if (inode->i_sb->s_vop->read_merkle_tree_block) {
> > +			err = inode->i_sb->s_vop->read_merkle_tree_block(
> > +				inode, hblock_pos, block, params->log_blocksize,
> > +				num_ra_pages);
> > +		} else {
> > +			unsigned int blocks_per_page =
> > +				vi->tree_params.blocks_per_page;
> > +			hblock_idx = round_down(hblock_idx, blocks_per_page);
> > +			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
> > +				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
> > +
> > +			if (IS_ERR(hpage)) {
> > +				err = PTR_ERR(hpage);
> > +			} else {
> > +				block->kaddr = kmap_local_page(hpage) +
> > +					hblock_offset_in_page;
> > +				block->context = hpage;
> > +			}
> > +		}
> 
> This has the weirdness with the readahead amount again, where it's still being
> measured in pages.  I think it would make more sense to measure it in bytes.  It
> would stay bytes when passed to ->read_merkle_tree_block, or be converted to
> pages at the last minute for ->read_merkle_tree_page which works in pages.
> 
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %ld reading Merkle tree page %lu",
> > -				     PTR_ERR(hpage), hpage_idx);
> > +				     "Error %d reading Merkle tree block %lu",
> > +				     err, hblock_idx);
> >  			goto error;
> >  		}
> 
> This error message should go in fsverity_read_merkle_tree_block().
> 
> Doing that would also eliminate the need to add an 'err' variable to
> verify_data_block(), which could be confusing because it returns a bool.
> 
> >  	/* Descend the tree verifying hash blocks. */
> >  	for (; level > 0; level--) {
> > -		struct page *hpage = hblocks[level - 1].page;
> > -		const void *haddr = hblocks[level - 1].addr;
> > +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> > +		const void *haddr = block->kaddr;
> >  		unsigned long hblock_idx = hblocks[level - 1].index;
> >  		unsigned int hoffset = hblocks[level - 1].hoffset;
> > +		struct page *hpage = (struct page *)block->context;
> >  
> >  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
> >  			goto error;
> > @@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		if (vi->hash_block_verified)
> >			set_bit(hblock_idx, vi->hash_block_verified);
> >		else
> >  			SetPageChecked(hpage);
> 
> Since the page is only valid when vi->hash_block_verified is NULL, remove the
> hpage variable and do 'SetPageChecked((struct page *)block->context);'.
> Yes, there's no issue until the pointer is actually dereferenced, but it's
> confusing to have incorrectly-typed pointer variables sitting around.
> 
> > +/**
> > + * fsverity_invalidate_block() - invalidate Merkle tree block
> > + * @inode: inode to which this Merkle tree blocks belong
> > + * @block: block to be invalidated
> > + *
> > + * This function invalidates/clears "verified" state of Merkle tree block
> > + * in the fs-verity bitmap. The block needs to have ->offset set.
> > + */
> > +void fsverity_invalidate_block(struct inode *inode,
> > +		struct fsverity_blockbuf *block)
> > +{
> > +	struct fsverity_info *vi = inode->i_verity_info;
> > +	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
> > +
> > +	if (block->offset > vi->tree_params.tree_size) {
> > +		fsverity_err(inode,
> > +"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
> > +			     vi->tree_params.tree_size, block->offset);
> > +		return;
> > +	}
> > +
> > +	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
> > +}
> > +EXPORT_SYMBOL_GPL(fsverity_invalidate_block);
> 
> I don't think this function should be using struct fsverity_blockbuf.  It only
> uses the 'offset' field, which is orthogonal to what ->read_merkle_tree_block
> and ->drop_merkle_tree_block use.
> 
> The function name fsverity_invalidate_block() also has the issue where it's not
> clear what type of block it's talking about.
> 
> How about changing the prototype to:
> 
> void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
>
> Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> can it happen in cases like filesystem corruption?  If it can only happen due to
> kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.

I think XFS only passes to _invalidate_* the same pos that was passed to
->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
problem.

Perhaps this function ought to note that @pos is supposed to be the same
value that was given to ->read_merkle_tree_block?

Or: make the implementations return 1 for "reloaded from disk", 0 for
"still in cache", or a negative error code.  Then fsverity can call
the invalidation routine itself and XFS doesn't have to worry about this
part.

(I think?  I have questions about the xfs_invalidate_blocks function.)

> Please note that there's also an off-by-one error.  pos > tree_size should be
> pos >= tree_size.
> 
> > +void fsverity_drop_block(struct inode *inode,
> > +		struct fsverity_blockbuf *block)
> 
> fsverity_drop_merkle_tree_block
> 
> > +{
> > +	if (inode->i_sb->s_vop->drop_block)
> > +		inode->i_sb->s_vop->drop_block(block);
> > +	else {
> 
> Add braces above.
> 
> (See line 213 of Documentation/process/coding-style.rst)
> 
> > +		struct page *page = (struct page *)block->context;
> > +
> > +		kunmap_local(block->kaddr);
> > +		put_page(page);
> 
> put_page((struct page *)block->context), and remove the page variable
> 
> > +	}
> > +	block->kaddr = NULL;
> 
> Also set block->context = NULL
> 
> > +/**
> > + * struct fsverity_blockbuf - Merkle Tree block buffer
> > + * @kaddr: virtual address of the block's data
> > + * @offset: block's offset into Merkle tree
> > + * @size: the Merkle tree block size
> > + * @context: filesystem private context
> > + *
> > + * Buffer containing single Merkle Tree block. These buffers are passed
> > + *  - to filesystem, when fs-verity is building merkel tree,
> > + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> > + * Filesystems sets kaddr together with size to point to a memory which contains
> > + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> > + * written down to disk.
> > + *
> > + * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
> > + * ->drop_block to let filesystem know that memory can be freed.
> > + *
> > + * The context is optional. This field can be used by filesystem to passthrough
> > + * state from ->read_merkle_tree_block to ->drop_block.
> > + */
> > +struct fsverity_blockbuf {
> > +	void *kaddr;
> > +	u64 offset;
> > +	unsigned int size;
> > +	void *context;
> > +};
> 
> If this is only used by ->read_merkle_tree_block and ->drop_merkle_tree_block,
> then it can be simplified.  The only fields needed are kaddr and context.
> 
> The comment also still incorrectly claims that the struct is used for writes.
> 
> > +	/**
> > +	 * Read a Merkle tree block of the given inode.
> > +	 * @inode: the inode
> > +	 * @pos: byte offset of the block within the Merkle tree
> > +	 * @block: block buffer for filesystem to point it to the block
> > +	 * @log_blocksize: log2 of the size of the expected block
> > +	 * @ra_bytes: The number of bytes that should be
> > +	 *		prefetched starting at @pos if the page at @pos
> > +	 *		isn't already cached.  Implementations may ignore this
> > +	 *		argument; it's only a performance optimization.
> > +	 *
> > +	 * This can be called at any time on an open verity file.  It may be
> > +	 * called by multiple processes concurrently.
> > +	 *
> > +	 * In case that block was evicted from the memory filesystem has to use
> > +	 * fsverity_invalidate_block() to let fsverity know that block's
> > +	 * verification state is not valid anymore.

Ah, ok, that's why fsverity_invalidate_block only does one block at a
time.

> > +	 *
> > +	 * Return: 0 on success, -errno on failure
> > +	 */
> > +	int (*read_merkle_tree_block)(struct inode *inode,
> > +				      u64 pos,
> > +				      struct fsverity_blockbuf *block,
> > +				      unsigned int log_blocksize,
> > +				      u64 ra_bytes);
> 
> How about:
> 	
> 	int (*read_merkle_tree_block)(struct inode *inode,
> 				      u64 pos,
> 				      unsigned int size,
> 				      unsigned long ra_bytes,
> 				      struct fsverity_blockbuf *block);
> 
> - size instead of log_blocksize, for consistency with other functions and
>   because it's what XFS actually wants
> 
> - ra_bytes as unsigned long, since u64 readahead amounts don't really make
>   sense.  (This isn't too important though; it could be u64 if you really want.)
> 
> - block at the end since it's an output parameter
> 
> It also needs to be clearly documented that filesystems must implement either
> ->read_merkle_tree_page, *or* both ->read_merkle_tree_block and
> ->drop_merkle_tree_block.
> 
> Also, the comment about invalidating the block is still unclear regarding the
> timing of when the filesystem must do it.
> 
> > +
> > +	/**
> > +	 * Release the reference to a Merkle tree block
> > +	 *
> > +	 * @block: the block to release
> > +	 *
> > +	 * This is called when fs-verity is done with a block obtained with
> > +	 * ->read_merkle_tree_block().
> > +	 */
> > +	void (*drop_block)(struct fsverity_blockbuf *block);
> 
> drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block

Yep.  I noticed that xfs_verity.c doesn't put them together, which made
me wonder if the write_merkle_tree_block path made use of that.  It
doesn't, AFAICT.

And I think the reason is that when we're setting up the merkle tree,
we want to stream the contents straight to disk instead of ending up
with a huge cache that might not all be necessary?

--D

> - Eric
>