Re: [PATCH] fsverity: support block-based Merkle tree caching

On Tue, May 14, 2024 at 06:53:20PM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@xxxxxxxxxx>
> 
> Currently fs/verity/ assumes that filesystems cache Merkle tree blocks
> in the page cache.  Specifically, it requires that filesystems provide a
> ->read_merkle_tree_page() method which returns a page of blocks.  It
> also stores the "is the block verified" flag in PG_checked, or (if there
> are multiple blocks per page) in a bitmap, with PG_checked used to
> detect cache evictions instead.  This solution is specific to the page
> cache, as a different cache would store the flag in a different way.
> 
> To allow XFS to use a custom Merkle tree block cache, this patch
> refactors the Merkle tree caching interface to be based around the
> concept of reading and dropping blocks (not pages), where the storage of
> the "is the block verified" flag is up to the implementation.
> 
> The existing pagecache based solution, used by ext4, f2fs, and btrfs, is
> reimplemented using this interface.
> 
> Co-developed-by: Andrey Albershteyn <aalbersh@xxxxxxxxxx>
> Signed-off-by: Andrey Albershteyn <aalbersh@xxxxxxxxxx>
> Co-developed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> Signed-off-by: Eric Biggers <ebiggers@xxxxxxxxxx>
> ---
> 
> This reworks the block-based caching patch to clean up many different
> things, including putting the pagecache based caching behind the same
> interface as suggested by Christoph.

I gather this means that you ported btrfs/f2fs/ext4 to use the read/drop
merkle_tree_block interfaces?

>                                       This applies to mainline commit
> a5131c3fdf26.  It corresponds to the following patches in Darrick's v5.6
> patchset:
> 
>     fsverity: convert verification to use byte instead of page offsets
>     fsverity: support block-based Merkle tree caching
>     fsverity: pass the merkle tree block level to fsverity_read_merkle_tree_block
>     fsverity: pass the zero-hash value to the implementation
> 
> (I don't really understand the split between the first two, as I see
> them as being logically part of the same change.  The new parameters
> would make sense to split out though.)

I separated the first two to reduce the mental burden of rebasing these
patches against new -rc1 kernels.  It's a lot less effort if one only
has to concentrate on one aspect at a time.  You might have heard that
it's difficult to add an xfs feature without it taking multiple kernel
cycles.

(That said, 6.10 wasn't bad at all.)

--D

> If we do go with my version of the patch, also let me know if there are
> any preferences for who should be author / co-developer / etc.
> 
>  fs/btrfs/verity.c            |  36 +++---
>  fs/ext4/verity.c             |  20 ++--
>  fs/f2fs/verity.c             |  20 ++--
>  fs/verity/fsverity_private.h |  13 ++-
>  fs/verity/open.c             |  38 ++++--
>  fs/verity/read_metadata.c    |  68 +++++------
>  fs/verity/verify.c           | 216 +++++++++++++++++++++++++----------
>  include/linux/fsverity.h     | 112 +++++++++++++++---
>  8 files changed, 366 insertions(+), 157 deletions(-)
> 
> diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
> index 4042dd6437ae..c4ecae418669 100644
> --- a/fs/btrfs/verity.c
> +++ b/fs/btrfs/verity.c
> @@ -699,33 +699,28 @@ int btrfs_get_verity_descriptor(struct inode *inode, void *buf, size_t buf_size)
>  }
>  
>  /*
>   * fsverity op that reads and caches a merkle tree page.
>   *
> - * @inode:         inode to read a merkle tree page for
> - * @index:         page index relative to the start of the merkle tree
> - * @num_ra_pages:  number of pages to readahead. Optional, we ignore it
> - *
>   * The Merkle tree is stored in the filesystem btree, but its pages are cached
>   * with a logical position past EOF in the inode's mapping.
> - *
> - * Returns the page we read, or an ERR_PTR on error.
>   */
> -static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
> -						pgoff_t index,
> -						unsigned long num_ra_pages)
> +static int btrfs_read_merkle_tree_block(const struct fsverity_readmerkle *req,
> +					struct fsverity_blockbuf *block)
>  {
> +	struct inode *inode = req->inode;
>  	struct folio *folio;
> -	u64 off = (u64)index << PAGE_SHIFT;
> +	u64 off = req->pos;
>  	loff_t merkle_pos = merkle_file_pos(inode);
> +	pgoff_t index;
>  	int ret;
>  
>  	if (merkle_pos < 0)
> -		return ERR_PTR(merkle_pos);
> +		return merkle_pos;
>  	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
> -		return ERR_PTR(-EFBIG);
> -	index += merkle_pos >> PAGE_SHIFT;
> +		return -EFBIG;
> +	index = (merkle_pos + off) >> PAGE_SHIFT;
>  again:
>  	folio = __filemap_get_folio(inode->i_mapping, index, FGP_ACCESSED, 0);
>  	if (!IS_ERR(folio)) {
>  		if (folio_test_uptodate(folio))
>  			goto out;
> @@ -733,28 +728,28 @@ static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
>  		folio_lock(folio);
>  		/* If it's not uptodate after we have the lock, we got a read error. */
>  		if (!folio_test_uptodate(folio)) {
>  			folio_unlock(folio);
>  			folio_put(folio);
> -			return ERR_PTR(-EIO);
> +			return -EIO;
>  		}
>  		folio_unlock(folio);
>  		goto out;
>  	}
>  
>  	folio = filemap_alloc_folio(mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS),
>  				    0);
>  	if (!folio)
> -		return ERR_PTR(-ENOMEM);
> +		return -ENOMEM;
>  
>  	ret = filemap_add_folio(inode->i_mapping, folio, index, GFP_NOFS);
>  	if (ret) {
>  		folio_put(folio);
>  		/* Did someone else insert a folio here? */
>  		if (ret == -EEXIST)
>  			goto again;
> -		return ERR_PTR(ret);
> +		return ret;
>  	}
>  
>  	/*
>  	 * Merkle item keys are indexed from byte 0 in the merkle tree.
>  	 * They have the form:
> @@ -763,20 +758,21 @@ static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
>  	 */
>  	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY, off,
>  			     folio_address(folio), PAGE_SIZE, &folio->page);
>  	if (ret < 0) {
>  		folio_put(folio);
> -		return ERR_PTR(ret);
> +		return ret;
>  	}
>  	if (ret < PAGE_SIZE)
>  		folio_zero_segment(folio, ret, PAGE_SIZE);
>  
>  	folio_mark_uptodate(folio);
>  	folio_unlock(folio);
>  
>  out:
> -	return folio_file_page(folio, index);
> +	fsverity_set_block_page(req, block, folio_file_page(folio, index));
> +	return 0;
>  }
>  
>  /*
>   * fsverity op that writes a Merkle tree block into the btree.
>   *
> @@ -800,11 +796,13 @@ static int btrfs_write_merkle_tree_block(struct inode *inode, const void *buf,
>  	return write_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY,
>  			       pos, buf, size);
>  }
>  
>  const struct fsverity_operations btrfs_verityops = {
> +	.uses_page_based_merkle_caching = 1,
>  	.begin_enable_verity     = btrfs_begin_enable_verity,
>  	.end_enable_verity       = btrfs_end_enable_verity,
>  	.get_verity_descriptor   = btrfs_get_verity_descriptor,
> -	.read_merkle_tree_page   = btrfs_read_merkle_tree_page,
> +	.read_merkle_tree_block  = btrfs_read_merkle_tree_block,
> +	.drop_merkle_tree_block  = fsverity_drop_page_merkle_tree_block,
>  	.write_merkle_tree_block = btrfs_write_merkle_tree_block,
>  };
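
The pagecache conversions above look mechanical enough.  Just to double-check
my reading of the new interface for the custom-cache case (i.e. what xfs would
end up doing): I'd expect an implementation roughly along these lines, where
struct myfs_merkle_buf, myfs_merkle_read() and myfs_merkle_put() are made up
purely for illustration and only the fsverity-facing types come from this
patch:

static int myfs_read_merkle_tree_block(const struct fsverity_readmerkle *req,
				       struct fsverity_blockbuf *block)
{
	struct myfs_merkle_buf *buf;

	/* Look up (or read in) req->size bytes at req->pos in the tree. */
	buf = myfs_merkle_read(req->inode, req->pos, req->size, req->ra_bytes);
	if (IS_ERR(buf))
		return PTR_ERR(buf);

	block->context = buf;
	block->kaddr = buf->data;
	/* Set only if the buffer stayed cached since it was last verified. */
	block->verified = buf->verified;
	return 0;
}

static void myfs_drop_merkle_tree_block(struct inode *inode,
					struct fsverity_blockbuf *block)
{
	struct myfs_merkle_buf *buf = block->context;

	/* Remember the verification result for as long as the buffer lives. */
	if (block->newly_verified)
		buf->verified = true;
	myfs_merkle_put(buf);
}
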
> diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
> index 2f37e1ea3955..5a3a3991d661 100644
> --- a/fs/ext4/verity.c
> +++ b/fs/ext4/verity.c
> @@ -355,31 +355,33 @@ static int ext4_get_verity_descriptor(struct inode *inode, void *buf,
>  			return err;
>  	}
>  	return desc_size;
>  }
>  
> -static struct page *ext4_read_merkle_tree_page(struct inode *inode,
> -					       pgoff_t index,
> -					       unsigned long num_ra_pages)
> +static int ext4_read_merkle_tree_block(const struct fsverity_readmerkle *req,
> +				       struct fsverity_blockbuf *block)
>  {
> +	struct inode *inode = req->inode;
> +	pgoff_t index = (req->pos +
> +			 ext4_verity_metadata_pos(inode)) >> PAGE_SHIFT;
> +	unsigned long num_ra_pages = req->ra_bytes >> PAGE_SHIFT;
>  	struct folio *folio;
>  
> -	index += ext4_verity_metadata_pos(inode) >> PAGE_SHIFT;
> -
>  	folio = __filemap_get_folio(inode->i_mapping, index, FGP_ACCESSED, 0);
>  	if (IS_ERR(folio) || !folio_test_uptodate(folio)) {
>  		DEFINE_READAHEAD(ractl, NULL, NULL, inode->i_mapping, index);
>  
>  		if (!IS_ERR(folio))
>  			folio_put(folio);
>  		else if (num_ra_pages > 1)
>  			page_cache_ra_unbounded(&ractl, num_ra_pages, 0);
>  		folio = read_mapping_folio(inode->i_mapping, index, NULL);
>  		if (IS_ERR(folio))
> -			return ERR_CAST(folio);
> +			return PTR_ERR(folio);
>  	}
> -	return folio_file_page(folio, index);
> +	fsverity_set_block_page(req, block, folio_file_page(folio, index));
> +	return 0;
>  }
>  
>  static int ext4_write_merkle_tree_block(struct inode *inode, const void *buf,
>  					u64 pos, unsigned int size)
>  {
> @@ -387,11 +389,13 @@ static int ext4_write_merkle_tree_block(struct inode *inode, const void *buf,
>  
>  	return pagecache_write(inode, buf, size, pos);
>  }
>  
>  const struct fsverity_operations ext4_verityops = {
> +	.uses_page_based_merkle_caching = 1,
>  	.begin_enable_verity	= ext4_begin_enable_verity,
>  	.end_enable_verity	= ext4_end_enable_verity,
>  	.get_verity_descriptor	= ext4_get_verity_descriptor,
> -	.read_merkle_tree_page	= ext4_read_merkle_tree_page,
> +	.read_merkle_tree_block	= ext4_read_merkle_tree_block,
> +	.drop_merkle_tree_block	= fsverity_drop_page_merkle_tree_block,
>  	.write_merkle_tree_block = ext4_write_merkle_tree_block,
>  };
> diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
> index f7bb0c54502c..859ab2d8d734 100644
> --- a/fs/f2fs/verity.c
> +++ b/fs/f2fs/verity.c
> @@ -252,31 +252,33 @@ static int f2fs_get_verity_descriptor(struct inode *inode, void *buf,
>  			return res;
>  	}
>  	return size;
>  }
>  
> -static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
> -					       pgoff_t index,
> -					       unsigned long num_ra_pages)
> +static int f2fs_read_merkle_tree_block(const struct fsverity_readmerkle *req,
> +				       struct fsverity_blockbuf *block)
>  {
> +	struct inode *inode = req->inode;
> +	pgoff_t index = (req->pos +
> +			 f2fs_verity_metadata_pos(inode)) >> PAGE_SHIFT;
> +	unsigned long num_ra_pages = req->ra_bytes >> PAGE_SHIFT;
>  	struct folio *folio;
>  
> -	index += f2fs_verity_metadata_pos(inode) >> PAGE_SHIFT;
> -
>  	folio = __filemap_get_folio(inode->i_mapping, index, FGP_ACCESSED, 0);
>  	if (IS_ERR(folio) || !folio_test_uptodate(folio)) {
>  		DEFINE_READAHEAD(ractl, NULL, NULL, inode->i_mapping, index);
>  
>  		if (!IS_ERR(folio))
>  			folio_put(folio);
>  		else if (num_ra_pages > 1)
>  			page_cache_ra_unbounded(&ractl, num_ra_pages, 0);
>  		folio = read_mapping_folio(inode->i_mapping, index, NULL);
>  		if (IS_ERR(folio))
> -			return ERR_CAST(folio);
> +			return PTR_ERR(folio);
>  	}
> -	return folio_file_page(folio, index);
> +	fsverity_set_block_page(req, block, folio_file_page(folio, index));
> +	return 0;
>  }
>  
>  static int f2fs_write_merkle_tree_block(struct inode *inode, const void *buf,
>  					u64 pos, unsigned int size)
>  {
> @@ -284,11 +286,13 @@ static int f2fs_write_merkle_tree_block(struct inode *inode, const void *buf,
>  
>  	return pagecache_write(inode, buf, size, pos);
>  }
>  
>  const struct fsverity_operations f2fs_verityops = {
> +	.uses_page_based_merkle_caching = 1,
>  	.begin_enable_verity	= f2fs_begin_enable_verity,
>  	.end_enable_verity	= f2fs_end_enable_verity,
>  	.get_verity_descriptor	= f2fs_get_verity_descriptor,
> -	.read_merkle_tree_page	= f2fs_read_merkle_tree_page,
> +	.read_merkle_tree_block	= f2fs_read_merkle_tree_block,
> +	.drop_merkle_tree_block	= fsverity_drop_page_merkle_tree_block,
>  	.write_merkle_tree_block = f2fs_write_merkle_tree_block,
>  };
> diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> index b3506f56e180..da8ba0d626d6 100644
> --- a/fs/verity/fsverity_private.h
> +++ b/fs/verity/fsverity_private.h
> @@ -45,10 +45,13 @@ struct merkle_tree_params {
>  	u8 log_blocks_per_page;		/* log2(blocks_per_page) */
>  	unsigned int num_levels;	/* number of levels in Merkle tree */
>  	u64 tree_size;			/* Merkle tree size in bytes */
>  	unsigned long tree_pages;	/* Merkle tree size in pages */
>  
> +	/* The hash of a merkle block-sized buffer of zeroes */
> +	u8 zero_digest[FS_VERITY_MAX_DIGEST_SIZE];
> +
>  	/*
>  	 * Starting block index for each tree level, ordered from leaf level (0)
>  	 * to root level ('num_levels - 1')
>  	 */
>  	unsigned long level_start[FS_VERITY_MAX_LEVELS];
> @@ -59,11 +62,11 @@ struct merkle_tree_params {
>   *
>   * When a verity file is first opened, an instance of this struct is allocated
>   * and stored in ->i_verity_info; it remains until the inode is evicted.  It
>   * caches information about the Merkle tree that's needed to efficiently verify
>   * data read from the file.  It also caches the file digest.  The Merkle tree
> - * pages themselves are not cached here, but the filesystem may cache them.
> + * blocks themselves are not cached here, but the filesystem may cache them.
>   */
>  struct fsverity_info {
>  	struct merkle_tree_params tree_params;
>  	u8 root_hash[FS_VERITY_MAX_DIGEST_SIZE];
>  	u8 file_digest[FS_VERITY_MAX_DIGEST_SIZE];
> @@ -150,8 +153,16 @@ static inline void fsverity_init_signature(void)
>  }
>  #endif /* !CONFIG_FS_VERITY_BUILTIN_SIGNATURES */
>  
>  /* verify.c */
>  
> +int fsverity_read_merkle_tree_block(struct inode *inode,
> +				    const struct merkle_tree_params *params,
> +				    int level, u64 pos, unsigned long ra_bytes,
> +				    struct fsverity_blockbuf *block);
> +
> +void fsverity_drop_merkle_tree_block(struct inode *inode,
> +				     struct fsverity_blockbuf *block);
> +
>  void __init fsverity_init_workqueue(void);
>  
>  #endif /* _FSVERITY_PRIVATE_H */
> diff --git a/fs/verity/open.c b/fs/verity/open.c
> index fdeb95eca3af..daa37007adfd 100644
> --- a/fs/verity/open.c
> +++ b/fs/verity/open.c
> @@ -10,10 +10,22 @@
>  #include <linux/mm.h>
>  #include <linux/slab.h>
>  
>  static struct kmem_cache *fsverity_info_cachep;
>  
> +/*
> + * If the filesystem caches Merkle tree blocks in the pagecache, and the Merkle
> + * tree block size differs from the page size, then a bitmap is needed to keep
> + * track of which hash blocks have been verified.
> + */
> +static bool needs_bitmap(const struct inode *inode,
> +			 const struct merkle_tree_params *params)
> +{
> +	return inode->i_sb->s_vop->uses_page_based_merkle_caching &&
> +		params->block_size != PAGE_SIZE;
> +}
> +
>  /**
>   * fsverity_init_merkle_tree_params() - initialize Merkle tree parameters
>   * @params: the parameters struct to initialize
>   * @inode: the inode for which the Merkle tree is being built
>   * @hash_algorithm: number of hash algorithm to use
> @@ -124,28 +136,36 @@ int fsverity_init_merkle_tree_params(struct merkle_tree_params *params,
>  		params->level_start[level] = offset;
>  		offset += blocks_in_level[level];
>  	}
>  
>  	/*
> -	 * With block_size != PAGE_SIZE, an in-memory bitmap will need to be
> -	 * allocated to track the "verified" status of hash blocks.  Don't allow
> -	 * this bitmap to get too large.  For now, limit it to 1 MiB, which
> -	 * limits the file size to about 4.4 TB with SHA-256 and 4K blocks.
> +	 * If an in-memory bitmap will need to be allocated to track the
> +	 * "verified" status of hash blocks, don't allow this bitmap to get too
> +	 * large.  For now, limit it to 1 MiB, which limits the file size to
> +	 * about 4.4 TB with SHA-256 and 4K blocks.
>  	 *
>  	 * Together with the fact that the data, and thus also the Merkle tree,
>  	 * cannot have more than ULONG_MAX pages, this implies that hash block
>  	 * indices can always fit in an 'unsigned long'.  But to be safe, we
>  	 * explicitly check for that too.  Note, this is only for hash block
>  	 * indices; data block indices might not fit in an 'unsigned long'.
>  	 */
> -	if ((params->block_size != PAGE_SIZE && offset > 1 << 23) ||
> +	if ((needs_bitmap(inode, params) && offset > 1 << 23) ||
>  	    offset > ULONG_MAX) {
>  		fsverity_err(inode, "Too many blocks in Merkle tree");
>  		err = -EFBIG;
>  		goto out_err;
>  	}
>  
> +	/* Calculate the digest of the all-zeroes block. */
> +	err = fsverity_hash_block(params, inode, page_address(ZERO_PAGE(0)),
> +				  params->zero_digest);
> +	if (err) {
> +		fsverity_err(inode, "Error %d computing zero digest", err);
> +		goto out_err;
> +	}
> +
>  	params->tree_size = offset << log_blocksize;
>  	params->tree_pages = PAGE_ALIGN(params->tree_size) >> PAGE_SHIFT;
>  	return 0;
>  
>  out_err:
> @@ -211,16 +231,14 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
>  	err = fsverity_verify_signature(vi, desc->signature,
>  					le32_to_cpu(desc->sig_size));
>  	if (err)
>  		goto fail;
>  
> -	if (vi->tree_params.block_size != PAGE_SIZE) {
> +	if (needs_bitmap(inode, &vi->tree_params)) {
>  		/*
> -		 * When the Merkle tree block size and page size differ, we use
> -		 * a bitmap to keep track of which hash blocks have been
> -		 * verified.  This bitmap must contain one bit per hash block,
> -		 * including alignment to a page boundary at the end.
> +		 * The bitmap must contain one bit per hash block, including
> +		 * alignment to a page boundary at the end.
>  		 *
>  		 * Eventually, to support extremely large files in an efficient
>  		 * way, it might be necessary to make pages of this bitmap
>  		 * reclaimable.  But for now, simply allocating the whole bitmap
>  		 * is a simple solution that works well on the files on which
> diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> index f58432772d9e..61f419df1ea1 100644
> --- a/fs/verity/read_metadata.c
> +++ b/fs/verity/read_metadata.c
> @@ -12,69 +12,59 @@
>  #include <linux/sched/signal.h>
>  #include <linux/uaccess.h>
>  
>  static int fsverity_read_merkle_tree(struct inode *inode,
>  				     const struct fsverity_info *vi,
> -				     void __user *buf, u64 offset, int length)
> +				     void __user *buf, u64 pos, int length)
>  {
> -	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> -	u64 end_offset;
> -	unsigned int offs_in_page;
> -	pgoff_t index, last_index;
> +	const struct merkle_tree_params *params = &vi->tree_params;
> +	const u64 end_pos = min(pos + length, params->tree_size);
> +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> +	const unsigned long max_ra_bytes =
> +		min_t(u64, (u64)bdi->io_pages << PAGE_SHIFT, ULONG_MAX);
> +	unsigned int offs_in_block = pos & (params->block_size - 1);
>  	int retval = 0;
>  	int err = 0;
>  
> -	end_offset = min(offset + length, vi->tree_params.tree_size);
> -	if (offset >= end_offset)
> -		return 0;
> -	offs_in_page = offset_in_page(offset);
> -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> -
>  	/*
> -	 * Iterate through each Merkle tree page in the requested range and copy
> -	 * the requested portion to userspace.  Note that the Merkle tree block
> -	 * size isn't important here, as we are returning a byte stream; i.e.,
> -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> +	 * Iterate through each Merkle tree block in the requested range and
> +	 * copy the requested portion to userspace.
>  	 */
> -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> -		unsigned long num_ra_pages =
> -			min_t(unsigned long, last_index - index + 1,
> -			      inode->i_sb->s_bdi->io_pages);
> -		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> -						   PAGE_SIZE - offs_in_page);
> -		struct page *page;
> -		const void *virt;
> -
> -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> -		if (IS_ERR(page)) {
> -			err = PTR_ERR(page);
> -			fsverity_err(inode,
> -				     "Error %d reading Merkle tree page %lu",
> -				     err, index);
> +	while (pos < end_pos) {
> +		unsigned long ra_bytes;
> +		unsigned int bytes_to_copy;
> +		struct fsverity_blockbuf block;
> +
> +		ra_bytes = min_t(u64, end_pos - pos, max_ra_bytes);
> +		bytes_to_copy = min_t(u64, end_pos - pos,
> +				      params->block_size - offs_in_block);
> +
> +		err = fsverity_read_merkle_tree_block(inode, params,
> +						      FSVERITY_STREAMING_READ,
> +						      pos - offs_in_block,
> +						      ra_bytes, &block);
> +		if (err)
>  			break;
> -		}
>  
> -		virt = kmap_local_page(page);
> -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> -			kunmap_local(virt);
> -			put_page(page);
> +		if (copy_to_user(buf, block.kaddr + offs_in_block,
> +				 bytes_to_copy)) {
> +			fsverity_drop_merkle_tree_block(inode, &block);
>  			err = -EFAULT;
>  			break;
>  		}
> -		kunmap_local(virt);
> -		put_page(page);
> +		fsverity_drop_merkle_tree_block(inode, &block);
>  
>  		retval += bytes_to_copy;
>  		buf += bytes_to_copy;
> -		offset += bytes_to_copy;
> +		pos += bytes_to_copy;
>  
>  		if (fatal_signal_pending(current))  {
>  			err = -EINTR;
>  			break;
>  		}
>  		cond_resched();
> -		offs_in_page = 0;
> +		offs_in_block = 0;
>  	}
>  	return retval ? retval : err;
>  }
>  
>  /* Copy the requested portion of the buffer to userspace. */
> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index 4fcad0825a12..aa6f5ca719b3 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -76,10 +76,131 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
>  	smp_wmb();
>  	SetPageChecked(hpage);
>  	return false;
>  }
>  
> +/**
> + * fsverity_set_block_page() - fill in a fsverity_blockbuf using a page
> + * @req: The Merkle tree block read request
> + * @block: The fsverity_blockbuf to initialize
> + * @page: The page containing the block's data at offset @req->pos % PAGE_SIZE.
> + *
> + * This is a helper function for filesystems that cache Merkle tree blocks in
> + * the pagecache.  It should be called at the end of
> + * fsverity_operations::read_merkle_tree_block().  It takes ownership of a ref
> + * to the page, maps the page, and uses the PG_checked flag and (if needed) the
> + * fsverity_info::hash_block_verified bitmap to check whether the block has been
> + * verified or not.  It initializes the fsverity_blockbuf accordingly.
> + *
> + * This must be paired with fsverity_drop_page_merkle_tree_block(), called from
> + * fsverity_operations::drop_merkle_tree_block().
> + */
> +void fsverity_set_block_page(const struct fsverity_readmerkle *req,
> +			     struct fsverity_blockbuf *block,
> +			     struct page *page)
> +{
> +	struct fsverity_info *vi = req->inode->i_verity_info;
> +
> +	block->kaddr = kmap_local_page(page) + (req->pos & ~PAGE_MASK);
> +	block->context = page;
> +	block->verified = is_hash_block_verified(vi, page, block->index);
> +}
> +EXPORT_SYMBOL_GPL(fsverity_set_block_page);
> +
> +/**
> + * fsverity_drop_page_merkle_tree_block() - drop a Merkle tree block for
> + *					    filesystems using page-based caching
> + * @inode: The inode to which the Merkle tree belongs
> + * @block: The fsverity_blockbuf to drop
> + *
> + * This pairs with fsverity_set_block_page().  It marks the block as verified if
> + * needed, and then it unmaps and puts the page.  Filesystems that use
> + * fsverity_set_block_page() need to set ->drop_merkle_tree_block to this.
> + */
> +void fsverity_drop_page_merkle_tree_block(struct inode *inode,
> +					  struct fsverity_blockbuf *block)
> +{
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	struct page *page = block->context;
> +
> +	if (block->newly_verified) {
> +		/*
> +		 * This must be atomic and idempotent, as the same hash block
> +		 * might be verified by multiple threads concurrently.
> +		 */
> +		if (vi->hash_block_verified != NULL)
> +			set_bit(block->index, vi->hash_block_verified);
> +		else
> +			SetPageChecked(page);
> +	}
> +	unmap_and_put_page(page, block->kaddr);
> +}
> +EXPORT_SYMBOL_GPL(fsverity_drop_page_merkle_tree_block);
> +
> +/**
> + * fsverity_read_merkle_tree_block() - read a Merkle tree block
> + * @inode: inode to which the Merkle tree belongs
> + * @params: inode's Merkle tree parameters
> + * @level: level of the block, or FSVERITY_STREAMING_READ to indicate a
> + *	   streaming read.  Level 0 means the leaf level.
> + * @pos: position of the block in the Merkle tree, in bytes
> + * @ra_bytes: on cache miss, try to read ahead this many bytes
> + * @block: struct in which the block is returned
> + *
> + * This function reads a block from a file's Merkle tree.  It must be paired
> + * with fsverity_drop_merkle_tree_block().
> + *
> + * Return: 0 on success, -errno on failure
> + */
> +int fsverity_read_merkle_tree_block(struct inode *inode,
> +				    const struct merkle_tree_params *params,
> +				    int level, u64 pos, unsigned long ra_bytes,
> +				    struct fsverity_blockbuf *block)
> +{
> +	struct fsverity_readmerkle req = {
> +		.inode = inode,
> +		.pos = pos,
> +		.size = params->block_size,
> +		.digest_size = params->digest_size,
> +		.level = level,
> +		.num_levels = params->num_levels,
> +		.ra_bytes = ra_bytes,
> +		.zero_digest = params->zero_digest,
> +	};
> +	int err;
> +
> +	memset(block, 0, sizeof(*block));
> +	block->index = pos >> params->log_blocksize;
> +
> +	err = inode->i_sb->s_vop->read_merkle_tree_block(&req, block);
> +	if (err)
> +		fsverity_err(inode, "Error %d reading Merkle tree block %lu",
> +			     err, block->index);
> +	block->newly_verified = false;
> +	return err;
> +}
> +
> +/**
> + * fsverity_drop_merkle_tree_block() - drop a Merkle tree block buffer
> + * @inode: inode to which the Merkle tree belongs
> + * @block: block buffer to be dropped
> + *
> + * This releases the resources that were acquired by
> + * fsverity_read_merkle_tree_block().  If the block is newly verified, it also
> + * saves a record of that in the appropriate location.  If a process nests the
> + * reads of multiple blocks, they must be dropped in reverse order; this is
> + * needed to accommodate the use of local kmaps to map the blocks' contents.
> + */
> +void fsverity_drop_merkle_tree_block(struct inode *inode,
> +				     struct fsverity_blockbuf *block)
> +{
> +	inode->i_sb->s_vop->drop_merkle_tree_block(inode, block);
> +
> +	block->context = NULL;
> +	block->kaddr = NULL;
> +}
> +
>  /*
>   * Verify a single data block against the file's Merkle tree.
>   *
>   * In principle, we need to verify the entire path to the root node.  However,
>   * for efficiency the filesystem may cache the hash blocks.  Therefore we need
> @@ -88,27 +209,24 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
>   *
>   * Return: %true if the data block is valid, else %false.
>   */
>  static bool
>  verify_data_block(struct inode *inode, struct fsverity_info *vi,
> -		  const void *data, u64 data_pos, unsigned long max_ra_pages)
> +		  const void *data, u64 data_pos, unsigned long max_ra_bytes)
>  {
>  	const struct merkle_tree_params *params = &vi->tree_params;
>  	const unsigned int hsize = params->digest_size;
>  	int level;
> +	unsigned long ra_bytes;
>  	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
>  	const u8 *want_hash;
>  	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
>  	/* The hash blocks that are traversed, indexed by level */
>  	struct {
> -		/* Page containing the hash block */
> -		struct page *page;
> -		/* Mapped address of the hash block (will be within @page) */
> -		const void *addr;
> -		/* Index of the hash block in the tree overall */
> -		unsigned long index;
> -		/* Byte offset of the wanted hash relative to @addr */
> +		/* Buffer containing the hash block */
> +		struct fsverity_blockbuf block;
> +		/* Byte offset of the wanted hash in the block */
>  		unsigned int hoffset;
>  	} hblocks[FS_VERITY_MAX_LEVELS];
>  	/*
>  	 * The index of the previous level's block within that level; also the
>  	 * index of that block's hash within the current level.
> @@ -141,86 +259,67 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  	 * until we reach the root.
>  	 */
>  	for (level = 0; level < params->num_levels; level++) {
>  		unsigned long next_hidx;
>  		unsigned long hblock_idx;
> -		pgoff_t hpage_idx;
> -		unsigned int hblock_offset_in_page;
> +		u64 hblock_pos;
>  		unsigned int hoffset;
> -		struct page *hpage;
> -		const void *haddr;
> +		struct fsverity_blockbuf *block = &hblocks[level].block;
>  
>  		/*
>  		 * The index of the block in the current level; also the index
>  		 * of that block's hash within the next level.
>  		 */
>  		next_hidx = hidx >> params->log_arity;
>  
>  		/* Index of the hash block in the tree overall */
>  		hblock_idx = params->level_start[level] + next_hidx;
>  
> -		/* Index of the hash page in the tree overall */
> -		hpage_idx = hblock_idx >> params->log_blocks_per_page;
> -
> -		/* Byte offset of the hash block within the page */
> -		hblock_offset_in_page =
> -			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
> +		/* Byte offset of the hash block in the tree overall */
> +		hblock_pos = (u64)hblock_idx << params->log_blocksize;
>  
>  		/* Byte offset of the hash within the block */
>  		hoffset = (hidx << params->log_digestsize) &
>  			  (params->block_size - 1);
>  
> -		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
> -				hpage_idx, level == 0 ? min(max_ra_pages,
> -					params->tree_pages - hpage_idx) : 0);
> -		if (IS_ERR(hpage)) {
> -			fsverity_err(inode,
> -				     "Error %ld reading Merkle tree page %lu",
> -				     PTR_ERR(hpage), hpage_idx);
> +		if (level == 0)
> +			ra_bytes = min_t(u64, max_ra_bytes,
> +					 params->tree_size - hblock_pos);
> +		else
> +			ra_bytes = 0;
> +
> +		if (fsverity_read_merkle_tree_block(inode, params, level,
> +						    hblock_pos, ra_bytes,
> +						    block) != 0)
>  			goto error;
> -		}
> -		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
> -		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
> -			memcpy(_want_hash, haddr + hoffset, hsize);
> +
> +		if (block->verified) {
> +			memcpy(_want_hash, block->kaddr + hoffset, hsize);
>  			want_hash = _want_hash;
> -			kunmap_local(haddr);
> -			put_page(hpage);
> +			fsverity_drop_merkle_tree_block(inode, block);
>  			goto descend;
>  		}
> -		hblocks[level].page = hpage;
> -		hblocks[level].addr = haddr;
> -		hblocks[level].index = hblock_idx;
>  		hblocks[level].hoffset = hoffset;
>  		hidx = next_hidx;
>  	}
>  
>  	want_hash = vi->root_hash;
>  descend:
>  	/* Descend the tree verifying hash blocks. */
>  	for (; level > 0; level--) {
> -		struct page *hpage = hblocks[level - 1].page;
> -		const void *haddr = hblocks[level - 1].addr;
> -		unsigned long hblock_idx = hblocks[level - 1].index;
> +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> +		const void *haddr = block->kaddr;
>  		unsigned int hoffset = hblocks[level - 1].hoffset;
>  
>  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
>  			goto error;
>  		if (memcmp(want_hash, real_hash, hsize) != 0)
>  			goto corrupted;
> -		/*
> -		 * Mark the hash block as verified.  This must be atomic and
> -		 * idempotent, as the same hash block might be verified by
> -		 * multiple threads concurrently.
> -		 */
> -		if (vi->hash_block_verified)
> -			set_bit(hblock_idx, vi->hash_block_verified);
> -		else
> -			SetPageChecked(hpage);
> +		block->newly_verified = true;
>  		memcpy(_want_hash, haddr + hoffset, hsize);
>  		want_hash = _want_hash;
> -		kunmap_local(haddr);
> -		put_page(hpage);
> +		fsverity_drop_merkle_tree_block(inode, block);
>  	}
>  
>  	/* Finally, verify the data block. */
>  	if (fsverity_hash_block(params, inode, data, real_hash) != 0)
>  		goto error;
> @@ -233,20 +332,18 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		     "FILE CORRUPTED! pos=%llu, level=%d, want_hash=%s:%*phN, real_hash=%s:%*phN",
>  		     data_pos, level - 1,
>  		     params->hash_alg->name, hsize, want_hash,
>  		     params->hash_alg->name, hsize, real_hash);
>  error:
> -	for (; level > 0; level--) {
> -		kunmap_local(hblocks[level - 1].addr);
> -		put_page(hblocks[level - 1].page);
> -	}
> +	for (; level > 0; level--)
> +		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);
>  	return false;
>  }
>  
>  static bool
>  verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
> -		   unsigned long max_ra_pages)
> +		   unsigned long max_ra_bytes)
>  {
>  	struct inode *inode = data_folio->mapping->host;
>  	struct fsverity_info *vi = inode->i_verity_info;
>  	const unsigned int block_size = vi->tree_params.block_size;
>  	u64 pos = (u64)data_folio->index << PAGE_SHIFT;
> @@ -260,11 +357,11 @@ verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
>  		void *data;
>  		bool valid;
>  
>  		data = kmap_local_folio(data_folio, offset);
>  		valid = verify_data_block(inode, vi, data, pos + offset,
> -					  max_ra_pages);
> +					  max_ra_bytes);
>  		kunmap_local(data);
>  		if (!valid)
>  			return false;
>  		offset += block_size;
>  		len -= block_size;
> @@ -306,28 +403,29 @@ EXPORT_SYMBOL_GPL(fsverity_verify_blocks);
>   * All filesystems must also call fsverity_verify_page() on holes.
>   */
>  void fsverity_verify_bio(struct bio *bio)
>  {
>  	struct folio_iter fi;
> -	unsigned long max_ra_pages = 0;
> +	unsigned long max_ra_bytes = 0;
>  
>  	if (bio->bi_opf & REQ_RAHEAD) {
>  		/*
>  		 * If this bio is for data readahead, then we also do readahead
>  		 * of the first (largest) level of the Merkle tree.  Namely,
> -		 * when a Merkle tree page is read, we also try to piggy-back on
> -		 * some additional pages -- up to 1/4 the number of data pages.
> +		 * when there is a cache miss for a Merkle tree block, we try to
> +		 * piggy-back some additional blocks onto the read, with size up
> +		 * to 1/4 the size of the data being read.
>  		 *
>  		 * This improves sequential read performance, as it greatly
>  		 * reduces the number of I/O requests made to the Merkle tree.
>  		 */
> -		max_ra_pages = bio->bi_iter.bi_size >> (PAGE_SHIFT + 2);
> +		max_ra_bytes = bio->bi_iter.bi_size >> 2;
>  	}
>  
>  	bio_for_each_folio_all(fi, bio) {
>  		if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
> -					max_ra_pages)) {
> +					max_ra_bytes)) {
>  			bio->bi_status = BLK_STS_IOERR;
>  			break;
>  		}
>  	}
>  }
> diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
> index 1eb7eae580be..2b9137061379 100644
> --- a/include/linux/fsverity.h
> +++ b/include/linux/fsverity.h
> @@ -24,13 +24,77 @@
>  #define FS_VERITY_MAX_DIGEST_SIZE	SHA512_DIGEST_SIZE
>  
>  /* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
>  #define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
>  
> +/**
> + * struct fsverity_blockbuf - Merkle tree block buffer
> + * @context: filesystem private context
> + * @kaddr: virtual address of the block's data
> + * @index: index of the block in the Merkle tree
> + * @verified: was this block already verified when it was requested?
> + * @newly_verified: was verification of this block just done?
> + *
> + * This struct describes a buffer containing a Merkle tree block.  When a Merkle
> + * tree block needs to be read, this struct is passed to the filesystem's
> + * ->read_merkle_tree_block function, with just the @index field set.  The
> + * filesystem sets @kaddr, and optionally @context and @verified.  Filesystems
> + * must set @verified only if the filesystem was previously told that the same
> + * block was verified (via ->drop_merkle_tree_block() seeing @newly_verified)
> + * and the block wasn't evicted from cache in the intervening time.
> + *
> + * To release the resources acquired by a read, this struct is passed to
> + * ->drop_merkle_tree_block, with @newly_verified set if verification of the
> + * block was just done.
> + */
> +struct fsverity_blockbuf {
> +	void *context;
> +	void *kaddr;
> +	unsigned long index;
> +	unsigned int verified : 1;
> +	unsigned int newly_verified : 1;
> +};
> +
> +/**
> + * struct fsverity_readmerkle - Request to read a Merkle tree block
> + * @inode: inode to which the Merkle tree belongs
> + * @pos: position of the block in the Merkle tree, in bytes
> + * @size: size of the Merkle tree block, in bytes
> + * @digest_size: size of zero_digest, in bytes
> + * @level: level of the block, or FSVERITY_STREAMING_READ to indicate a
> + *	   streaming read.  Level 0 means the leaf level.
> + * @num_levels: number of levels in the tree total
> + * @ra_bytes: number of bytes that should be prefetched starting at @pos if the
> + *	      block isn't already cached.  Implementations may ignore this
> + *	      argument; it's only a performance optimization.
> + * @zero_digest: hash of a merkle block-sized buffer of zeroes
> + */
> +struct fsverity_readmerkle {
> +	struct inode *inode;
> +	u64 pos;
> +	unsigned int size;
> +	unsigned int digest_size;
> +	int level;
> +	int num_levels;
> +	unsigned long ra_bytes;
> +	const u8 *zero_digest;
> +};
> +
> +#define FSVERITY_STREAMING_READ	(-1)
> +
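
Reading these two structs together with the verify.c changes earlier in the
patch, the verified/newly_verified handshake the core expects seems to boil
down to the following (a condensed, single-level restatement of
verify_data_block() with error reporting elided; verify_one_hash_block() is
my name, not something in the patch):

static bool verify_one_hash_block(struct inode *inode,
				  const struct merkle_tree_params *params,
				  u64 hblock_pos, const u8 *want_hash)
{
	struct fsverity_blockbuf block;
	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
	bool ok;

	if (fsverity_read_merkle_tree_block(inode, params, 0, hblock_pos, 0,
					    &block) != 0)
		return false;

	if (block.verified) {
		/*
		 * The fs says this block was verified earlier and never left
		 * its cache, so its hashes can be trusted as-is.
		 */
		fsverity_drop_merkle_tree_block(inode, &block);
		return true;
	}

	ok = fsverity_hash_block(params, inode, block.kaddr, real_hash) == 0 &&
	     memcmp(want_hash, real_hash, params->digest_size) == 0;
	/* Tell the fs to record the result when the buffer is released. */
	block.newly_verified = ok;
	fsverity_drop_merkle_tree_block(inode, &block);
	return ok;
}
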
>  /* Verity operations for filesystems */
>  struct fsverity_operations {
>  
> +	/**
> +	 * This must be set if the filesystem chooses to cache Merkle tree
> +	 * blocks in the pagecache, i.e. if it uses fsverity_set_block_page()
> +	 * and fsverity_drop_page_merkle_tree_block().  It causes the allocation
> +	 * of the bitmap needed by those helper functions when the Merkle tree
> +	 * block size is less than the page size.
> +	 */
> +	unsigned int uses_page_based_merkle_caching : 1;
> +
>  	/**
>  	 * Begin enabling verity on the given file.
>  	 *
>  	 * @filp: a readonly file descriptor for the file
>  	 *
> @@ -83,29 +147,46 @@ struct fsverity_operations {
>  	 */
>  	int (*get_verity_descriptor)(struct inode *inode, void *buf,
>  				     size_t bufsize);
>  
>  	/**
> -	 * Read a Merkle tree page of the given inode.
> +	 * Read a Merkle tree block of the given inode.
>  	 *
> -	 * @inode: the inode
> -	 * @index: 0-based index of the page within the Merkle tree
> -	 * @num_ra_pages: The number of Merkle tree pages that should be
> -	 *		  prefetched starting at @index if the page at @index
> -	 *		  isn't already cached.  Implementations may ignore this
> -	 *		  argument; it's only a performance optimization.
> +	 * @req: read request; see struct fsverity_readmerkle
> +	 * @block: struct in which the filesystem returns the block.
> +	 *	   It also contains the block index.
>  	 *
>  	 * This can be called at any time on an open verity file.  It may be
> -	 * called by multiple processes concurrently, even with the same page.
> +	 * called by multiple processes concurrently.
> +	 *
> +	 * Implementations of this function should cache the Merkle tree blocks
> +	 * and issue I/O only if the block isn't already cached.  The filesystem
> +	 * can implement a custom cache or use the pagecache based helpers.
> +	 *
> +	 * Return: 0 on success, -errno on failure
> +	 */
> +	int (*read_merkle_tree_block)(const struct fsverity_readmerkle *req,
> +				      struct fsverity_blockbuf *block);
> +
> +	/**
> +	 * Release a Merkle tree block buffer.
> +	 *
> +	 * @inode: the inode the block is being dropped for
> +	 * @block: the block buffer to release
>  	 *
> -	 * Note that this must retrieve a *page*, not necessarily a *block*.
> +	 * This is called to release a Merkle tree block that was obtained with
> +	 * ->read_merkle_tree_block().  If multiple reads were nested, the drops
> +	 * are done in reverse order (to accommodate the use of local kmaps).
>  	 *
> -	 * Return: the page on success, ERR_PTR() on failure
> +	 * If @block->newly_verified is true, then implementations of this
> +	 * function should cache a flag saying that the block is verified, and
> +	 * return that flag from later ->read_merkle_tree_block() for the same
> +	 * block if the block hasn't been evicted from the cache in the
> +	 * meantime.  This avoids unnecessary revalidation of blocks.
>  	 */
> -	struct page *(*read_merkle_tree_page)(struct inode *inode,
> -					      pgoff_t index,
> -					      unsigned long num_ra_pages);
> +	void (*drop_merkle_tree_block)(struct inode *inode,
> +				       struct fsverity_blockbuf *block);
>  
>  	/**
>  	 * Write a Merkle tree block to the given inode.
>  	 *
>  	 * @inode: the inode for which the Merkle tree is being built
> @@ -168,10 +249,15 @@ static inline void fsverity_cleanup_inode(struct inode *inode)
>  
>  int fsverity_ioctl_read_metadata(struct file *filp, const void __user *uarg);
>  
>  /* verify.c */
>  
> +void fsverity_set_block_page(const struct fsverity_readmerkle *req,
> +			     struct fsverity_blockbuf *block,
> +			     struct page *page);
> +void fsverity_drop_page_merkle_tree_block(struct inode *inode,
> +					  struct fsverity_blockbuf *block);
>  bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset);
>  void fsverity_verify_bio(struct bio *bio);
>  void fsverity_enqueue_verify_work(struct work_struct *work);
>  
>  #else /* !CONFIG_FS_VERITY */
> 
> base-commit: a5131c3fdf2608f1c15f3809e201cf540eb28489
> -- 
> 2.45.0
> 
> 



