On Thu, Apr 04, 2024 at 10:31:45PM -0400, Eric Biggers wrote:
> On Fri, Mar 29, 2024 at 05:33:27PM -0700, Darrick J. Wong wrote:
> > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > index b3506f56e180b..c9d97c2bebd84 100644
> > --- a/fs/verity/fsverity_private.h
> > +++ b/fs/verity/fsverity_private.h
> > @@ -154,4 +154,41 @@ static inline void fsverity_init_signature(void)
> >  
> >  void __init fsverity_init_workqueue(void);
> >  
> > +static inline bool fsverity_caches_blocks(const struct inode *inode)
> > +{
> > +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > +
> > +	WARN_ON_ONCE(vops->read_merkle_tree_block &&
> > +		     !vops->drop_merkle_tree_block);
> > +
> > +	return vops->read_merkle_tree_block != NULL;
> > +}
> > +
> > +static inline bool fsverity_uses_bitmap(const struct fsverity_info *vi,
> > +					const struct inode *inode)
> > +{
> > +	if (fsverity_caches_blocks(inode))
> > +		return false;
> > +
> > +	/*
> > +	 * If fs uses block-based Merkle tree caching, then fs-verity must use
> > +	 * hash_block_verified bitmap as there's no page to mark it with
> > +	 * PG_checked.
> > +	 */
> > +	return vi->tree_params.block_size != PAGE_SIZE;
> > +}
> 
> For checking whether the bitmap is in use, it's simpler and more efficient to
> just directly check whether vi->hash_block_verified is NULL, as the existing
> code does.  Only the code that allocates the bitmap actually needs to be aware
> of the details of when the bitmap gets enabled.
> 
> fsverity_caches_blocks() has a similar issue, where it could just be replaced
> with checking vops->read_merkle_tree_block != NULL directly (or equivalently
> vops->drop_merkle_tree_block, which works well in
> fsverity_drop_merkle_tree_block() since that's the function pointer it's
> calling).  The WARN_ON_ONCE() should be done in fsverity_create_info(), not
> inlined multiple times into the verification code.
Ok, I'll move the WARN_ON_ONCE there:

	/*
	 * If the filesystem implementation supplies Merkle tree content on a
	 * per-block basis, it must implement both the read and drop functions.
	 * If it supplies content on a per-page basis, neither should be
	 * provided.
	 */
	if (vops->read_merkle_tree_block)
		WARN_ON_ONCE(vops->drop_merkle_tree_block == NULL);
	else
		WARN_ON_ONCE(vops->drop_merkle_tree_block != NULL);

> (I think the name "fsverity_caches_blocks()" is also confusing because the
> fsverity support layer does not cache blocks, but all the filesystems do.  What
> it's actually trying to convey is a difference in how the filesystem caches the
> blocks, which I don't think it does a good job at.)

Ok, I'll drop both and open-code them with nicer comments.

> > +int fsverity_read_merkle_tree_block(struct inode *inode,
> > +				    const struct merkle_tree_params *params,
> > +				    u64 pos, unsigned long ra_bytes,
> > +				    struct fsverity_blockbuf *block);
> > +
> > +/*
> > + * Drop 'block' obtained with ->read_merkle_tree_block().  Calls out back to
> > + * filesystem if ->drop_merkle_tree_block() is set, otherwise, drop the
> > + * reference in the block->context.
> 
> "drop the reference" => "drops the reference"

Fixed.

> > +	/*
> > +	 * If fs uses block-based Merkle tree cachin, then fs-verity must use
> > +	 * hash_block_verified bitmap as there's no page to mark it with
> > +	 * PG_checked.
> > +	 */
> > +	if (fsverity_uses_bitmap(vi, inode)) {
> 
> The comment contradicts the code.

Urrrk, yes, it does.  Sorry about that.

	/*
	 * If the fs supplies Merkle tree blocks on a per-page basis and the
	 * page size doesn't match the block size, fs-verity must use the
	 * hash_block_verified bitmap instead of PG_checked.
	 */
	if (vops->read_merkle_tree_block == NULL &&
	    vi->tree_params.block_size != PAGE_SIZE) {
		...allocate bitmap...
	}

> > diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> > index f58432772d9ea..94fffa060f829 100644
> > --- a/fs/verity/read_metadata.c
> > +++ b/fs/verity/read_metadata.c
> > @@ -14,65 +14,60 @@
> >  
> >  static int fsverity_read_merkle_tree(struct inode *inode,
> >  				     const struct fsverity_info *vi,
> > -				     void __user *buf, u64 offset, int length)
> > +				     void __user *buf, u64 pos, int length)
> >  {
> > -	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > -	u64 end_offset;
> > -	unsigned int offs_in_page;
> > -	pgoff_t index, last_index;
> > +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> > +	const struct merkle_tree_params *params = &vi->tree_params;
> > +	unsigned int offs_in_block = pos & (params->block_size - 1);
> >  	int retval = 0;
> >  	int err = 0;
> >  
> > -	end_offset = min(offset + length, vi->tree_params.tree_size);
> > -	if (offset >= end_offset)
> > +	if (pos >= end_pos)
> >  		return 0;
> 
> The above 'pos >= end_pos' check is no longer necessary.

Aha, good catch.

> >  	/*
> > -	 * Iterate through each Merkle tree page in the requested range and copy
> > -	 * the requested portion to userspace.  Note that the Merkle tree block
> > -	 * size isn't important here, as we are returning a byte stream; i.e.,
> > -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> > +	 * Iterate through each Merkle tree block in the requested range and
> > +	 * copy the requested portion to userspace.  Note that we are returning
> > +	 * a byte stream.
> >  	 */
> > -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> > -		unsigned long num_ra_pages =
> > -			min_t(unsigned long, last_index - index + 1,
> > -			      inode->i_sb->s_bdi->io_pages);
> > -		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> > -						   PAGE_SIZE - offs_in_page);
> > -		struct page *page;
> > -		const void *virt;
> > +	while (pos < end_pos) {
> > +		unsigned long ra_bytes;
> > +		unsigned int bytes_to_copy;
> > +		struct fsverity_blockbuf block = {
> > +			.size = params->block_size,
> > +		};
> >  
> > -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> > -		if (IS_ERR(page)) {
> > -			err = PTR_ERR(page);
> > +		ra_bytes = min_t(unsigned long, end_pos - pos + 1,
> > +				 inode->i_sb->s_bdi->io_pages << PAGE_SHIFT);
> 
> This introduces an off-by-one error in the calculation of ra_bytes.  end_pos
> is exclusive, but the calculation of ra_bytes assumes it is inclusive.
> 
> Also, might io_pages << PAGE_SHIFT overflow an unsigned long?

I hope not -- max io and readahead should never be 4GB.  I'm a little
surprised this isn't s_bdi->ra_pages.

	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
				     ULONG_MAX);

	while (pos < end_pos) {
		ra_bytes = min_t(unsigned long, end_pos - pos, max_ra_bytes);

> > +		bytes_to_copy = min_t(u64, end_pos - pos,
> > +				      params->block_size - offs_in_block);
> > +
> > +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> > +				pos - offs_in_block, ra_bytes, &block);
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %d reading Merkle tree page %lu",
> > -				     err, index);
> > +				     "Error %d reading Merkle tree block %llu",
> > +				     err, pos);
> >  			break;
> 
> The error message should go into fsverity_read_merkle_tree_block() so that it
> does not need to be duplicated in its two callers.  This would, additionally,
> eliminate the need to introduce the 'err' variable in verify_data_block().

Do you want that to be a separate cleanup patch?
> > diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> > index 4fcad0825a120..0b5e11073e883 100644
> > --- a/fs/verity/verify.c
> > +++ b/fs/verity/verify.c
> > @@ -13,14 +13,27 @@
> >  static struct workqueue_struct *fsverity_read_workqueue;
> >  
> >  /*
> > - * Returns true if the hash block with index @hblock_idx in the tree, located in
> > - * @hpage, has already been verified.
> > + * Returns true if the hash block with index @hblock_idx in the tree has
> > + * already been verified.
> >   */
> > -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> > +static bool is_hash_block_verified(struct inode *inode,
> > +				   struct fsverity_blockbuf *block,
> >  				   unsigned long hblock_idx)
> 
> The comment should be updated to mention @block.

I'll mention @inode too:

/*
 * Returns true if the hash @block with index @hblock_idx in the merkle tree
 * for @inode has already been verified.
 */

> > @@ -143,11 +156,9 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >  	for (level = 0; level < params->num_levels; level++) {
> >  		unsigned long next_hidx;
> >  		unsigned long hblock_idx;
> > -		pgoff_t hpage_idx;
> > -		unsigned int hblock_offset_in_page;
> > +		u64 hblock_pos;
> >  		unsigned int hoffset;
> > -		struct page *hpage;
> > -		const void *haddr;
> > +		struct fsverity_blockbuf *block = &hblocks[level].block;
> >  
> >  		/*
> >  		 * The index of the block in the current level; also the index
> > @@ -158,36 +169,34 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >  		/* Index of the hash block in the tree overall */
> >  		hblock_idx = params->level_start[level] + next_hidx;
> >  
> > -		/* Index of the hash page in the tree overall */
> > -		hpage_idx = hblock_idx >> params->log_blocks_per_page;
> > -
> > -		/* Byte offset of the hash block within the page */
> > -		hblock_offset_in_page =
> > -			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
> > +		/* Byte offset of the hash block in the tree overall */
> > +		hblock_pos = hblock_idx << params->log_blocksize;
> 
> 'hblock_idx << params->log_blocksize' may overflow an unsigned long, so
> 'hblock_idx' needs to be cast to u64 before doing the shift.

Fixed.

> > +		if (level == 0)
> > +			ra_bytes = min(max_ra_bytes,
> > +				       params->tree_size - hblock_pos);
> > +		else
> > +			ra_bytes = 0;
> 
> The first argument to min() has type unsigned long, and the second has type u64.
> Doesn't this generate a warning on 32-bit systems?

Probably.  I'll turn it into a min_t(u64...).

> > @@ -325,7 +333,7 @@ void fsverity_verify_bio(struct bio *bio)
> >  
> >  	bio_for_each_folio_all(fi, bio) {
> >  		if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
> > -					max_ra_pages)) {
> > +					max_ra_pages << PAGE_SHIFT)) {
> >  			bio->bi_status = BLK_STS_IOERR;
> >  			break;
> 
> This function should calculate max_ra_bytes as bio->bi_iter.bi_size >> 2.  It's
> not necessary to convert to pages, then back to bytes again.

Aha, Andrey and I missed that.  Fixed.

> > +/**
> > + * fsverity_read_merkle_tree_block() - read Merkle tree block
> > + * @inode: inode to which this Merkle tree blocks belong
> 
> "this Merkle tree blocks belong" => "this Merkle tree block belongs"
> 
> > + * @params: merkle tree parameters
> > + * @pos: byte position within merkle tree
> > + * @ra_bytes: try to read ahead this many btes
> 
> "btes" => "bytes"

Both fixed.
> > +int fsverity_read_merkle_tree_block(struct inode *inode,
> > +				    const struct merkle_tree_params *params,
> > +				    u64 pos, unsigned long ra_bytes,
> > +				    struct fsverity_blockbuf *block)
> > +{
> > +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > +	unsigned long page_idx;
> > +	struct page *page;
> > +	unsigned long index;
> > +	unsigned int offset_in_page;
> > +
> > +	if (fsverity_caches_blocks(inode)) {
> > +		block->verified = false;
> > +		return vops->read_merkle_tree_block(inode, pos, ra_bytes,
> > +				params->log_blocksize, block);
> > +	}
> > +
> > +	index = pos >> params->log_blocksize;
> 
> Should the fourth parameter to ->read_merkle_tree_block be the block index
> (which is computed above) instead of log_blocksize?  XFS only uses
> params->log_blocksize to compute the block index anyway.

I don't know.  XFS is infamous for everyone having a different opinion,
and for developers being unable to drive a consensus or get a patchset
to completion.  So:

Andrey wrote an implementation that used the buffer cache and used block
indexes to load/store from the xattr structure.  I didn't like that
because of the layering violation.

willy suggested hanging an xarray off struct xfs_inode and using that to
cache merkle tree blocks.  For that design, we want the xarray indexes
in units of blocks to conserve space, which means passing in (pos,
log_blocksize) or a direct block index.

Christoph then suggested using a per-AG rhashtable to reduce per-inode
memory overhead, in which case the lookup key can be either (inumber,
pos) or (inumber, block index).  This is a better design if there aren't
really going to be that many xfs inodes with verity enabled, though we
lose per-inode sharding of the merkle blocks.

Dave thinks that verity inodes could constitute the majority of xfs
inodes some day and it should be in the inode instead.

Andrey and I do not have crystal balls; we have no idea how this
dchinner/hch disagreement will play out.
Earlier I got the sense that you wanted to move towards expressing all
the merkle tree info in units of bytes.  In a lot of ways it would be
preferable to move to block indexes instead, since there's never going
to be a meaningful access to merkle position 1337.

But you're the maintainer, so it's up to you. :)

> > +/**
> > + * struct fsverity_blockbuf - Merkle Tree block buffer
> > + * @kaddr: virtual address of the block's data
> > + * @offset: block's offset into Merkle tree
> > + * @size: the Merkle tree block size
> > + * @context: filesystem private context
> > + * @verified: has this buffer been validated?
> > + *
> > + * Buffer containing single Merkle Tree block. These buffers are passed
> > + *  - to filesystem, when fs-verity is building merkel tree,
> > + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> > + * Filesystems sets kaddr together with size to point to a memory which contains
> > + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> > + * written down to disk.
> 
> This comment still incorrectly claims that fsverity_blockbuf is being used for
> writes.

Oops, I forgot to delete that.

--D

> 
> - Eric