On Thu, May 02, 2024 at 12:42:31AM +0000, Eric Biggers wrote: > On Wed, May 01, 2024 at 03:33:03PM -0700, Darrick J. Wong wrote: > > On Wed, May 01, 2024 at 12:33:14AM -0700, Christoph Hellwig wrote: > > > > + const u64 end_pos = min(pos + length, vi->tree_params.tree_size); > > > > + struct backing_dev_info *bdi = inode->i_sb->s_bdi; > > > > + const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT, > > > > + ULONG_MAX); > > > > + const struct merkle_tree_params *params = &vi->tree_params; > > > > > > bdi->io_pages is really a VM readahead concept. I know this is existing > > > code, but can we rething why this is even used here? > > > > I would get rid of it entirely for the merkle-by-block case, since we'd > > have to walk the xattr tree again just to find the next block. XFS > > ignores the readahead value entirely. > > > > I think this only makes sense for the merkle-by-page case, and only > > because ext4 and friends are stuffing the merkle data in the posteof > > parts of the file mapping. > > > > And even then, shouldn't we figure out the amount of readahead going on > > and only ask for enough readahead of the merkle tree to satisfy that > > readahead? > > The existing code is: > > unsigned long num_ra_pages = > min_t(unsigned long, last_index - index + 1, > inode->i_sb->s_bdi->io_pages); > > So it does limit the readahead amount to the amount remaining to be read. > > In addition, it's limited to io_pages. It's possible that's not the best value > to use (maybe it should be ra_pages?), but the intent was to just use a large > readahead size, since this code is doing a fully sequential read. io_pages is supposed to be the optimal IO size, whereas ra_pages is the readahead size for the block device. I don't know why you chose io_pages, but I'm assuming there's a reason there. :) Somewhat confusingly, I think mm/readahead.c picks the maximum of io_pages and ra_pages, which doesn't clear things up for me either. Personally I think fsverity should be using ra_pages here, but changing it should be a different patch with a separate justification. This patch simply has to translate the merkle-by-page code to handle by-block. > I do think that the concept of Merkle tree readahead makes sense regardless of > how the blocks are being stored. Having to go to disk every time a new 4K > Merkle tree block is needed increases read latencies. It doesn't need to be > included in the initial implementation though. Of course, if we're really ok with xfs making a giant left turn and storing the entire merkle tree as one big chunk of file range in the attr fork, then suddenly it *does* make sense to allow merkle tree readahead again. > > > And the returned/passed value should be a kernel pointer to the start > > > of the in-memory copy of the block? > > > to > > > > <shrug> This particular callsite is reading merkle data on behalf of an > > ioctl that exports data. Maybe we want the filesystem's errors to be > > bounced up to userspace? > > Yes, I think so. Ok, thanks for confirming that. > > > > +static bool is_hash_block_verified(struct inode *inode, > > > > + struct fsverity_blockbuf *block, > > > > unsigned long hblock_idx) > > > > > > Other fsverify code seems to use the (IMHO) much more readable > > > two-tab indentation for prototype continuations, maybe stick to that? > > > > I'll do that, if Eric says so. :) > > My preference is to align continuations with the line that they're continuing: > > static bool is_hash_block_verified(struct inode *inode, > struct fsverity_blockbuf *block, > unsigned long hblock_idx) > > > > > > > > > { > > > > + struct fsverity_info *vi = inode->i_verity_info; > > > > + struct page *hpage = (struct page *)block->context; > > > > > > block->context is a void pointer, no need for casting it. > > > > Eric insisted on it: > > https://lore.kernel.org/linux-xfs/20240306035622.GA68962@sol.localdomain/ > > No, I didn't. It showed up in some code snippets that I suggested, but the > casts originated from the patch itself. Leaving out the cast is fine with me. Oh ok. I'll drop those then. > > > > > > + for (; level > 0; level--) > > > > + fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block); > > > > > > Overlh long line here. But the loop kinda looks odd anyway with the > > > exta one off in the body instead of the loop. > > > > I /think/ that's a side effect of reusing the value of @level after the > > first loop fails as the initial conditions of the unwind loop. AFAICT > > it doesn't leak, but it's not entirely straightforward. > > When an error occurs either ascending or descending the tree, we end up here > with 'level' containing the number of levels that need to be cleaned up. It > might be clearer if it was called 'num_levels', though that could be confused > with 'params->num_levels'. Or we could use: 'while (level-- > 0)'. > > This is unrelated to this patch though. <nod> --D > - Eric >