Re: [PATCH 03/18] fsverity: convert verification to use byte instead of page offsets

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Wed, 8 May 2024 13:14:29 -0700

On Thu, May 02, 2024 at 12:42:31AM +0000, Eric Biggers wrote:
> On Wed, May 01, 2024 at 03:33:03PM -0700, Darrick J. Wong wrote:
> > On Wed, May 01, 2024 at 12:33:14AM -0700, Christoph Hellwig wrote:
> > > > +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> > > > +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> > > > +	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
> > > > +				     ULONG_MAX);
> > > > +	const struct merkle_tree_params *params = &vi->tree_params;
> > > 
> > > bdi->io_pages is really a VM readahead concept.  I know this is existing
> > > code, but can we rething why this is even used here?
> > 
> > I would get rid of it entirely for the merkle-by-block case, since we'd
> > have to walk the xattr tree again just to find the next block.  XFS
> > ignores the readahead value entirely.
> > 
> > I think this only makes sense for the merkle-by-page case, and only
> > because ext4 and friends are stuffing the merkle data in the posteof
> > parts of the file mapping.
> > 
> > And even then, shouldn't we figure out the amount of readahead going on
> > and only ask for enough readahead of the merkle tree to satisfy that
> > readahead?
> 
> The existing code is:
> 
>                 unsigned long num_ra_pages =
>                         min_t(unsigned long, last_index - index + 1,
>                               inode->i_sb->s_bdi->io_pages);
> 
> So it does limit the readahead amount to the amount remaining to be read.
> 
> In addition, it's limited to io_pages.  It's possible that's not the best value
> to use (maybe it should be ra_pages?), but the intent was to just use a large
> readahead size, since this code is doing a fully sequential read.

io_pages is supposed to be the optimal IO size, whereas ra_pages is the
readahead size for the block device.  I don't know why you chose
io_pages, but I'm assuming there's a reason there. :)

Somewhat confusingly, I think mm/readahead.c picks the maximum of
io_pages and ra_pages, which doesn't clear things up for me either.

Personally I think fsverity should be using ra_pages here, but changing
it should be a different patch with a separate justification.  This
patch simply has to translate the merkle-by-page code to handle by-block.

> I do think that the concept of Merkle tree readahead makes sense regardless of
> how the blocks are being stored.  Having to go to disk every time a new 4K
> Merkle tree block is needed increases read latencies.  It doesn't need to be
> included in the initial implementation though.

Of course, if we're really ok with xfs making a giant left turn and
storing the entire merkle tree as one big chunk of file range in the
attr fork, then suddenly it *does* make sense to allow merkle tree
readahead again.

> > > And the returned/passed value should be a kernel pointer to the start
> > > of the in-memory copy of the block?
> > > to 
> > 
> > <shrug> This particular callsite is reading merkle data on behalf of an
> > ioctl that exports data.  Maybe we want the filesystem's errors to be
> > bounced up to userspace?
> 
> Yes, I think so.

Ok, thanks for confirming that.

> > > > +static bool is_hash_block_verified(struct inode *inode,
> > > > +				   struct fsverity_blockbuf *block,
> > > >  				   unsigned long hblock_idx)
> > > 
> > > Other fsverify code seems to use the (IMHO) much more readable
> > > two-tab indentation for prototype continuations, maybe stick to that?
> > 
> > I'll do that, if Eric says so. :)
> 
> My preference is to align continuations with the line that they're continuing:
> 
> static bool is_hash_block_verified(struct inode *inode,
> 				   struct fsverity_blockbuf *block,
> 				   unsigned long hblock_idx)
> 
> > > >
> > > >  {
> > > > +	struct fsverity_info *vi = inode->i_verity_info;
> > > > +	struct page *hpage = (struct page *)block->context;
> > > 
> > > block->context is a void pointer, no need for casting it.
> > 
> > Eric insisted on it:
> > https://lore.kernel.org/linux-xfs/20240306035622.GA68962@sol.localdomain/
> 
> No, I didn't.  It showed up in some code snippets that I suggested, but the
> casts originated from the patch itself.  Leaving out the cast is fine with me.

Oh ok.  I'll drop those then.

> > 
> > > > +	for (; level > 0; level--)
> > > > +		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);
> > > 
> > > Overlh long line here.  But the loop kinda looks odd anyway with the
> > > exta one off in the body instead of the loop.
> > 
> > I /think/ that's a side effect of reusing the value of @level after the
> > first loop fails as the initial conditions of the unwind loop.  AFAICT
> > it doesn't leak, but it's not entirely straightforward.
> 
> When an error occurs either ascending or descending the tree, we end up here
> with 'level' containing the number of levels that need to be cleaned up.  It
> might be clearer if it was called 'num_levels', though that could be confused
> with 'params->num_levels'.  Or we could use: 'while (level-- > 0)'.
> 
> This is unrelated to this patch though.

<nod>

--D

> - Eric
>