On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote: > On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote: > > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote: > > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h > > > index b3506f56e180..dad33e6ff0d6 100644 > > > --- a/fs/verity/fsverity_private.h > > > +++ b/fs/verity/fsverity_private.h > > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void) > > > > > > void __init fsverity_init_workqueue(void); > > > > > > +/* > > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to > > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the > > > + * block->context. > > > + */ > > > +void fsverity_drop_block(struct inode *inode, > > > + struct fsverity_blockbuf *block); > > > + > > > #endif /* _FSVERITY_PRIVATE_H */ > > > > This should be paired with a helper function that reads a Merkle tree block by > > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed. Besides > > being consistent with having a helper function for drop, this would prevent code > > duplication between verify_data_block() and fsverity_read_merkle_tree(). > > > > I recommend that it look like this: > > > > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos, > > unsigned long ra_bytes, > > struct fsverity_blockbuf *block); > > > > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes' > > would be the number of bytes for the filesystem to (optionally) readahead if the > > block is not yet cached. I think that things work out simpler if these values > > are measured in bytes, not blocks. 'block' would be at the end because it's an > > output, and it can be confusing to interleave inputs and outputs in parameters. > > FWIW I don't really like 'pos' here because that's usually short for > "file position", which is a byte, and this looks a lot more like a > merkle tree block number. > > u64 blkno? > > Or better yet use a typedef ("merkle_blkno_t") to make it really clear > when we're dealing with a tree block number. Ignore checkpatch > complaining about typeedefs. :) My suggestion is for 'pos' to be a byte position, in alignment with the pagecache naming convention as well as ->write_merkle_tree_block which currently uses a 'u64 pos' byte position too. It would also work for it to be a block index, in which case it should be named 'index' or 'blkno' and have type unsigned long. I *think* that things work out a bit cleaner if it's a byte position, but I could be convinced that it's actually better for it to be a block index. > > How about changing the prototype to: > > > > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos); > > > > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or > > can it happen in cases like filesystem corruption? If it can only happen due to > > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message. > > I think XFS only passes to _invalidate_* the same pos that was passed to > ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption > problem. > > Perhaps this function ought to note that @pos is supposed to be the same > value that was given to ->read_merkle_tree_block? > > Or: make the implementations return 1 for "reloaded from disk", 0 for > "still in cache", or a negative error code. Then fsverity can call > the invalidation routine itself and XFS doesn't have to worry about this > part. > > (I think? I have questions about the xfs_invalidate_blocks function.) It looks like XFS can invalidate blocks other than the one being read by ->read_merkle_tree_block. If it really was only a matter of the single block being read, then it would indeed be simpler to just make it a piece of information returned from ->read_merkle_tree_block. If the generic invalidation function is needed, it needs to be clearly documented when filesystems are expected to invalidate blocks. > > > + > > > + /** > > > + * Release the reference to a Merkle tree block > > > + * > > > + * @block: the block to release > > > + * > > > + * This is called when fs-verity is done with a block obtained with > > > + * ->read_merkle_tree_block(). > > > + */ > > > + void (*drop_block)(struct fsverity_blockbuf *block); > > > > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block > > Yep. I noticed that xfs_verity.c doesn't put them together, which made > me wonder if the write_merkle_tree_block path made use of that. It > doesn't, AFAICT. > > And I think the reason is that when we're setting up the merkle tree, > we want to stream the contents straight to disk instead of ending up > with a huge cache that might not all be necessary? In the current patchset, fsverity_blockbuf isn't used for writes (despite the comment saying it is). I think that's fine and that it keeps things simpler. Filesystems can still cache the blocks that are passed to ->write_merkle_tree_block if they want to. That already happens on ext4 and f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache. - Eric