On Thu, Mar 07, 2024 at 02:49:03PM -0800, Eric Biggers wrote: > On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote: > > On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote: > > > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote: > > > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h > > > > index b3506f56e180..dad33e6ff0d6 100644 > > > > --- a/fs/verity/fsverity_private.h > > > > +++ b/fs/verity/fsverity_private.h > > > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void) > > > > > > > > void __init fsverity_init_workqueue(void); > > > > > > > > +/* > > > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to > > > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the > > > > + * block->context. > > > > + */ > > > > +void fsverity_drop_block(struct inode *inode, > > > > + struct fsverity_blockbuf *block); > > > > + > > > > #endif /* _FSVERITY_PRIVATE_H */ > > > > > > This should be paired with a helper function that reads a Merkle tree block by > > > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed. Besides > > > being consistent with having a helper function for drop, this would prevent code > > > duplication between verify_data_block() and fsverity_read_merkle_tree(). > > > > > > I recommend that it look like this: > > > > > > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos, > > > unsigned long ra_bytes, > > > struct fsverity_blockbuf *block); > > > > > > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes' > > > would be the number of bytes for the filesystem to (optionally) readahead if the > > > block is not yet cached. I think that things work out simpler if these values > > > are measured in bytes, not blocks. 'block' would be at the end because it's an > > > output, and it can be confusing to interleave inputs and outputs in parameters. > > > > FWIW I don't really like 'pos' here because that's usually short for > > "file position", which is a byte, and this looks a lot more like a > > merkle tree block number. > > > > u64 blkno? > > > > Or better yet use a typedef ("merkle_blkno_t") to make it really clear > > when we're dealing with a tree block number. Ignore checkpatch > > complaining about typeedefs. :) > > My suggestion is for 'pos' to be a byte position, in alignment with the > pagecache naming convention as well as ->write_merkle_tree_block which currently > uses a 'u64 pos' byte position too. > > It would also work for it to be a block index, in which case it should be named > 'index' or 'blkno' and have type unsigned long. I *think* that things work out > a bit cleaner if it's a byte position, but I could be convinced that it's > actually better for it to be a block index. It's probably cleaner (or at least willy says so) to measure everything in bytes. The only place that gets messy is if we want to cache merkle tree blocks in an xarray or something, in which case we'd want to >> by the block_shift to avoid leaving gaps. > > > How about changing the prototype to: > > > > > > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos); > > > > > > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or > > > can it happen in cases like filesystem corruption? If it can only happen due to > > > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message. > > > > I think XFS only passes to _invalidate_* the same pos that was passed to > > ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption > > problem. > > > > Perhaps this function ought to note that @pos is supposed to be the same > > value that was given to ->read_merkle_tree_block? > > > > Or: make the implementations return 1 for "reloaded from disk", 0 for > > "still in cache", or a negative error code. Then fsverity can call > > the invalidation routine itself and XFS doesn't have to worry about this > > part. > > > > (I think? I have questions about the xfs_invalidate_blocks function.) > > It looks like XFS can invalidate blocks other than the one being read by > ->read_merkle_tree_block. > > If it really was only a matter of the single block being read, then it would > indeed be simpler to just make it a piece of information returned from > ->read_merkle_tree_block. > > If the generic invalidation function is needed, it needs to be clearly > documented when filesystems are expected to invalidate blocks. I /think/ the generic invalidation is only necessary with the weird DOUBLE_ALLOC thing. If the other two implementations (ext4/f2fs) haven't needed a "nuke from orbit" function then xfs shouldn't require one too. > > > > + > > > > + /** > > > > + * Release the reference to a Merkle tree block > > > > + * > > > > + * @block: the block to release > > > > + * > > > > + * This is called when fs-verity is done with a block obtained with > > > > + * ->read_merkle_tree_block(). > > > > + */ > > > > + void (*drop_block)(struct fsverity_blockbuf *block); > > > > > > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block > > > > Yep. I noticed that xfs_verity.c doesn't put them together, which made > > me wonder if the write_merkle_tree_block path made use of that. It > > doesn't, AFAICT. > > > > And I think the reason is that when we're setting up the merkle tree, > > we want to stream the contents straight to disk instead of ending up > > with a huge cache that might not all be necessary? > > In the current patchset, fsverity_blockbuf isn't used for writes (despite the > comment saying it is). I think that's fine and that it keeps things simpler. > Filesystems can still cache the blocks that are passed to > ->write_merkle_tree_block if they want to. That already happens on ext4 and > f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache. Oh! Maybe it should do that then. At least the root block? --D > - Eric >