On Tue, Dec 13, 2022 at 10:31:42PM -0800, Eric Biggers wrote:
> On Wed, Dec 14, 2022 at 09:11:39AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> > > On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > > > Not yet implemented:
> > > > - No pre-fetching of Merkle tree pages in the
> > > >   read_merkle_tree_page()
> > >
> > > This would be helpful, but not essential.
> > >
> > > > - No marking of already verified Merkle tree pages (each read, the
> > > >   whole tree is verified).
> >
> > Ah, I wasn't aware that this was missing.
> >
> > > This is essential to have, IMO.
> > >
> > > You *could* do what btrfs does, where it caches the Merkle tree
> > > pages in the inode's page cache past i_size, even though btrfs
> > > stores the Merkle tree separately from the file data on-disk.
> > >
> > > However, I'd guess that the other XFS developers would have an
> > > aversion to that approach, even though it would not affect the
> > > on-disk storage.
> >
> > Yup, on an architectural level it just seems wrong to cache secure
> > verification metadata in the same user accessible address space as
> > the data it verifies.
> >
> > > The alternatives would be to create a separate in-memory-only
> > > inode for the cache, or to build a custom cache with its own
> > > shrinker.
> >
> > The merkle tree blocks are cached in the XFS buffer cache.
> >
> > Andrey, could we just add a new flag to the xfs_buf->b_flags to
> > indicate that the buffer contains verified merkle tree records?
> > i.e. if it's not set after we've read the buffer, we need to verify
> > the buffer and set the flag; if we find a verified buffer in cache,
> > we can skip the verification?
>
> Well, my proposal at
> https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@xxxxxxxxxx
> is to keep tracking the "verified" status at the individual Merkle
> tree block level, by adding a bitmap fsverity_info::hash_block_verified.
> That is part of the fs/verity/ infrastructure, and all filesystems
> would be able to use it.

Yeah, I had a look at that rewrite of the verification code last
night - I get the gist of what it is doing, but a single patch of
that complexity is largely impossible to sanely review...

Correct me if I'm wrong, but won't using a bitmap with 1 bit per
verified block cause problems with contiguous memory allocation
pretty quickly? i.e. a 64kB bitmap only tracks 512k blocks, which is
only 2GB of merkle tree data. Hence at file sizes of 100+GB, the
bitmap would have to be kvmalloc()d to guarantee the allocation will
succeed.

I'm not really worried about the bitmap memory usage, just that it
handles large contiguous allocations sanely. I suspect we may
eventually need a sparse bitmap (e.g. the old btrfs bit-radix
implementation) to track verification in really large files
efficiently.

> However, since it's necessary to re-verify blocks that have been
> evicted and then re-instantiated, my patch also repurposes PG_checked
> as an indicator for whether the Merkle tree pages are newly
> instantiated. For a "non-page-cache cache", that part would need to
> be replaced with something equivalent.

Which we could get as a boolean state from the XFS buffer cache
fairly easily - did we find the buffer in cache, or was it read from
disk...
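To make that concrete, something like the untested sketch below is
what I have in mind for the b_flags idea above. Note that
XBF_VERITY_CHECKED is a made-up flag name, and the two helpers it
calls are placeholders for whatever the patchset ends up providing to
map a merkle tree block to an xfs_buf and to verify its contents:

#define XBF_VERITY_CHECKED	(1u << 27)	/* hypothetical b_flags bit */

static int
xfs_verity_read_merkle_block(
	struct xfs_inode	*ip,
	u64			index,
	struct xfs_buf		**bpp)
{
	struct xfs_buf		*bp;
	int			error;

	/*
	 * Find the buffer backing this merkle tree block, reading it
	 * from the attr fork if it isn't already in the buffer cache.
	 */
	error = xfs_verity_get_merkle_buf(ip, index, &bp);	/* placeholder */
	if (error)
		return error;

	if (!(bp->b_flags & XBF_VERITY_CHECKED)) {
		/*
		 * The buffer was either just read from disk or hasn't
		 * been verified since it entered the cache, so verify
		 * it now and record that so later cache hits can skip
		 * straight to using the contents.
		 */
		error = xfs_verity_verify_buf(ip, bp);		/* placeholder */
		if (error) {
			xfs_buf_relse(bp);
			return error;
		}
		bp->b_flags |= XBF_VERITY_CHECKED;
	}

	*bpp = bp;
	return 0;
}

The nice part is that the flag dies with the buffer, so eviction and
re-instantiation automatically force re-verification without any
extra tracking on the fsverity side.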
> A different approach would be to make it so that every time a page
> (or "cache buffer", to call it something more generic) of N Merkle
> tree blocks is read, then all N of those blocks are verified
> immediately. Then there would be no need to track the "verified"
> status of individual blocks.

That won't work with XFS - merkle tree blocks are not contiguous in
the attribute b-tree, so there is no efficient "sequential bulk read"
option available. The xattr structure was largely chosen because it
allows for fast, deterministic single merkle tree block operations....

> My concerns with that approach are:
>
> * Most data reads only need a single Merkle tree block at the deepest
>   level.

Yup, see above. :)

>   If at least N tree blocks were verified any time that any were
>   verified at all, that would make the worst-case read latency worse.

*nod*

> * It's possible that the parents of N tree blocks are split across a
>   cache buffer. Thus, while N blocks can't have more than N parents,
>   and in practice would just have 1-2, those 2 parents could be split
>   into two separate cache buffers, with a total length of 2*N.
>   Verifying all of those would really increase the worst-case latency
>   as well.
>
> So I'm thinking that tracking the "verified" status of tree blocks
> individually is the right way to go. But I'd appreciate any other
> thoughts on this.

I think that having the fsverity code track the verified indexes
itself is much more flexible and self-contained, and is the right way
to go about it.

The other issue is that verify_page() assumes that it can drop the
reference to the cached object itself - the caller actually owns the
reference to the object, not the verify_page() code. Hence if we are
passing opaque buffers to verify_page() rather than page cache pages,
we need a ->drop_block method that gets called instead of put_page().
This will allow the filesystem to hold a reference to the merkle tree
block data while the verification occurs, ensuring that it doesn't get
reclaimed by memory pressure whilst still in use...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx