On Tue, Dec 13, 2022 at 10:31:42PM -0800, Eric Biggers wrote:
> On Wed, Dec 14, 2022 at 09:11:39AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> > > On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > > > Not yet implemented:
> > > > - No pre-fetching of Merkle tree pages in the
> > > >   read_merkle_tree_page()
> > >
> > > This would be helpful, but not essential.
> > >
> > > > - No marking of already verified Merkle tree pages (each read, the
> > > >   whole tree is verified).
> >
> > Ah, I wasn't aware that this was missing.
> >
> > > This is essential to have, IMO.
> > >
> > > You *could* do what btrfs does, where it caches the Merkle tree
> > > pages in the inode's page cache past i_size, even though btrfs
> > > stores the Merkle tree separately from the file data on-disk.
> > >
> > > However, I'd guess that the other XFS developers would have an
> > > aversion to that approach, even though it would not affect the
> > > on-disk storage.
> >
> > Yup, on an architectural level it just seems wrong to cache secure
> > verification metadata in the same user accessible address space as
> > the data it verifies.
> >
> > > The alternatives would be to create a separate in-memory-only
> > > inode for the cache, or to build a custom cache with its own
> > > shrinker.
> >
> > The merkle tree blocks are cached in the XFS buffer cache.
> >
> > Andrey, could we just add a new flag to the xfs_buf->b_flags to
> > indicate that the buffer contains verified merkle tree records?
> > i.e. if it's not set after we've read the buffer, we need to verify
> > the buffer and set the flag; if we find a verified buffer in cache,
> > we can skip the verification?
>
> Well, my proposal at
> https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@xxxxxxxxxx
> is to keep tracking the "verified" status at the individual Merkle
> tree block level, by adding a bitmap fsverity_info::hash_block_verified.
> That is part of the fs/verity/ infrastructure, and all filesystems
> would be able to use it.

Yeah, I had a look at that rewrite of the verification code last
night - I get the gist of what it is doing, but a single patch of
that complexity is largely impossible to sanely review...

Correct me if I'm wrong, but won't using a bitmap with 1 bit per
verified block cause problems with contiguous memory allocation
pretty quickly? i.e. a 64kB bitmap only tracks 512k blocks, which is
only 2GB of merkle tree data. Hence at file sizes of 100+GB, the
bitmap would have to be kvmalloc()d to guarantee the allocation will
succeed.

I'm not really worried about the bitmap memory usage, just that it
handles large contiguous allocations sanely. I suspect we may
eventually need a sparse bitmap (e.g. the old btrfs bit-radix
implementation) to track verification in really large files
efficiently.

> However, since it's necessary to re-verify blocks that have been
> evicted and then re-instantiated, my patch also repurposes PG_checked
> as an indicator for whether the Merkle tree pages are newly
> instantiated. For a "non-page-cache cache", that part would need to
> be replaced with something equivalent.

Which we could get as a boolean state from the XFS buffer cache
fairly easily - did we find the buffer in cache, or was it read from
disk...
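To make that concrete, something like the untested sketch below is
what I have in mind for the b_flags idea above. Note that
XBF_VERITY_CHECKED is a made-up flag name, and the two helpers it
calls are placeholders for whatever the patchset ends up providing to
map a merkle tree block to an xfs_buf and to verify its contents:

#define XBF_VERITY_CHECKED	(1u << 27)	/* hypothetical b_flags bit */

static int
xfs_verity_read_merkle_block(
	struct xfs_inode	*ip,
	u64			index,
	struct xfs_buf		**bpp)
{
	struct xfs_buf		*bp;
	int			error;

	/*
	 * Find the buffer backing this merkle tree block, reading it
	 * from the attr fork if it isn't already in the buffer cache.
	 */
	error = xfs_verity_get_merkle_buf(ip, index, &bp);	/* placeholder */
	if (error)
		return error;

	if (!(bp->b_flags & XBF_VERITY_CHECKED)) {
		/*
		 * The buffer was either just read from disk or hasn't
		 * been verified since it entered the cache, so verify
		 * it now and record that so later cache hits can skip
		 * straight to using the contents.
		 */
		error = xfs_verity_verify_buf(ip, bp);		/* placeholder */
		if (error) {
			xfs_buf_relse(bp);
			return error;
		}
		bp->b_flags |= XBF_VERITY_CHECKED;
	}

	*bpp = bp;
	return 0;
}

The nice part is that the flag dies with the buffer, so eviction and
re-instantiation automatically force re-verification without any
extra tracking on the fsverity side.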
> A different approach would be to make it so that every time a page
> (or "cache buffer", to call it something more generic) of N Merkle
> tree blocks is read, then all N of those blocks are verified
> immediately. Then there would be no need to track the "verified"
> status of individual blocks.

That won't work with XFS - merkle tree blocks are not contiguous in
the attribute b-tree, so there is no efficient "sequential bulk read"
option available. The xattr structure was largely chosen because it
allows for fast, deterministic single merkle tree block operations....

> My concerns with that approach are:
>
> * Most data reads only need a single Merkle tree block at the deepest
>   level.

Yup, see above. :)

>   If at least N tree blocks were verified any time that any were
>   verified at all, that would make the worst-case read latency worse.

*nod*

> * It's possible that the parents of N tree blocks are split across a
>   cache buffer. Thus, while N blocks can't have more than N parents,
>   and in practice would just have 1-2, those 2 parents could be split
>   into two separate cache buffers, with a total length of 2*N.
>   Verifying all of those would really increase the worst-case latency
>   as well.
>
> So I'm thinking that tracking the "verified" status of tree blocks
> individually is the right way to go. But I'd appreciate any other
> thoughts on this.

I think that having the fsverity code track the verified indexes
itself is much more flexible and self-contained, and is the right way
to go about it.

The other issue is that verify_page() assumes that it can drop the
reference to the cached object itself - the caller actually owns the
reference to the object, not the verify_page() code. Hence if we are
passing opaque buffers to verify_page() rather than page cache pages,
we need a ->drop_block method that gets called instead of put_page().
This will allow the filesystem to hold a reference to the merkle tree
block data while the verification occurs, ensuring that it doesn't get
reclaimed by memory pressure whilst still in use...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx