Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Sat, 9 Mar 2024 08:28:28 -0800

On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> > > On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > > > One of essential ideas of fs-verity is that pages which are already
> > > > verified won't need to be re-verified if they still in page cache.
> > > > 
> > > > XFS will store Merkle tree blocks in extended file attributes. When
> > > > read extended attribute data is put into xfs_buf.
> > > > 
> > > > fs-verity uses PG_checked flag to track status of the blocks in the
> > > > page. This flag can has two meanings - page was re-instantiated and
> > > > the only block in the page is verified.
> > > > 
> > > > However, in XFS, the data in the buffer is not aligned with xfs_buf
> > > > pages and we don't have a reference to these pages. Moreover, these
> > > > pages are released when value is copied out in xfs_attr code. In
> > > > other words, we can not directly mark underlying xfs_buf's pages as
> > > > verified as it's done by fs-verity for other filesystems.
> > > > 
> > > > One way to track that these pages were processed by fs-verity is to
> > > > mark buffer as verified instead. If buffer is evicted the incore
> > > > XBF_VERITY_SEEN flag is lost. When the xattr is read again
> > > > xfs_attr_get() returns new buffer without the flag. The xfs_buf's
> > > > flag is then used to tell fs-verity this buffer was cached or not.
> > > > 
> > > > The second state indicated by PG_checked is if the only block in the
> > > > PAGE is verified. This is not the case for XFS as there could be
> > > > multiple blocks in single buffer (page size 64k block size 4k). This
> > > > is handled by fs-verity bitmap. fs-verity is always uses bitmap for
> > > > XFS despite of Merkle tree block size.
> > > > 
> > > > The meaning of the flag is that value of the extended attribute in
> > > > the buffer is processed by fs-verity.
> > > > 
> > > > Signed-off-by: Andrey Albershteyn <aalbersh@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/xfs_buf.h | 18 ++++++++++--------
> > > >  1 file changed, 10 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > > index 73249abca968..2a73918193ba 100644
> > > > --- a/fs/xfs/xfs_buf.h
> > > > +++ b/fs/xfs/xfs_buf.h
> > > > @@ -24,14 +24,15 @@ struct xfs_buf;
> > > >  
> > > >  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
> > > >  
> > > > -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> > > > -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> > > > -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> > > > -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > > -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> > > > -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> > > > -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> > > > -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> > > > +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> > > > +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> > > > +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> > > > +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > > +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> > > > +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> > > > +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> > > > +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> > > > +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
> > > 
> > > Yuck.  I still dislike this entire approach.
> > > 
> > > XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
> > > gets loaded on behalf of a merkle tree request, then uses the extra
> > > space to shadow the contents of the ondisk block.  AFAICT the shadow
> > > doesn't get updated even if the cached data does, which sounds like a
> > > landmine for coherency issues.
> > >
> > > XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
> > > exposing the low level buffering details of the xattr code to
> > > xfs_attr_get callers.
> > > 
> > > XBF_VERITY_SEEN is a layering violation because now the overall buffer
> > > cache can track file metadata state.  I think the reason why you need
> > > this state flag is because the datadev buffer target indexes based on
> > > physical xfs_daddr_t, whereas merkle tree blocks have their own internal
> > > block numbers.  You can't directly go from the merkle block number to an
> > > xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
> > > fell out of memory.
> > >
> > > ISTR asking for a separation of these indices when I reviewed some
> > > previous version of this patchset.  At the time it seems to me that a
> > > much more efficient way to cache the merkle tree blocks would be to set
> > > up a direct (merkle tree block number) -> (blob of data) lookup table.
> > > That I don't see here.
> > >
> > > In the spirit of the recent collaboration style that I've taken with
> > > Christoph, I pulled your branch and started appending patches to it to
> > > see if the design that I'm asking for is reasonable.  As it so happens,
> > > I was working on a simplified version of the xfs buffer cache ("fsbuf")
> > > that could be used by simple filesystems to get them off of buffer
> > > heads.
> > > 
> > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > > overengineered, since all we want are the following operations:
> > > 
> > > peek(key): returns an fsbuf if there's any data cached for key
> > > 
> > > get(key): returns an fsbuf for key, regardless of state
> > > 
> > > store(fsbuf, p): attach a memory buffer p to fsbuf
> > > 
> > > Then the xfs ->read_merkle_tree_block function becomes:
> > > 
> > > 	bp = peek(key)
> > > 	if (bp)
> > > 		/* return bp data up to verity */
> > > 
> > > 	p = xfs_attr_get(key)
> > > 	if (!p)
> > > 		/* error */
> > > 
> > > 	bp = get(key)
> > > 	store(bp, p)
> > 
> > Ok, that looks good - it definitely gets rid of a lot of the
> > nastiness, but I have to ask: why does it need to be based on
> > xfs_bufs?
> 
> (copying from IRC) It was still warm in my brain L2 after all the xfile
> buftarg cleaning and merging that just got done a few weeks ago.   So I
> went with the simplest thing I could rig up to test my ideas, and now
> we're at the madly iterate until exhaustion stage. ;)
> 
> >            That's just wasting 300 bytes of memory on a handle to
> > store a key and a opaque blob in a rhashtable.
> 
> Yep.  The fsbufs implementation was a lot more slender, but a bunch more
> code.  I agree that I ought to go look at xarrays or something that's
> more of a direct mapping as a next step.  However, i wanted to get
> Andrey's feedback on this general approach first.
> 
> > IIUC, the key here is a sequential index, so an xarray would be a
> > much better choice as it doesn't require internal storage of the
> > key.
> 
> I wonder, what are the access patterns for merkle blobs?  Is it actually
> sequential, or is more like 0 -> N -> N*N as we walk towards leaves?
> 
> Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> that done because merkle trees may some day have more than 2^32 blocks
> in them?  That won't play well with things like xarrays on 32-bit
> machines.
> 
> (Granted we've been talking about deprecating XFS on 32-bit for a while
> now but we're not the whole world)
> 
> > i.e.
> > 
> > 	p = xa_load(key);
> > 	if (p)
> > 		return p;
> > 
> > 	xfs_attr_get(key);
> > 	if (!args->value)
> > 		/* error */
> > 
> > 	/*
> > 	 * store the current value, freeing any old value that we
> > 	 * replaced at this key. Don't care about failure to store,
> > 	 * this is optimistic caching.
> > 	 */
> > 	p = xa_store(key, args->value, GFP_NOFS);
> > 	if (p)
> > 		kvfree(p);
> > 	return args->value;
> 
> Attractive.  Will have to take a look at that tomorrow.

Done.  I think.  Not sure that I actually got all the interactions
between the shrinker and the xarray correct though.  KASAN and lockdep
don't have any complaints running fstests, so that's a start.

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09

--D

> 
> --D
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> > 
>