On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote: > On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote: > > One of essential ideas of fs-verity is that pages which are already > > verified won't need to be re-verified if they still in page cache. > > > > XFS will store Merkle tree blocks in extended file attributes. When > > read extended attribute data is put into xfs_buf. > > > > fs-verity uses PG_checked flag to track status of the blocks in the > > page. This flag can has two meanings - page was re-instantiated and > > the only block in the page is verified. > > > > However, in XFS, the data in the buffer is not aligned with xfs_buf > > pages and we don't have a reference to these pages. Moreover, these > > pages are released when value is copied out in xfs_attr code. In > > other words, we can not directly mark underlying xfs_buf's pages as > > verified as it's done by fs-verity for other filesystems. > > > > One way to track that these pages were processed by fs-verity is to > > mark buffer as verified instead. If buffer is evicted the incore > > XBF_VERITY_SEEN flag is lost. When the xattr is read again > > xfs_attr_get() returns new buffer without the flag. The xfs_buf's > > flag is then used to tell fs-verity this buffer was cached or not. > > > > The second state indicated by PG_checked is if the only block in the > > PAGE is verified. This is not the case for XFS as there could be > > multiple blocks in single buffer (page size 64k block size 4k). This > > is handled by fs-verity bitmap. fs-verity is always uses bitmap for > > XFS despite of Merkle tree block size. > > > > The meaning of the flag is that value of the extended attribute in > > the buffer is processed by fs-verity. > > > > Signed-off-by: Andrey Albershteyn <aalbersh@xxxxxxxxxx> > > --- > > fs/xfs/xfs_buf.h | 18 ++++++++++-------- > > 1 file changed, 10 insertions(+), 8 deletions(-) > > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > > index 73249abca968..2a73918193ba 100644 > > --- a/fs/xfs/xfs_buf.h > > +++ b/fs/xfs/xfs_buf.h > > @@ -24,14 +24,15 @@ struct xfs_buf; > > > > #define XFS_BUF_DADDR_NULL ((xfs_daddr_t) (-1LL)) > > > > -#define XBF_READ (1u << 0) /* buffer intended for reading from device */ > > -#define XBF_WRITE (1u << 1) /* buffer intended for writing to device */ > > -#define XBF_READ_AHEAD (1u << 2) /* asynchronous read-ahead */ > > -#define XBF_NO_IOACCT (1u << 3) /* bypass I/O accounting (non-LRU bufs) */ > > -#define XBF_ASYNC (1u << 4) /* initiator will not wait for completion */ > > -#define XBF_DONE (1u << 5) /* all pages in the buffer uptodate */ > > -#define XBF_STALE (1u << 6) /* buffer has been staled, do not find it */ > > -#define XBF_WRITE_FAIL (1u << 7) /* async writes have failed on this buffer */ > > +#define XBF_READ (1u << 0) /* buffer intended for reading from device */ > > +#define XBF_WRITE (1u << 1) /* buffer intended for writing to device */ > > +#define XBF_READ_AHEAD (1u << 2) /* asynchronous read-ahead */ > > +#define XBF_NO_IOACCT (1u << 3) /* bypass I/O accounting (non-LRU bufs) */ > > +#define XBF_ASYNC (1u << 4) /* initiator will not wait for completion */ > > +#define XBF_DONE (1u << 5) /* all pages in the buffer uptodate */ > > +#define XBF_STALE (1u << 6) /* buffer has been staled, do not find it */ > > +#define XBF_WRITE_FAIL (1u << 7) /* async writes have failed on this buffer */ > > +#define XBF_VERITY_SEEN (1u << 8) /* buffer was processed by fs-verity */ > > Yuck. I still dislike this entire approach. > > XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that > gets loaded on behalf of a merkle tree request, then uses the extra > space to shadow the contents of the ondisk block. AFAICT the shadow > doesn't get updated even if the cached data does, which sounds like a > landmine for coherency issues. > > XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of > exposing the low level buffering details of the xattr code to > xfs_attr_get callers. > > XBF_VERITY_SEEN is a layering violation because now the overall buffer > cache can track file metadata state. I think the reason why you need > this state flag is because the datadev buffer target indexes based on > physical xfs_daddr_t, whereas merkle tree blocks have their own internal > block numbers. You can't directly go from the merkle block number to an > xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block > fell out of memory. > > ISTR asking for a separation of these indices when I reviewed some > previous version of this patchset. At the time it seems to me that a > much more efficient way to cache the merkle tree blocks would be to set > up a direct (merkle tree block number) -> (blob of data) lookup table. > That I don't see here. > > In the spirit of the recent collaboration style that I've taken with > Christoph, I pulled your branch and started appending patches to it to > see if the design that I'm asking for is reasonable. As it so happens, > I was working on a simplified version of the xfs buffer cache ("fsbuf") > that could be used by simple filesystems to get them off of buffer > heads. > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g > verity tests), so now I know the idea is reasonable. Patches 11, 12, > 14, and 15 become unnecessary. However, this solution is itself grossly > overengineered, since all we want are the following operations: > > peek(key): returns an fsbuf if there's any data cached for key > > get(key): returns an fsbuf for key, regardless of state > > store(fsbuf, p): attach a memory buffer p to fsbuf > > Then the xfs ->read_merkle_tree_block function becomes: > > bp = peek(key) > if (bp) > /* return bp data up to verity */ > > p = xfs_attr_get(key) > if (!p) > /* error */ > > bp = get(key) > store(bp, p) Ok, that looks good - it definitely gets rid of a lot of the nastiness, but I have to ask: why does it need to be based on xfs_bufs? That's just wasting 300 bytes of memory on a handle to store a key and a opaque blob in a rhashtable. IIUC, the key here is a sequential index, so an xarray would be a much better choice as it doesn't require internal storage of the key. i.e. p = xa_load(key); if (p) return p; xfs_attr_get(key); if (!args->value) /* error */ /* * store the current value, freeing any old value that we * replaced at this key. Don't care about failure to store, * this is optimistic caching. */ p = xa_store(key, args->value, GFP_NOFS); if (p) kvfree(p); return args->value; -Dave. -- Dave Chinner david@xxxxxxxxxxxxx