Re: [PATCH 18/26] xfs: use merkle tree offset as attr hash

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 9 May 2024 13:02:50 -0700

On Wed, May 08, 2024 at 10:02:52PM -0700, Christoph Hellwig wrote:
> On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote:
> > I guess we could make it really obvious by allocating range in the
> > mapping starting at MAX_FILEOFF and going downwards.  Chances are pretty
> > good that with the xattr info growing upwards they're never going to
> > meet.
> 
> Yes, although I'd avoid taking chances.  More below.
> 
> > > Or we decide the space above 2^32 blocks can't be used by attrs,
> > > and only by other users with other means of discover.  Say the
> > > verify hashes..
> > 
> > Well right now they can't be used by attrs because xfs_dablk_t isn't big
> > enough to fit a larger value.
> 
> Yes.

Thinking about this further, I think building the merkle tree becomes a
lot more difficult than the current design.  At first I thought of
reserving a static partition in the attr fork address range, but got
bogged donw in figuring out how big the static partition has to be.

Downthread I realized that the maximum size of a merkle tree is actually
ULONG_MAX blocks, which means that on a 64-bit machine there effectively
is no limit.

For an 8EB file with sha256 hashes (32 bytes) and 4k blocks, we need
2^(63-12) hashes.  Assuming maximum loading factor of 128 hashes per
block, btheight spits out:

# xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha256 -n '$(( 2 ** (63 - 12) ))
merkle_sha256: best case per 4096-byte block: 128 records (leaf) / 128 keyptrs (node)
level 0: 2251799813685248 records, 17592186044416 blocks
level 1: 17592186044416 records, 137438953472 blocks
level 2: 137438953472 records, 1073741824 blocks
level 3: 1073741824 records, 8388608 blocks
level 4: 8388608 records, 65536 blocks
level 5: 65536 records, 512 blocks
level 6: 512 records, 4 blocks
level 7: 4 records, 1 block
8 levels, 17730707194373 blocks total

That's about 2^45 blocks.  If the hash is sha512, double those figures.
For a 1k block size, we get:

# xfs_db /dev/sdf -c 'btheight -b 1024 merkle_sha256 -n '$(( 2 ** (63 - 10) ))
merkle_sha256: best case per 1024-byte block: 32 records (leaf) / 32 keyptrs (node)
level 0: 9007199254740992 records, 281474976710656 blocks
level 1: 281474976710656 records, 8796093022208 blocks
level 2: 8796093022208 records, 274877906944 blocks
level 3: 274877906944 records, 8589934592 blocks
level 4: 8589934592 records, 268435456 blocks
level 5: 268435456 records, 8388608 blocks
level 6: 8388608 records, 262144 blocks
level 7: 262144 records, 8192 blocks
level 8: 8192 records, 256 blocks
level 9: 256 records, 8 blocks
level 10: 8 records, 1 block
11 levels, 290554814669065 blocks total

That would be 2^49 blocks but mercifully fsverity doesn't allow more
than 8 levels of tree.

So I don't think it's a good idea to create a hardcoded partition in the
attr fork for merkle tree data, since it would have to be absurdly large
for the common case of sub-1T files:

# xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha512 -n '$(( 2 ** (40 - 12) ))
merkle_sha512: best case per 4096-byte block: 64 records (leaf) / 64 keyptrs (node)
level 0: 268435456 records, 4194304 blocks
level 1: 4194304 records, 65536 blocks
level 2: 65536 records, 1024 blocks
level 3: 1024 records, 16 blocks
level 4: 16 records, 1 block
5 levels, 4260881 blocks total

That led me to the idea of dynamic partitioning, where we find a sparse
part of the attr fork fileoff range and use that.  That burns a lot less
address range but means that we cannot elide merkle tree blocks that
contain entirely hash(zeroes) because elided blocks become sparse holes
in the attr fork, and xfs_bmap_first_unused can still find those holes.
I guess you could mark those blocks as unwritten, but that wastes space.

Setting even /that/ aside, how would we allocate/map the range?  I think
we'd want a second ATTR_VERITY attr to claim ownership of whatever attr
fork range we found.  xfs_fsverity_write_merkle would have to do
something like this, pretending that the merkle tree blocksize matches
the fs blocksize:

	offset = /* merkle tree pos to attr fork xfs_fileoff_t */
	xfs_trans_alloc()

	xfs_bmapi_write(..., offset, 1...);

	xfs_trans_buf_get()
	/* copy merkle tree contents */
	xfs_trans_log_buf()

	/* update verity extent attr value */
	xfs_attr_defer_add("verity.merkle",
			{fileoff: /* start of range */,
			 blockcount: /* blocks mapped so far */
			});

	xfs_trans_commit()

Note that xfs_fsverity_begin_enable would have to ensure that there's an
attr fork and that it's not in local format.  On the plus side, doing
all this transactionally means we have a second user of logged xattr
updates. :P

Online repair would need to grow new code to copy the merkle tree.

Tearing down the merkle tree (aka if tree setup fails and someone wants
to try again) use the "verity.merkle" attr to figure out which blocks to
clobber.

Caching at least is pretty easy, look up the "verity.merkle" attribute
to find the fileoff, compute the fileoff of the particular block we want
in the attr fork, xfs_buf_read the buffer, and toss the contents in the
incore cache that we have now.

<shrug> What do you think of that?

--D

> > The dangerous part here is that the code
> > silently truncates the outparam of xfs_bmap_first_unused, so I'll fix
> > that too.
> 
> Well, we should check for that in xfs_attr_rmt_find_hole /
> xfs_da_grow_inode_int, totally independent of the fsverity work.
> The condition is basically impossible to hit right now, but I'd rather
> make sure we do have a solid check.  I'll prepare a patch for it.
> 
>