On Wed, May 08, 2024 at 10:02:52PM -0700, Christoph Hellwig wrote: > On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote: > > I guess we could make it really obvious by allocating range in the > > mapping starting at MAX_FILEOFF and going downwards. Chances are pretty > > good that with the xattr info growing upwards they're never going to > > meet. > > Yes, although I'd avoid taking chances. More below. > > > > Or we decide the space above 2^32 blocks can't be used by attrs, > > > and only by other users with other means of discover. Say the > > > verify hashes.. > > > > Well right now they can't be used by attrs because xfs_dablk_t isn't big > > enough to fit a larger value. > > Yes. Thinking about this further, I think building the merkle tree becomes a lot more difficult than the current design. At first I thought of reserving a static partition in the attr fork address range, but got bogged donw in figuring out how big the static partition has to be. Downthread I realized that the maximum size of a merkle tree is actually ULONG_MAX blocks, which means that on a 64-bit machine there effectively is no limit. For an 8EB file with sha256 hashes (32 bytes) and 4k blocks, we need 2^(63-12) hashes. Assuming maximum loading factor of 128 hashes per block, btheight spits out: # xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha256 -n '$(( 2 ** (63 - 12) )) merkle_sha256: best case per 4096-byte block: 128 records (leaf) / 128 keyptrs (node) level 0: 2251799813685248 records, 17592186044416 blocks level 1: 17592186044416 records, 137438953472 blocks level 2: 137438953472 records, 1073741824 blocks level 3: 1073741824 records, 8388608 blocks level 4: 8388608 records, 65536 blocks level 5: 65536 records, 512 blocks level 6: 512 records, 4 blocks level 7: 4 records, 1 block 8 levels, 17730707194373 blocks total That's about 2^45 blocks. If the hash is sha512, double those figures. For a 1k block size, we get: # xfs_db /dev/sdf -c 'btheight -b 1024 merkle_sha256 -n '$(( 2 ** (63 - 10) )) merkle_sha256: best case per 1024-byte block: 32 records (leaf) / 32 keyptrs (node) level 0: 9007199254740992 records, 281474976710656 blocks level 1: 281474976710656 records, 8796093022208 blocks level 2: 8796093022208 records, 274877906944 blocks level 3: 274877906944 records, 8589934592 blocks level 4: 8589934592 records, 268435456 blocks level 5: 268435456 records, 8388608 blocks level 6: 8388608 records, 262144 blocks level 7: 262144 records, 8192 blocks level 8: 8192 records, 256 blocks level 9: 256 records, 8 blocks level 10: 8 records, 1 block 11 levels, 290554814669065 blocks total That would be 2^49 blocks but mercifully fsverity doesn't allow more than 8 levels of tree. So I don't think it's a good idea to create a hardcoded partition in the attr fork for merkle tree data, since it would have to be absurdly large for the common case of sub-1T files: # xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha512 -n '$(( 2 ** (40 - 12) )) merkle_sha512: best case per 4096-byte block: 64 records (leaf) / 64 keyptrs (node) level 0: 268435456 records, 4194304 blocks level 1: 4194304 records, 65536 blocks level 2: 65536 records, 1024 blocks level 3: 1024 records, 16 blocks level 4: 16 records, 1 block 5 levels, 4260881 blocks total That led me to the idea of dynamic partitioning, where we find a sparse part of the attr fork fileoff range and use that. That burns a lot less address range but means that we cannot elide merkle tree blocks that contain entirely hash(zeroes) because elided blocks become sparse holes in the attr fork, and xfs_bmap_first_unused can still find those holes. I guess you could mark those blocks as unwritten, but that wastes space. Setting even /that/ aside, how would we allocate/map the range? I think we'd want a second ATTR_VERITY attr to claim ownership of whatever attr fork range we found. xfs_fsverity_write_merkle would have to do something like this, pretending that the merkle tree blocksize matches the fs blocksize: offset = /* merkle tree pos to attr fork xfs_fileoff_t */ xfs_trans_alloc() xfs_bmapi_write(..., offset, 1...); xfs_trans_buf_get() /* copy merkle tree contents */ xfs_trans_log_buf() /* update verity extent attr value */ xfs_attr_defer_add("verity.merkle", {fileoff: /* start of range */, blockcount: /* blocks mapped so far */ }); xfs_trans_commit() Note that xfs_fsverity_begin_enable would have to ensure that there's an attr fork and that it's not in local format. On the plus side, doing all this transactionally means we have a second user of logged xattr updates. :P Online repair would need to grow new code to copy the merkle tree. Tearing down the merkle tree (aka if tree setup fails and someone wants to try again) use the "verity.merkle" attr to figure out which blocks to clobber. Caching at least is pretty easy, look up the "verity.merkle" attribute to find the fileoff, compute the fileoff of the particular block we want in the attr fork, xfs_buf_read the buffer, and toss the contents in the incore cache that we have now. <shrug> What do you think of that? --D > > The dangerous part here is that the code > > silently truncates the outparam of xfs_bmap_first_unused, so I'll fix > > that too. > > Well, we should check for that in xfs_attr_rmt_find_hole / > xfs_da_grow_inode_int, totally independent of the fsverity work. > The condition is basically impossible to hit right now, but I'd rather > make sure we do have a solid check. I'll prepare a patch for it. > >