On Wed, Apr 24, 2024 at 08:59:41PM +0000, Eric Biggers wrote:
> On Wed, Apr 24, 2024 at 01:23:48PM -0700, Darrick J. Wong wrote:
> > > > How about "the hash of an i_blocksize-sized buffer of zeroes" for all
> > > > three?
> > >
> > > It's the Merkle tree block size, not the filesystem block size.  Or did you
> > > actually intend for this to use the filesystem block size?
> >
> > I actually did intend for this to be the fs block size, not the merkle
> > tree block size.  It's the bottom level that I care about shrinking.
> > Let's say that data[0-B] are the data blocks:
> >
> > root
> > +-internal0
> > | +-leaf0
> > | | +-data0
> > | | +-data1
> > | | `-data2
> > | `-leaf1
> > |   +-data3
> > |   +-data4
> > |   `-data5
> > `-internal1
> >   +-leaf2
> >   | +-data6
> >   | +-data7
> >   | `-data8
> >   `-leaf3
> >     +-data9
> >     +-dataA
> >     `-dataB
> >
> > (thanks to https://arthursonzogni.com/Diagon/#Tree )
> >
> > If data[3-5] are completely zeroes (unwritten blocks, sparse holes,
> > etc.) then I want to skip writing leaf1 of the merkle tree to disk.
> >
> > If it happens that the hashes of leaf[0-1] match hash(data3) then it's
> > frosting on top (as it were) that we can also skip internal0.  However,
> > the merkle tree has a high fanout factor (4096/32==128 in the common
> > case), so I care /much/ less about eliding those levels.
> >
> > > In struct merkle_tree_params, the "block size" is always the Merkle tree block
> > > size, so the type of block size seems clear in that context.  My complaint was
> > > just that it used the term "data block" to mean a block that is not necessarily
> > > a file contents block (which is what "data block" means elsewhere).
> >
> > Hm.  Given the confusion, would it help if I said that zero_digest
> > should only be used to elide leaf nodes of the merkle tree that hash the
> > contents of file content blocks?  Or is "the hash of an
> > i_blocksize-sized buffer of zeroes" sufficient?
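
[The elision test being proposed above can be sketched in a few lines of
Python.  This is a toy illustration, not fsverity's actual API: the helper
names (zero_digest, is_elidable) are hypothetical, and an unsalted SHA-256
stands in for the configured hash algorithm and salt.]

```python
import hashlib

def zero_digest(block_size: int) -> bytes:
    """Toy stand-in: digest of one block-sized buffer of zeroes (unsalted)."""
    return hashlib.sha256(b"\0" * block_size).digest()

def is_elidable(tree_block: bytes, block_size: int) -> bool:
    """True if a Merkle tree block is just zero_digest repeated, i.e. every
    block it covers is a run of zeroes (hole / unwritten extent)."""
    zd = zero_digest(block_size)
    fanout = block_size // len(zd)  # 4096 / 32 == 128 in the common case
    return tree_block == zd * fanout
```

[With a 4096-byte block size and SHA-256, leaf1 in the diagram above would
be elidable whenever every block it covers hashes a run of zeroes.]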
> >
> > What do you think of the commit message saying:
> >
> > "Compute the hash of one filesystem block's worth of zeroes.  Any merkle
> > tree leaf block containing only this hash can be elided at write time,
> > and its contents synthesized at read time.
> >
> > "Let's pretend that there's a file containing six data blocks and whose
> > merkle tree looks roughly like this:
> >
> > root
> > +--leaf0
> > |  +--data0
> > |  +--data1
> > |  `--data2
> > `--leaf1
> >    +--data3
> >    +--data4
> >    `--data5
> >
> > "If data[0-2] are sparse holes, then leaf0 will contain a repeating
> > sequence of @zero_digest.  Therefore, leaf0 need not be written to disk
> > because its contents can be synthesized."
> It sounds like you're assuming that the file data is always hashed in
> filesystem block sized units.

Ohh!  Yes, I was making that assumption, and now I double-checked enable.c
and see this:

	/* Hash each data block, also hashing the tree blocks as they fill up */
	for (offset = 0; offset < data_size; offset += params->block_size) {
		ssize_t bytes_read;
		loff_t pos = offset;

		buffers[-1].filled = min_t(u64, params->block_size,
					   data_size - offset);
		bytes_read = __kernel_read(filp, buffers[-1].data,
					   buffers[-1].filled, &pos);

So yes, you're right: @zero_digest is the hash of a *merkle tree
block-sized* buffer of zeroes.  And if ->write_merkle_tree_block sees that
the block is a repeating sequence of @zero_digest, it can skip writing
that block to disk, no matter where that block happens to be in the tree.

> block sized units.  That's not how it works.  The block size that's
> selected for fsverity (which is a power of 2 between 1024 and
> min(fs_block_size, PAGE_SIZE), inclusively) is used for both the data
> blocks and the Merkle tree blocks.
>
> This is intentional, so that people can e.g.
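
[The block-size rule Eric states above can be written out as a small
predicate.  This is a hypothetical helper mirroring the stated constraint,
not kernel code: the block size must be a power of 2 between 1024 and
min(fs_block_size, PAGE_SIZE), inclusive.]

```python
def merkle_block_size_ok(block_size: int, fs_block_size: int,
                         page_size: int = 4096) -> bool:
    """Check the fsverity block-size constraint as described above:
    a power of 2 in [1024, min(fs_block_size, PAGE_SIZE)]."""
    is_pow2 = block_size > 0 and (block_size & (block_size - 1)) == 0
    return is_pow2 and 1024 <= block_size <= min(fs_block_size, page_size)
```

[So with 4K pages, a 4096-byte Merkle block size is acceptable on both a
4K-block and a 16K-block filesystem, which is what makes the file digest
portable between them.]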
> calculate the fsverity digest of a
> file using a 4K block size, and deploy the file to both filesystems that use a
> 4K filesystem block size and filesystems that use a 16K filesystem block size,
> and get the same fsverity file digest each time.

Aha, yes, that makes more sense.  I had wondered if people actually copied
merkle tree data between filesystems.

> I've considered offering the ability to configure the data block size separately
> from the Merkle tree block size, like what dm-verity does.  This hasn't seemed
> useful, though.  And in any case, it should not be tied to the FS block size.
>
> A better way to think about things might be that the Merkle tree actually
> *includes* the data, as opposed to being separate from it.  In this respect,
> it's natural that the Merkle tree parameters including block size, hash
> algorithm, and salt apply both to blocks that contain file data and to blocks
> that contain hashes.  In general, all the fsverity code and documentation
> probably needs to be clearer about whether, when referring to the Merkle tree,
> it means just the hash blocks, or if it means the conceptual full tree that
> includes the file's data.

Yes, that clears things right up.  Thank you for correcting me. :)

--D

> - Eric
>