On Wed, Apr 24, 2024 at 08:59:41PM +0000, Eric Biggers wrote:
> On Wed, Apr 24, 2024 at 01:23:48PM -0700, Darrick J. Wong wrote:
> > > > How about "the hash of an i_blocksize-sized buffer of zeroes" for all
> > > > three?
> > >
> > > It's the Merkle tree block size, not the filesystem block size.  Or did you
> > > actually intend for this to use the filesystem block size?
> >
> > I actually did intend for this to be the fs block size, not the merkle
> > tree block size.  It's the bottom level that I care about shrinking.
> > Let's say that data[0-B] are the data blocks:
> >
> > root
> > +-internal0
> > | +-leaf0
> > | | +-data0
> > | | +-data1
> > | | `-data2
> > | `-leaf1
> > |   +-data3
> > |   +-data4
> > |   `-data5
> > `-internal1
> >   +-leaf2
> >   | +-data6
> >   | +-data7
> >   | `-data8
> >   `-leaf3
> >     +-data9
> >     +-dataA
> >     `-dataB
> >
> > (thanks to https://arthursonzogni.com/Diagon/#Tree )
> >
> > If data[3-5] are completely zeroes (unwritten blocks, sparse holes,
> > etc.) then I want to skip writing leaf1 of the merkle tree to disk.
> >
> > If it happens that the hashes of leaf[0-1] match hash(data3) then it's
> > frosting on top (as it were) that we can also skip internal0.  However,
> > the merkle tree has a high fanout factor (4096/32==128 in the common
> > case), so I care /much/ less about eliding those levels.
> >
> > > In struct merkle_tree_params, the "block size" is always the Merkle tree block
> > > size, so the type of block size seems clear in that context.  My complaint was
> > > just that it used the term "data block" to mean a block that is not necessarily
> > > a file contents block (which is what "data block" means elsewhere).
> >
> > Hm.  Given the confusion, would it help if I said that zero_digest
> > should only be used to elide leaf nodes of the merkle tree that hash the
> > contents of file content blocks?  Or is "the hash of an
> > i_blocksize-sized buffer of zeroes" sufficient?
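
[The elision test being proposed above can be sketched in a few lines of
Python.  This is a toy illustration, not fsverity's actual API: the helper
names (zero_digest, is_elidable) are hypothetical, and an unsalted SHA-256
stands in for the configured hash algorithm and salt.]

```python
import hashlib

def zero_digest(block_size: int) -> bytes:
    """Toy stand-in: digest of one block-sized buffer of zeroes (unsalted)."""
    return hashlib.sha256(b"\0" * block_size).digest()

def is_elidable(tree_block: bytes, block_size: int) -> bool:
    """True if a Merkle tree block is just zero_digest repeated, i.e. every
    block it covers is a run of zeroes (hole / unwritten extent)."""
    zd = zero_digest(block_size)
    fanout = block_size // len(zd)  # 4096 / 32 == 128 in the common case
    return tree_block == zd * fanout
```

[With a 4096-byte block size and SHA-256, leaf1 in the diagram above would
be elidable whenever every block it covers hashes a run of zeroes.]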
> >
> > What do you think of the commit message saying:
> >
> > "Compute the hash of one filesystem block's worth of zeroes.  Any merkle
> > tree leaf block containing only this hash can be elided at write time,
> > and its contents synthesized at read time.
> >
> > "Let's pretend that there's a file containing six data blocks and whose
> > merkle tree looks roughly like this:
> >
> > root
> > +--leaf0
> > |  +--data0
> > |  +--data1
> > |  `--data2
> > `--leaf1
> >    +--data3
> >    +--data4
> >    `--data5
> >
> > "If data[0-2] are sparse holes, then leaf0 will contain a repeating
> > sequence of @zero_digest.  Therefore, leaf0 need not be written to disk
> > because its contents can be synthesized."
> It sounds like you're assuming that the file data is always hashed in
> filesystem block sized units.

Ohh!  Yes, I was making that assumption, and now I double-checked enable.c
and see this:

	/* Hash each data block, also hashing the tree blocks as they fill up */
	for (offset = 0; offset < data_size; offset += params->block_size) {
		ssize_t bytes_read;
		loff_t pos = offset;

		buffers[-1].filled = min_t(u64, params->block_size,
					   data_size - offset);
		bytes_read = __kernel_read(filp, buffers[-1].data,
					   buffers[-1].filled, &pos);

So yes, you're right: @zero_digest is the hash of a *merkle tree
block-sized* buffer of zeroes.  And if ->write_merkle_tree_block sees that
the block is a repeating sequence of @zero_digest, it can skip writing
that block to disk, no matter where that block happens to be in the tree.

> block sized units.  That's not how it works.  The block size that's
> selected for fsverity (which is a power of 2 between 1024 and
> min(fs_block_size, PAGE_SIZE), inclusively) is used for both the data
> blocks and the Merkle tree blocks.
>
> This is intentional, so that people can e.g.
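
[The block-size rule Eric states above can be written out as a small
predicate.  This is a hypothetical helper mirroring the stated constraint,
not kernel code: the block size must be a power of 2 between 1024 and
min(fs_block_size, PAGE_SIZE), inclusive.]

```python
def merkle_block_size_ok(block_size: int, fs_block_size: int,
                         page_size: int = 4096) -> bool:
    """Check the fsverity block-size constraint as described above:
    a power of 2 in [1024, min(fs_block_size, PAGE_SIZE)]."""
    is_pow2 = block_size > 0 and (block_size & (block_size - 1)) == 0
    return is_pow2 and 1024 <= block_size <= min(fs_block_size, page_size)
```

[So with 4K pages, a 4096-byte Merkle block size is acceptable on both a
4K-block and a 16K-block filesystem, which is what makes the file digest
portable between them.]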
> calculate the fsverity digest of a
> file using a 4K block size, and deploy the file to both filesystems that use a
> 4K filesystem block size and filesystems that use a 16K filesystem block size,
> and get the same fsverity file digest each time.

Aha, yes, that makes more sense.  I had wondered if people actually copied
merkle tree data between filesystems.

> I've considered offering the ability to configure the data block size separately
> from the Merkle tree block size, like what dm-verity does.  This hasn't seemed
> useful, though.  And in any case, it should not be tied to the FS block size.
>
> A better way to think about things might be that the Merkle tree actually
> *includes* the data, as opposed to being separate from it.  In this respect,
> it's natural that the Merkle tree parameters including block size, hash
> algorithm, and salt apply both to blocks that contain file data and to blocks
> that contain hashes.  In general, all the fsverity code and documentation
> probably needs to be clearer about whether, when referring to the Merkle tree,
> it means just the hash blocks, or if it means the conceptual full tree that
> includes the file's data.

Yes, that clears things right up.  Thank you for correcting me. :)

--D

> - Eric
>