Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()

David Hildenbrand <david@xxxxxxxxxx> · Tue, 12 Mar 2024 16:13:12 +0100

On 11.03.24 23:38, Darrick J. Wong wrote:
[add willy and linux-mm]

On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
BTW, is xfs_repair planned to do anything about any such extra blocks?

Sorry to answer your question with a question, but how much checking is
$filesystem expected to do for merkle trees?

In theory xfs_repair could learn how to interpret the verity descriptor,
walk the merkle tree blocks, and even read the file data to confirm
intactness.  If the descriptor specifies the highest block address then
we could certainly trim off excess blocks.  But I don't know how much of
libfsverity actually lets you do that; I haven't looked into that
deeply. :/

For xfs_scrub I guess the job is theoretically simpler, since we only
need to stream reads of the verity files through the page cache and let
verity tell us if the file data are consistent.

For both tools, if something finds errors in the merkle tree structure
itself, do we turn off verity?  Or do we do something nasty like
truncate the file?

As far as I know (I haven't been following btrfs-progs, but I'm familiar with
e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
validating the data of verity inodes against their Merkle trees.

e2fsck does delete the verity metadata of inodes that don't have the verity flag
enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.

I suppose that ideally, if an inode's verity metadata is invalid, then fsck
should delete that inode's verity metadata and remove the verity flag from the
inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
fairly straightforward, but it probably wouldn't catch much compared to actually
validating the data against the Merkle tree.  And actually validating the data
against the Merkle tree would be complex and expensive.  Note, none of this
would work on files that are encrypted.

Re: libfsverity, I think it would be possible to validate a Merkle tree using
libfsverity_compute_digest() and the callbacks that it supports.  But that's not
quite what it was designed for.

Is there an ioctl or something that allows userspace to validate an
entire file's contents?  Sort of like what BLKVERIFY would have done for
block devices, except that we might believe its answers?

Just reading the whole file and seeing whether you get an error would do it.

Though if you want to make sure it's really re-reading the on-disk data, it's
necessary to drop the file's pagecache first.

I tried a straight pagecache read and it worked like a charm!

But then I thought to myself, do I really want to waste memory bandwidth
copying a bunch of data?  No.  I don't even want to incur system call
overhead from reading a single byte every $pagesize bytes.

So I created 2M mmap areas and read a byte every $pagesize bytes.  That
worked too, insofar as SIGBUSes are annoying to handle.  But it's
annoying to take signals like that.

Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
like what I wanted -- it prefaults in the pages, and "If populating
fails, a SIGBUS signal is not generated; instead, an error is returned."

Yes, these were the expected semantics :)

But then I tried rigging up a test to see if I could catch an EIO, and
instead I had to SIGKILL the process!  It looks filemap_fault returns
VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
__do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
the VM_FAULT_RETRY is translated to -EBUSY.

__get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
that to madvise_populate.  Unfortunately, madvise_populate increments
its loop counter by the return value (still 0) so it runs in an
infinite loop.  The only way out is SIGKILL.

That's certainly unexpected. One user I know is QEMU, which primarily 
uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is 
primarily used with shmem/hugetlb, where I haven't heard of any such 
endless loops.

So I don't know what the correct behavior is here, other than the
infinite loop seems pretty suspect.  Is it the correct behavior that
madvise_populate returns EIO if __get_user_pages ever returns zero?
That doesn't quite sound right if it's the case that a zero return could
also happen if memory is tight.

madvise_populate() ends up calling faultin_vma_page_range() in a loop. 
That one calls __get_user_pages().

__get_user_pages() documents: "0 return value is possible when the fault 
would need to be retried."

So that's what the caller does. IIRC, there are cases where we really 
have to retry (at least once) and will make progress, so treating "0" as 
an error would be wrong.

Staring at other __get_user_pages() users, __get_user_pages_locked() 
documents: "Please note that this function, unlike __get_user_pages(), 
will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".

But there is some elaborate retry logic in there, whereby the retry will 
set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second 
retry attempt (there are cases where we retry more often, but that's 
related to something else I believe).

So maybe we need a similar retry logic in faultin_vma_page_range()? Or 
make it use __get_user_pages_locked(), but I recall when I introduced 
MADV_POPULATE_READ, there was a catch to it.

--
Cheers,

David / dhildenb