Re: [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks()

"Theodore Ts'o" <tytso@xxxxxxx> · Thu, 9 May 2024 12:39:53 -0400

On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote:
> 
> It's looks like it's easy to trigger an infinite loop here using fstest
> generic/039.  If I understand it correctly (which doesn't happen as often
> as I'd like), this is due to an integer overflow in the 'if' condition,
> and should be fixed with the patch below.

Thanks for the report.  However, I can't reproduce the failure, and
looking at generic/039, I don't see how it could be relevant to the
code path in question.  Generic/039 creates a test symlink with two
hard links in the same directory, syncs the file system, and then
removes one of the hard links, and then drops access to the block
device using dmflakey.  So I don't see how the extent code would be
involved at all.  Are you sure that you have the correct test listed?

Looking at the code in question in fs/ext4/extents.c:

again:
	ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
				  hole_start + len - 1, &es);
	if (!es.es_len)
		goto insert_hole;

  	 * There's a delalloc extent in the hole, handle it if the delalloc
  	 * extent is in front of, behind and straddle the queried range.
  	 */
 -	if (lblk >= es.es_lblk + es.es_len) {
 +	if (lblk >= ((__u64) es.es_lblk) + es.es_len) {
  		/*
  		 * The delalloc extent is in front of the queried range,
  		 * find again from the queried start block.
		len -= lblk - hole_start;
		hole_start = lblk;
		goto again;

lblk and es.es_lblk are both __u32.  So the infinite loop is
presumably because es.es_lblk + es.es_len has overflowed.  This should
never happen(tm), and in fact we have a test for this case which
*should* have gotten tripped when ext4_es_find_extent_range() calls
__es_tree_search() in fs/ext4/extents_status.c:

static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
{
	BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
	return es->es_lblk + es->es_len - 1;
}

So the patch is harmless, and I can see how it might fix what you were
seeing --- but I'm a bit nervous that I can't reproduce it and the
commit description claims that it reproduces easily; and we should
have never allowed the entry to have gotten introduced into the
extents status tree in the first place, and if it had been introduced,
it should have been caught before it was returned by
ext4_es_find_extent_range().

Can you give more details about the reproducer; can you double check
the test id, and how easily you can trigger the failure, and what is
the hardware you used to run the test?

Many thanks,

					- Ted