On Thu 09 May 2024 12:39:53 PM -04, Theodore Ts'o wrote; > On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote: >> >> It's looks like it's easy to trigger an infinite loop here using fstest >> generic/039. If I understand it correctly (which doesn't happen as often >> as I'd like), this is due to an integer overflow in the 'if' condition, >> and should be fixed with the patch below. > > Thanks for the report. However, I can't reproduce the failure, and > looking at generic/039, I don't see how it could be relevant to the > code path in question. Generic/039 creates a test symlink with two > hard links in the same directory, syncs the file system, and then > removes one of the hard links, and then drops access to the block > device using dmflakey. So I don't see how the extent code would be > involved at all. Are you sure that you have the correct test listed? Yep, I just retested and it's definitely generic/039. I'm using a simple test environment, with virtme-ng. > Looking at the code in question in fs/ext4/extents.c: > > again: > ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start, > hole_start + len - 1, &es); > if (!es.es_len) > goto insert_hole; > > * There's a delalloc extent in the hole, handle it if the delalloc > * extent is in front of, behind and straddle the queried range. > */ > - if (lblk >= es.es_lblk + es.es_len) { > + if (lblk >= ((__u64) es.es_lblk) + es.es_len) { > /* > * The delalloc extent is in front of the queried range, > * find again from the queried start block. > len -= lblk - hole_start; > hole_start = lblk; > goto again; > > lblk and es.es_lblk are both __u32. So the infinite loop is > presumably because es.es_lblk + es.es_len has overflowed. This should > never happen(tm), and in fact we have a test for this case which If I instrument the code, I can see that es.es_len is definitely set to EXT_MAX_BLOCKS, which will overflow. > *should* have gotten tripped when ext4_es_find_extent_range() calls > __es_tree_search() in fs/ext4/extents_status.c: > > static inline ext4_lblk_t ext4_es_end(struct extent_status *es) > { > BUG_ON(es->es_lblk + es->es_len < es->es_lblk); > return es->es_lblk + es->es_len - 1; > } > > So the patch is harmless, and I can see how it might fix what you were > seeing --- but I'm a bit nervous that I can't reproduce it and the > commit description claims that it reproduces easily; and we should > have never allowed the entry to have gotten introduced into the > extents status tree in the first place, and if it had been introduced, > it should have been caught before it was returned by > ext4_es_find_extent_range(). > > Can you give more details about the reproducer; can you double check > the test id, and how easily you can trigger the failure, and what is > the hardware you used to run the test? So, here's few more details that may clarify, and that I should have added to the commit description: When the test hangs, the test is blocked mounting the flakey device: mount -t ext4 -o acl,user_xattr /dev/mapper/flakey-test /mnt/scratch which will eventually call into ext4_ext_map_blocks(), triggering the bug. Also, some more code instrumentation shows that after the call to ext4_ext_find_hole(), the 'hole_start' will be set to '1' and 'len' to '0xfffffffe'. This '0xfffffffe' value is a bit odd, but it comes from the fact that, in ext4_ext_find_hole(), the call to ext4_ext_next_allocated_block() will return EXT_MAX_BLOCKS and 'len' will thus be set to 'EXT_MAX_BLOCKS - 1'. Does this make sense? Cheers, -- Luis