On 2024/5/10 17:41, Luis Henriques wrote: > On Fri 10 May 2024 11:39:48 AM +08, Zhang Yi wrote; > >> On 2024/5/10 1:23, Luis Henriques wrote: >>> On Thu 09 May 2024 12:39:53 PM -04, Theodore Ts'o wrote; >>> >>>> On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote: >>>>> >>>>> It's looks like it's easy to trigger an infinite loop here using fstest >>>>> generic/039. If I understand it correctly (which doesn't happen as often >>>>> as I'd like), this is due to an integer overflow in the 'if' condition, >>>>> and should be fixed with the patch below. >>>> >>>> Thanks for the report. However, I can't reproduce the failure, and >>>> looking at generic/039, I don't see how it could be relevant to the >>>> code path in question. Generic/039 creates a test symlink with two >>>> hard links in the same directory, syncs the file system, and then >>>> removes one of the hard links, and then drops access to the block >>>> device using dmflakey. So I don't see how the extent code would be >>>> involved at all. Are you sure that you have the correct test listed? >>> >>> Yep, I just retested and it's definitely generic/039. I'm using a simple >>> test environment, with virtme-ng. >>> >>>> Looking at the code in question in fs/ext4/extents.c: >>>> >>>> again: >>>> ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start, >>>> hole_start + len - 1, &es); >>>> if (!es.es_len) >>>> goto insert_hole; >>>> >>>> * There's a delalloc extent in the hole, handle it if the delalloc >>>> * extent is in front of, behind and straddle the queried range. >>>> */ >>>> - if (lblk >= es.es_lblk + es.es_len) { >>>> + if (lblk >= ((__u64) es.es_lblk) + es.es_len) { >>>> /* >>>> * The delalloc extent is in front of the queried range, >>>> * find again from the queried start block. >>>> len -= lblk - hole_start; >>>> hole_start = lblk; >>>> goto again; >>>> >>>> lblk and es.es_lblk are both __u32. So the infinite loop is >>>> presumably because es.es_lblk + es.es_len has overflowed. This should >>>> never happen(tm), and in fact we have a test for this case which >>> >>> If I instrument the code, I can see that es.es_len is definitely set to >>> EXT_MAX_BLOCKS, which will overflow. >>> >> >> Thanks for the report. After looking at the code, I think the root >> cause of this issue is the variable es was not initialized on replaying >> fast commit. ext4_es_find_extent_range() will return directly when >> EXT4_FC_REPLAY flag is set, and then the es.len becomes stall. >> >> I can always reproduce this issue on generic/039 with >> MKFS_OPTIONS="-O fast_commit". >> >> This uninitialization problem originally existed in the old >> ext4_ext_put_gap_in_cache(), but it didn't trigger any real problem >> since we never check and use extent cache when replaying fast commit. >> So I suppose the correct fix would be to unconditionally initialize >> the es variable. > > Oh, you're absolutely right -- the extent_status 'es' struct isn't being > initialized in that case. I totally failed to see that. And yes, I also > failed to mention I had 'fast_commit' feature enabled, sorry! > > Thanks a lot for figuring this out, Yi. I'm looking at this code and > trying to understand if it would be safe to call __es_find_extent_range() > when EXT4_FC_REPLAY is in progress. Probably not, and probably better to > simply do: > > es->es_lblk = es->es_len = es->es_pblk = 0; > > in that case. I'll send out a patch later today. > Yeah, I'm glad it could help. Thanks, Yi.