On Fri 10 May 2024 11:39:48 AM +08, Zhang Yi wrote; > On 2024/5/10 1:23, Luis Henriques wrote: >> On Thu 09 May 2024 12:39:53 PM -04, Theodore Ts'o wrote; >> >>> On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote: >>>> >>>> It's looks like it's easy to trigger an infinite loop here using fstest >>>> generic/039. If I understand it correctly (which doesn't happen as often >>>> as I'd like), this is due to an integer overflow in the 'if' condition, >>>> and should be fixed with the patch below. >>> >>> Thanks for the report. However, I can't reproduce the failure, and >>> looking at generic/039, I don't see how it could be relevant to the >>> code path in question. Generic/039 creates a test symlink with two >>> hard links in the same directory, syncs the file system, and then >>> removes one of the hard links, and then drops access to the block >>> device using dmflakey. So I don't see how the extent code would be >>> involved at all. Are you sure that you have the correct test listed? >> >> Yep, I just retested and it's definitely generic/039. I'm using a simple >> test environment, with virtme-ng. >> >>> Looking at the code in question in fs/ext4/extents.c: >>> >>> again: >>> ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start, >>> hole_start + len - 1, &es); >>> if (!es.es_len) >>> goto insert_hole; >>> >>> * There's a delalloc extent in the hole, handle it if the delalloc >>> * extent is in front of, behind and straddle the queried range. >>> */ >>> - if (lblk >= es.es_lblk + es.es_len) { >>> + if (lblk >= ((__u64) es.es_lblk) + es.es_len) { >>> /* >>> * The delalloc extent is in front of the queried range, >>> * find again from the queried start block. >>> len -= lblk - hole_start; >>> hole_start = lblk; >>> goto again; >>> >>> lblk and es.es_lblk are both __u32. So the infinite loop is >>> presumably because es.es_lblk + es.es_len has overflowed. This should >>> never happen(tm), and in fact we have a test for this case which >> >> If I instrument the code, I can see that es.es_len is definitely set to >> EXT_MAX_BLOCKS, which will overflow. >> > > Thanks for the report. After looking at the code, I think the root > cause of this issue is the variable es was not initialized on replaying > fast commit. ext4_es_find_extent_range() will return directly when > EXT4_FC_REPLAY flag is set, and then the es.len becomes stall. > > I can always reproduce this issue on generic/039 with > MKFS_OPTIONS="-O fast_commit". > > This uninitialization problem originally existed in the old > ext4_ext_put_gap_in_cache(), but it didn't trigger any real problem > since we never check and use extent cache when replaying fast commit. > So I suppose the correct fix would be to unconditionally initialize > the es variable. Oh, you're absolutely right -- the extent_status 'es' struct isn't being initialized in that case. I totally failed to see that. And yes, I also failed to mention I had 'fast_commit' feature enabled, sorry! Thanks a lot for figuring this out, Yi. I'm looking at this code and trying to understand if it would be safe to call __es_find_extent_range() when EXT4_FC_REPLAY is in progress. Probably not, and probably better to simply do: es->es_lblk = es->es_len = es->es_pblk = 0; in that case. I'll send out a patch later today. Cheers, -- Luis