On Sun, Sep 03, 2023 at 08:00:01PM +0800, Zorro Lang wrote:
> Hi ext4 folks,
>
> Recently I found lots of fstests cases which belong to "recoveryloop" (e.g.
> g/388 [1], g/455 [2], g/475 [3] and g/482 [4]) or does fs shutdown/resize test
> (e.g. ext4/059 [5], g/530 [6]) failed ext4 with 1k blocksize, the kernel is
> linux v6.6-rc0+ (HEAD=b84acc11b1c9).
>
> I tested with MKFS_OPTIONS="-b 1024", no specific MOUNT_OPTIONS. I hit these
> failure several times, and I didn't hit them on my last regression test on
> v6.5-rc7+. So I think this might be a regression problem. And I didn't hit
> this failures on xfs. If this's a known issue will be fixed soon, feel free
> to tell me.

TL;DR: there definitely seems to be something going on with g/455 and
g/482 with the ext4/1k blocksize case in Linus's latest upstream tree,
although it wasn't there in the ext4 branch which I sent to Linus to
pull.

Unfortunately, generic/475 is a known failure, especially in the 1k
block size case. The rate seems to change a bit over time. For
example, from 6.2:

ext4/1k: 522 tests, 2 failures, 45 skipped, 6153 seconds
  Flaky: generic/051: 40% (2/5)   generic/475: 60% (3/5)

and from 6.1.0-rc4:

ext4/1k: 522 tests, 2 failures, 45 skipped, 5660 seconds
  Flaky: generic/051: 60% (3/5)   generic/475: 40% (2/5)

In 6.5-rc3, it looks like the rate has gotten worse:

ext4/1k: 30 tests, 29 failures, 2402 seconds
  Flaky: generic/475: 97% (29/30)

Alas, finding a root cause for generic/475 has been challenging. I
suspect that it happens when we crash while doing a large truncate on
a highly fragmented file system, such that the truncate has to span
multiple transactions, with the inode on the orphan list so the kernel
can complete the truncate if we crash mid-truncate when we clean up
the orphan list. However, that's just a theory, and I don't yet have
hard evidence.

The generic/388 test is very different. It uses the shutdown ioctl,
and that's something that ext4 has never completely handled correctly.
Doing it right would require adding some locks in hot paths, so it's
one which I've suppressed for all of my ext4 tests[1].

[1] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/exclude

The generic/455 and generic/482 tests work by using dm-log-writes, and
that was *not* failing on the branch (v6.5.0-rc3-60-g768d612f7982) for
which I sent a pull request to Linus:

ext4/1k: 10 tests, 63 seconds
  generic/455  Pass     4s
  generic/482  Pass     8s
  generic/455  Pass     5s
  generic/482  Pass     8s
  generic/455  Pass     5s
  generic/482  Pass     7s
  generic/455  Pass     5s
  generic/482  Pass     8s
  generic/455  Pass     5s
  generic/482  Pass     8s
Totals: 10 tests, 0 skipped, 0 failures, 0 errors, 63s

... but I can confirm that it's failing on Linus's upstream (I tested
commit 708283abf896):

ext4/1k: 2 tests, 2 failures, 31 seconds
  generic/455  Failed   4s
  generic/455  Failed   2s
  generic/455  Pass     5s
  generic/455  Failed   3s
  generic/455  Failed   2s
  generic/482  Failed   2s
  generic/482  Failed   3s
  generic/482  Failed   1s
  generic/482  Failed   3s
  generic/482  Failed   4s
Totals: 10 tests, 0 skipped, 9 failures, 0 errors, 29s

						- Ted

P.S. After doing some digging, it appears generic/455 does have some
failures on 6.4 (20%) and 6.5-rc3 (5%) on the ext4/1k blocksize test
config. But *something* is definitely causing a much greater failure
rate in Linus's upstream. The good news is that should make it
relatively easy to bisect. I'll look into it. Thanks for flagging
this.
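
Since generic/455 already fails 5% to 20% of the time on the older
kernels, a single failing run can't mark a commit as bad during the
bisect. A minimal sketch of one way to handle that is to run the test
several times per commit and only call the commit bad when the failure
rate is well above the background rate. The script below is an
illustration only, not the tooling actually used here: RUNS,
BAD_THRESHOLD and the function names are hypothetical, and the test
command is whatever is passed on the command line. The only external
behaviour relied on is that `git bisect run` treats exit status 0 as
good and 1 as bad.

```python
import subprocess
import sys

# Hypothetical values; tune them for the test and config being bisected.
RUNS = 10            # how many times to repeat the test on each commit
BAD_THRESHOLD = 0.5  # chosen to sit well above the 5-20% background rate


def failure_rate(results):
    """Fraction of failed runs; results is a list of booleans (True = pass)."""
    if not results:
        raise ValueError("no test runs recorded")
    return sum(1 for ok in results if not ok) / len(results)


def bisect_exit_code(results, threshold=BAD_THRESHOLD):
    """Map run results to a `git bisect run` status: 0 = good, 1 = bad."""
    return 1 if failure_rate(results) >= threshold else 0


def run_once(cmd):
    """Run the test command once; a zero exit status counts as a pass."""
    return subprocess.run(cmd).returncode == 0


def main(cmd):
    """Run cmd RUNS times and return the status to hand back to git bisect."""
    results = [run_once(cmd) for _ in range(RUNS)]
    return bisect_exit_code(results)


# Only act when a test command was supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Saved under a hypothetical name such as flaky_bisect.py, it would be
used as `git bisect run python3 flaky_bisect.py <test command>`, where
the test command has to build and boot the kernel under test and exit
non-zero when the test fails. With the numbers above, 9 failures in 10
runs is classified as bad, and 0 in 10 or 1 in 5 as good.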