Re: [fstests generic/388, 455, 475, 482 ...] Ext4 journal recovery test fails

"Theodore Ts'o" <tytso@xxxxxxx> · Sun, 3 Sep 2023 16:40:01 -0400

On Sun, Sep 03, 2023 at 08:00:01PM +0800, Zorro Lang wrote:
> Hi ext4 folks,
> 
> Recently I found lots of fstests cases which belong to "recoveryloop" (e.g.
> g/388 [1], g/455 [2], g/475 [3] and g/482 [4]) or does fs shutdown/resize test
> (e.g. ext4/059 [5], g/530 [6]) failed ext4 with 1k blocksize, the kernel is
> linux v6.6-rc0+ (HEAD=b84acc11b1c9).
> 
> I tested with MKFS_OPTIONS="-b 1024", no specific MOUNT_OPTIONS. I hit these
> failure several times, and I didn't hit them on my last regression test on
> v6.5-rc7+. So I think this might be a regression problem. And I didn't hit
> this failures on xfs. If this's a known issue will be fixed soon, feel free
> to tell me.

TL;DR: there definitely seems to be something going on with g/455 and
g/482 with the ext4/1k blocksize case in Linus's latest upstream tree,
although it wasn't there in the ext4 branch which I sent to Linus to
pull.

Unfortunately, generic/475 is a known failure, especially in the 1k
block size case.  The failure rate seems to change a bit over time.
For example, from 6.2:

ext4/1k: 522 tests, 2 failures, 45 skipped, 6153 seconds
  Flaky: generic/051: 40% (2/5)   generic/475: 60% (3/5)

and from 6.1.0-rc4:

ext4/1k: 522 tests, 2 failures, 45 skipped, 5660 seconds
  Flaky: generic/051: 60% (3/5)   generic/475: 40% (2/5)

In 6.5-rc3, it looks like the rate has gotten worse:

ext4/1k: 30 tests, 29 failures, 2402 seconds
  Flaky: generic/475: 97% (29/30)

Alas, finding a root cause for generic/475 has been challenging.  I
suspect that it happens when we crash while doing a large truncate on
a highly fragmented file system, such that the truncate has to span
multiple transactions, with the inode on the orphan list so the kernel
can complete the truncate when we clean up the orphan list if we crash
mid-truncate.  However, that's just a theory, and I don't yet have
hard evidence.

The generic/388 test is very different.  It uses the shutdown ioctl,
and that's something that ext4 has never completely handled correctly.
Doing it right would require adding some locks in hot paths, so it's
one which I've suppressed for all of my ext4 tests[1].

[1] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/exclude

The generic/455 and generic/482 tests work by using dm-log-writes, and
they were *not* failing on the branch (v6.5.0-rc3-60-g768d612f7982) for
which I sent a pull request to Linus:

ext4/1k: 10 tests, 63 seconds
  generic/455  Pass     4s
  generic/482  Pass     8s
  generic/455  Pass     5s
  generic/482  Pass     8s
  generic/455  Pass     5s
  generic/482  Pass     7s
  generic/455  Pass     5s
  generic/482  Pass     8s
  generic/455  Pass     5s
  generic/482  Pass     8s
Totals: 10 tests, 0 skipped, 0 failures, 0 errors, 63s

... but I can confirm that they are failing on Linus's upstream (I tested
commit 708283abf896):

ext4/1k: 2 tests, 2 failures, 31 seconds
  generic/455  Failed   4s
  generic/455  Failed   2s
  generic/455  Pass     5s
  generic/455  Failed   3s
  generic/455  Failed   2s
  generic/482  Failed   2s
  generic/482  Failed   3s
  generic/482  Failed   1s
  generic/482  Failed   3s
  generic/482  Failed   4s
Totals: 10 tests, 0 skipped, 9 failures, 0 errors, 29s
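
For what it's worth, the per-test rates can be recomputed from the
individual result lines.  Here is a minimal sketch in Python; the
function name and the assumed line layout ("<test> <Pass|Failed> <N>s")
are only for illustration, based on the output above, and are not part
of the test appliance tooling:

```python
from collections import defaultdict


def failure_rates(lines):
    """Return {test: (failures, runs)} from 'generic/455  Failed   4s'-style lines."""
    counts = defaultdict(lambda: [0, 0])  # test name -> [failures, runs]
    for line in lines:
        fields = line.split()
        # Skip summary lines; keep only '<test> <Pass|Failed> <time>'.
        if len(fields) != 3 or fields[1] not in ("Pass", "Failed"):
            continue
        test, status, _elapsed = fields
        counts[test][1] += 1
        if status == "Failed":
            counts[test][0] += 1
    return {test: tuple(c) for test, c in counts.items()}


# The result lines from the run on commit 708283abf896, copied from above.
RESULTS = """\
ext4/1k: 2 tests, 2 failures, 31 seconds
  generic/455  Failed   4s
  generic/455  Failed   2s
  generic/455  Pass     5s
  generic/455  Failed   3s
  generic/455  Failed   2s
  generic/482  Failed   2s
  generic/482  Failed   3s
  generic/482  Failed   1s
  generic/482  Failed   3s
  generic/482  Failed   4s
Totals: 10 tests, 0 skipped, 9 failures, 0 errors, 29s
""".splitlines()

if __name__ == "__main__":
    for test, (failed, runs) in sorted(failure_rates(RESULTS).items()):
        print(f"{test}: {100 * failed // runs}% ({failed}/{runs})")
```

For the run above this gives generic/455 at 80% (4/5) and generic/482
at 100% (5/5).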

						- Ted

P.S.  After doing some digging, it appears generic/455 does have some
failures on 6.4 (20%) and 6.5-rc3 (5%) on the ext4/1k blocksize test
config.  But *something* is definitely causing a much greater failure
rate in Linus's upstream.  The good news is that this should make it
relatively easy to bisect.  I'll look into it.  Thanks for flagging
this.
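
As a rough back-of-the-envelope illustration of why the higher failure
rate helps when bisecting a flaky test, here is a small Python sketch.
It assumes that runs are independent, takes the 5% (6.5-rc3) and 80%
(4/5 on Linus's tree) figures from the small samples above at face
value, and uses a hypothetical "3 or more failures out of 5 runs"
threshold for calling a kernel bad; none of this is an established
procedure:

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail: P(at least k failures in n independent runs at rate p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    runs, threshold = 5, 3  # hypothetical: "bad" means >= 3 failures in 5 runs
    for label, rate in (("baseline (5%)", 0.05), ("regressed (80%)", 0.80)):
        prob = prob_at_least(threshold, runs, rate)
        print(f"{label}: P(>= {threshold}/{runs} failures) = {prob:.4f}")
```

Under those assumptions a regressed kernel crosses the threshold about
94% of the time, while a kernel with only the baseline flakiness does
so about 0.1% of the time, so a handful of runs per bisect step should
usually be enough to tell the two apart.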