Re: e2scrub finds corruption immediately after mounting

"Theodore Ts'o" <tytso@xxxxxxx> · Wed, 3 Jan 2024 23:38:13 -0500

On Wed, Jan 03, 2024 at 04:14:36PM -0500, Brian J. Murrell wrote:
> I am trying to migrate from lvcheck
> (https://github.com/BryanKadzban/lvcheck) to using the officially
> supported e2scrub[_all] kit.

What distribution are you using, and what version of the kernel are
you using?  I note that you are using e2fsprogs 1.45.6, and Debian
Stable is shipping with e2fsprogs 1.47.0.

That being said, this is the first time I've seen a report of an
issue like the one you describe.

> # e2scrub /dev/rootvol_tmp/almalinux8_opt 
>   Logical volume "almalinux8_opt.e2scrub" created.
> e2fsck 1.45.6 (20-Mar-2020)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (86.9% non-contiguous), 482404/716800 blocks
> /dev/rootvol_tmp/almalinux8_opt: Scrub FAILED due to corruption!

This error means that e2fsck exited with a non-zero exit status,
which is strange because e2fsck's output does not report any kind of
problem.  From the e2scrub script:

check() {
	# First we recover the journal, then we see if e2fsck tries any
	# non-optimization repairs.  If either of these two returns a
	# non-zero status (errors fixed or remaining) then this fs is bad.
	E2FSCK_FIXES_ONLY=1
	export E2FSCK_FIXES_ONLY
	${DBG} "@root_sbindir@/e2fsck" -E journal_only -p ${e2fsck_opts} "${snap_dev}" || return $?
	${DBG} "@root_sbindir@/e2fsck" -f -y ${e2fsck_opts} "${snap_dev}"
}

...

check
case "$?" in
"0")
	# Clean check!
	echo "${arg}: Scrub succeeded."
  ...

"8")
	# Operational error, what now?
	echo "${arg}: e2fsck operational error."
  ...	

*)
	# fsck failed.  Check if the snapshot is invalid; if so, make a
	# note of that at the end of the log.  This isn't necessarily a
	# failure because the mounted fs could have overflowed the
	# snapshot with regular disk writes /or/ our repair process
	# could have done it by repairing too much.
	#
	# If it's really corrupt we ought to fsck at next boot.
	is_invalid="$(lvs -o lv_snapshot_invalid --noheadings "${snap_dev}" | awk '{print $1}')"
	if [ -n "${is_invalid}" ]; then
		echo "${arg}: Scrub FAILED due to invalid snapshot."
		ret=8
	else
		echo "${arg}: Scrub FAILED due to corruption!  Unmount and run e2fsck -y."
		mark_corrupt
		ret=6
	fi
	...
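
For reference, the e2fsck(8) man page documents the exit status as a
bit mask, so a status such as 12 means 4 and 8 together.  A minimal
sketch of a decoder follows; the function name describe_e2fsck_status
is made up for illustration and is not part of e2fsprogs.

```shell
# Decode an e2fsck exit status into the conditions listed in e2fsck(8).
# The status is a bit mask, so several conditions can be set at once.
# NOTE: describe_e2fsck_status is a hypothetical helper, not part of
# e2fsprogs or e2scrub.
describe_e2fsck_status() {
	status="$1"
	if [ "${status}" -eq 0 ]; then
		echo "0: no errors"
		return 0
	fi
	for entry in \
		"1:file system errors corrected" \
		"2:system should be rebooted" \
		"4:file system errors left uncorrected" \
		"8:operational error" \
		"16:usage or syntax error" \
		"32:checking canceled by user request" \
		"128:shared-library error"
	do
		bit="${entry%%:*}"	# number before the first colon
		text="${entry#*:}"	# description after the first colon
		if [ $(( status & bit )) -ne 0 ]; then
			echo "${bit}: ${text}"
		fi
	done
}
```

Note that, per the comment in check() above, e2scrub treats "errors
fixed" as a failure too, so a status of 1 is enough to land in the
"*)" branch.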

My best guess is that e2fsck from 1.45.6 is returning a non-zero
exit status for some reason.  So the first thing I'd suggest is
upgrading to e2fsprogs 1.47.0 and seeing whether that resolves the
problem.
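
It may also help to see which of the two e2fsck invocations in
check() returns non-zero, and with what status.  The following is
only a sketch: run_and_report is a made-up helper, and /dev/VG/SNAP
stands in for a snapshot you would have to create yourself.

```shell
# Echo a command, run it, and report its exit status.
# NOTE: run_and_report is a hypothetical helper for illustration.
run_and_report() {
	echo "+ $*"
	"$@"
	rc=$?
	echo "exit status: ${rc}"
	return "${rc}"
}

# Hypothetical usage, as root, mirroring the two steps in check().
# /dev/VG/SNAP is a placeholder for your own snapshot device:
#   E2FSCK_FIXES_ONLY=1; export E2FSCK_FIXES_ONLY
#   run_and_report e2fsck -E journal_only -p /dev/VG/SNAP
#   run_and_report e2fsck -f -y /dev/VG/SNAP
```

Running the two steps separately should show whether the journal-only
pass or the full check is the one returning non-zero.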

Cheers,

						- Ted