On Tue, Oct 11, 2022 at 10:29:32AM +1100, Dave Chinner wrote:
> On Mon, Oct 10, 2022 at 03:40:51PM +0000, Darrick Wong wrote:
> > LGTM, want to send this to the upstream list to start that
> > discussion?

UGH, so I thought this was an internal thread, but it turns out that
linux-xfs has been cc'd for a while but none of the messages made it
to lore. I'll fill in some missing context below.

> > --D
> >
> > ________________________________________
> > From: Srikanth C S <srikanth.c.s@xxxxxxxxxx>
> > Sent: Monday, October 10, 2022 08:24
> > To: linux-xfs@xxxxxxxxxxxxxxx; Darrick Wong
> > Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
> > Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
> >
> > fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> > possible that when the machine crashes, the fs is in an
> > inconsistent state with the journal log not yet replayed. This can
> > put the machine into a rescue shell. To address this problem,
> > mount and umount the fs before running xfs_repair.
>
> What's the purpose of forcing xfs_repair to be run on every boot?
> The whole point of having a journalling filesystem is to avoid
> needing to run fsck on every boot.

I don't think repair-at-every-boot is the overall goal for our
customer base here. We've had some <cough> major support events over
the last few months. There are two things going on: some of the
problems are due to our datacenters abending, and the rest of it is
the pile of data corruption problems that you and I and Chandan have
been working through for months now.

This means that customer support has 30,000 VMs to reboot. 90% of the
systems involved seem to have survived more or less intact, but that
leaves 10% of them with latent errors, unmountable root filesystems,
etc. They probably have even more than that, but I don't recommend
inviting the G-men for a visit to find out the real sum.

Since these machines are remotely manageable, support /can/ inject
the kernel command line with 'fsck.mode=force' to kick off xfs_repair
if the machine won't come up or if they suspect there might be deeper
issues with latent errors in the fs metadata, which is what they did
to try to get everyone running ASAP while anticipating any future
problems...

> I get why one might want to occasionally force a repair check on
> boot (e.g. to repair a problem with the root filesystem), but this
> is a -rescue operation- and really shouldn't be occurring
> automatically on every boot or after a kernel crash.
>
> If it is only occurring during rescue operations, then why is it a
> problem dumping out to a shell for the admin performing rescue
> operations to deal with this directly? e.g. if the fs has a
> corrupted journal, then a mount cycle will not fix the problem and
> the admin will still get dumped into a rescue shell to fix the
> problem manually.

...however, most of those filesystems in the abended datacenter had
dirty logs, so repair failed and dumped all those machines to the
emergency shell. 3000 machines * 15 minutes per ticket is a lot of
downtime and a lot of manual labor. Support would really like a means
to automate as much of this as they can.

They had assumed that fsck.repair=yes would DTRT to try to recover
and/or repair the fs, and were surprised to discover that this script
would not even try to recover the log.
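(Aside, for anyone who hasn't internalized the plumbing here:
systemd-fsck turns fsck.mode=force on the kernel command line into
'fsck -f' and fsck.repair=yes into 'fsck -y', and fsck.xfs maps those
flags onto the $FORCE and $AUTO variables that show up in the patch
below. Going from memory, the option parsing at the top of
xfs_fsck.sh is roughly this, so double-check me on the exact getopts
string before relying on it:

	AUTO=false
	FORCE=false
	while getopts ":aApyf" c
	do
		case $c in
		a|A|p|y)	AUTO=true;;	# -a/-p/-y from fsck.repair=preen/yes
		f)		FORCE=true;;	# -f from fsck.mode=force
		esac
	done

So fsck.mode=force by itself only gets you $FORCE; you need
fsck.repair=yes or preen to also get $AUTO.)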
> Hence I don't really know why anyone would be configuring their
> systems like this:

Some of the machines are VM hypervisors, where crashes due to the
rootfs going offline unexpectedly create unplanned downtime for a lot
of machines. It would be less disruptive to our cloud fleet overall
for a hypervisor to take a little longer to boot if it's running
fsck, since the rootfs is the bare minimum needed to get the smartnic
started. For this case it might be preferable to configure this
permanently, since there aren't /that/ many VM hypervisors. Nobody
else gets fsck.mode=force in the grub config.

> > Run xfs_repair -e when fsck.mode=force and repair=auto or yes.

(If it were me writing the patch, I'd have made repair=auto detect a
dirty log and continue the boot without running repair at all; rough
sketch in the P.S. below...)

> as it makes no sense at all for a journalling filesystem.
>
> > If fsck.mode=force and fsck.repair=no, run xfs_repair -n without
> > replaying the logs.
>
> Nor is it clear why anyone would want to force a boot time fsck and
> then not repair the damage that might be found....

That part's my fault; I suggested that we should fix the script so
that repair=no selects dry run mode like you might expect.

> More explanation, please!

Frankly, I /don't/ want to expend a lot of time wringing our hands
over how exactly we hammer 2022 XFS tools into 1994 ext2 behavioral
semantics. What they really want is online fsck to detect and fix
problems in real time, but I can't seem to engage the community on
how exactly we land this thing now that I've finished writing it.

--D

> > Signed-off-by: Srikanth C S <srikanth.c.s@xxxxxxxxxx>
> > ---
> >  fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> > index 6af0f22..21a8c19 100755
> > --- a/fsck/xfs_fsck.sh
> > +++ b/fsck/xfs_fsck.sh
> > @@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
> >  fi
> >
> >  if $FORCE; then
> > -	xfs_repair -e $DEV
> > -	repair2fsck_code $?
> > +	if $AUTO; then
> > +		xfs_repair -e $DEV
> > +		error=$?
> > +		if [ $error -eq 2 ]; then
> > +			echo "Replaying log for $DEV"
> > +			mkdir -p /tmp/tmp_mnt
> > +			mount $DEV /tmp/tmp_mnt
> > +			umount /tmp/tmp_mnt
> > +			xfs_repair -e $DEV
> > +			error=$?
> > +			rmdir /tmp/tmp_mnt
> > +		fi
> > +	else
> > +		#fsck.mode=force is set but fsck.repair=no
> > +		xfs_repair -n $DEV
> > +		error=$?
> > +	fi
> > +	repair2fsck_code $error
> >  	exit $?
> >  fi

> As a side note, the patch has damaged whitespace....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
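P.S. Since I mentioned it above, here's a rough (and untested, so
treat it as a sketch, not a patch) version of what I mean by
"repair=auto detects a dirty log and continues the boot". It keys off
the same thing the patch does (xfs_repair exits with status 2 when it
finds a dirty log), but instead of mount cycling, it exits 0 so
systemd keeps booting and log recovery happens on the real rootfs
mount:

	if $FORCE; then
		if $AUTO; then
			xfs_repair -e $DEV
			error=$?
			if [ $error -eq 2 ]; then
				# Dirty log: don't repair; the kernel will
				# replay the log when the real mount happens.
				echo "$DEV has a dirty log, deferring to mount"
				exit 0
			fi
		else
			# fsck.mode=force but fsck.repair=no: dry run only
			xfs_repair -n $DEV
			error=$?
		fi
		repair2fsck_code $error
		exit $?
	fi

Whether exit status 0 (vs. 4) is the right thing to report for a
dirty log is debatable, but that's the general shape of it.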