Re: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 11 Oct 2022 15:26:02 +1100

On Mon, Oct 10, 2022 at 05:30:48PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 11, 2022 at 10:29:32AM +1100, Dave Chinner wrote:
> > On Mon, Oct 10, 2022 at 03:40:51PM +0000, Darrick Wong wrote:
> > > LGTM, want to send this to the upstream list to start that discussion?
> 
> UGH, so I thought this was an internal thread, but it turns out that
> linux-xfs has been cc'd for a while but none of the messages made it to
> lore.
> 
> I'll fill in some missing context below.
> 
> > > --D
> > > 
> > > ________________________________________
> > > From: Srikanth C S <srikanth.c.s@xxxxxxxxxx>
> > > Sent: Monday, October 10, 2022 08:24
> > > To: linux-xfs@xxxxxxxxxxxxxxx; Darrick Wong
> > > Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
> > > Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
> > > 
> > > fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> > > possible that when the machine crashes, the fs is in inconsistent
> > > state with the journal log not yet replayed. This can put the
> > > machine into rescue shell. To address this problem, mount and
> > > umount the fs before running xfs_repair.
> > 
> > What's the purpose of forcing xfs_repair to be run on every boot?
> > The whole point of having a journalling filesystem is to avoid
> > needing to run fsck on every boot.
> 
> I don't think repair-at-every-boot here is the overall goal for our
> customer base.
> 
> We've had some <cough> major support events over the last few months.
> There are two things going on here: some of the problems are due to are
> datacenters abending, and the rest of it are the pile of data corruption

/me has to look up what "abending" means in this context, because I
don't think you mean that the data denter was "Absent By Enforced
Net Deprivation"...

Ah, it's an ancient abbreviation from the early days of IBM
mainframes meaning "abnormal end of task". i.e. something crashed.

> problems that you and I and Chandan have been working through for months
> now.
> 
> This means that customer support has 30,000 VMs to reboot.  90% of the
> systems involved seem to have survived more or less intact, but that
> leaves 10% of them with latent errors, unmountable root filesystems,
> etc.  They probably have even more than that, but I don't recommend
> inviting the G-men for a visit to find out the real sum.
> 
> Since these machines are remotely manageable, support /can/ inject the
> kernel command line with 'fsck.mode=force' to kick off xfs_repair if the
> machine won't come up or if they suspect there might be deeper issues
> with latent errors in the fs metadata, which is what they did to try to
> get everyone running ASAP while anticipating any future problems...

This context is kinda important :)

[....]

> > More explanation, please!
> 
> Frankly, I /don't/ want to expend a lot of time wringing our hands over
> how exactly do we hammer 2022 XFS tools into 1994 ext2 behavioral
> semantics.

Neither do I - I just want to understand the problem that the
change is trying to solve. The commit message gave no indication of
why the change was being proposed; commit messages need to tell the
whole story, not describe the code change being made.  If it started
like:

"After a recent data center crash, we recently had to recover root
filesystems on several thousand VMs via a boot time fsck. Many of
them failed unnecessarily because.... "

then it becomes self evident that replaying the log in these
situations is necessary.

> What they really want is online fsck to detect and fix problems in real
> time, but I can't seem to engage the community on how exactly do we land
> this thing now that I've finished writing it.

I only want to look at the interfacing with the runtime structures
and locking to ensure we don't back ourselves into a nasty corner.
This is really only a small part of the bigger online repair
patchset.  Once we've got that sorted, I think we should just commit
the rest of it as it stands (i.e. tested but unreviewed) and fix as
needed, just like we do with the userspace xfs_repair code...

> --D
> 
> > > Signed-off-by: Srikanth C S <srikanth.c.s@xxxxxxxxxx>
> > > ---
> > >  fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
> > >  1 file changed, 18 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> > > index 6af0f22..21a8c19 100755
> > > --- a/fsck/xfs_fsck.sh
> > > +++ b/fsck/xfs_fsck.sh
> > > @@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
> > >  fi
> > > 
> > >  if $FORCE; then
> > > -       xfs_repair -e $DEV
> > > -       repair2fsck_code $?
> > > +       if $AUTO; then
> > > +               xfs_repair -e $DEV
> > > +                error=$?
> > > +                if [ $error -eq 2 ]; then
> > > +                        echo "Replaying log for $DEV"
> > > +                        mkdir -p /tmp/tmp_mnt
> > > +                        mount $DEV /tmp/tmp_mnt

Need to handle mount failure here - this will need to drop to
a rescue shell for manual recovery, I think.

Also, what happens if there are mount options set like quota?
Doesn't this make more work for the recovery process if we elide
things like that here?

> > > +                        umount /tmp/tmp_mnt
> > > +                        xfs_repair -e $DEV
> > > +                        error=$?
> > > +                        rmdir /tmp/tmp_mnt
> > > +                fi
> > > +        else
> > > +                #fsck.mode=force is set but fsck.repair=no
> > > +                xfs_repair -n $DEV
> > > +                error=$?
> > > +        fi

I think that adding a "check only" mode should be a separate patch.
We have to consider different things here, such as it will run on an
unrecovered filesystem and potentially report it damaged when, in
fact, all that is needed is log recovery to run....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx