Re: [PATCH v3] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Mon, 7 Nov 2022 08:55:47 -0800

On Mon, Nov 07, 2022 at 02:00:07PM +0800, Gao Xiang wrote:
> Hi folks,
> 
> On Fri, Nov 04, 2022 at 11:40:11AM +0530, Srikanth C S wrote:
> > After a recent data center crash, we had to recover root filesystems
> > on several thousands of VMs via a boot time fsck. Since these
> > machines are remotely manageable, support can inject the kernel
> > command line with 'fsck.mode=force fsck.repair=yes' to kick off
> > xfs_repair if the machine won't come up or if they suspect there
> > might be deeper issues with latent errors in the fs metadata, which
> > is what they did to try to get everyone running ASAP while
> > anticipating any future problems. But, fsck.xfs does not address the
> > journal replay in case of a crash.
> > 
> > fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> > possible that when the machine crashes, the fs is in an inconsistent
> > state with the journal log not yet replayed. This can drop the machine
> > into the rescue shell because xfs_fsck.sh does not know how to clean the
> > log. Since the administrator told us to force repairs, address the
> > deficiency by cleaning the log and rerunning xfs_repair.
> > 
> > Run xfs_repair -e when fsck.mode=force and fsck.repair=auto or yes.
> > Replay the log only if fsck.mode=force and fsck.repair=yes. For the
> > other option combinations, -fa and -f, drop to the rescue shell if
> > xfs_repair detects any corruption.
> > 
> > Signed-off-by: Srikanth C S <srikanth.c.s@xxxxxxxxxx>
> > ---
> >  fsck/xfs_fsck.sh | 31 +++++++++++++++++++++++++++++--
> >  1 file changed, 29 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> > index 6af0f22..62a1e0b 100755
> > --- a/fsck/xfs_fsck.sh
> > +++ b/fsck/xfs_fsck.sh
> > @@ -31,10 +31,12 @@ repair2fsck_code() {
> >  
> >  AUTO=false
> >  FORCE=false
> > +REPAIR=false
> >  while getopts ":aApyf" c
> >  do
> >         case $c in
> > -       a|A|p|y)        AUTO=true;;
> > +       a|A|p)          AUTO=true;;
> > +       y)              REPAIR=true;;
> >         f)              FORCE=true;;
> >         esac
> >  done
> > @@ -64,7 +66,32 @@ fi
> >  
> >  if $FORCE; then
> >         xfs_repair -e $DEV
> > -       repair2fsck_code $?
> > +       error=$?
> > +       if [ $error -eq 2 ] && [ $REPAIR = true ]; then
> > +               echo "Replaying log for $DEV"
> > +               mkdir -p /tmp/repair_mnt || exit 1
> > +               for x in $(cat /proc/cmdline); do
> > +                       case $x in
> > +                               root=*)
> > +                                       ROOT="${x#root=}"
> > +                               ;;
> > +                               rootflags=*)
> > +                                       ROOTFLAGS="-o ${x#rootflags=}"
> > +                               ;;
> > +                       esac
> > +               done
> > +               test -b "$ROOT" || ROOT=$(blkid -t "$ROOT" -o device)
> 
> We'd also like a proper solution to this for our production systems,
> so that xfs_repair can work together with log recovery.

My preferred solution is to port the log recovery code to userspace, and
then train xfs_repair to invoke it.  Handling the trivial case where
xfs_repair can recover logs created on the same platform as the support
script wouldn't be that hard (I think?) because log recovery is fairly
self-contained nowadays.

But.

Inevitably someone will suggest fixing the kernel's inability to recover
a log from a platform with a different endianness, which will lead to a
discussion of making the ondisk log format endian safe.  Someone else
may also ask why not make userspace xfs_trans transactional, and... ;)

(All those extra asks are ok, but anyone taking on these task sets
should make it /very/ clear where the scope of each set begins and ends,
and in which order they'll be worked on.)

> However, may I ask whether simply doing another mount/unmount cycle is
> the preferred way to implement this? I'm not sure whether some
> customized initramfses could keep the fs busy so that it won't
> unmount properly.

Seeing as initramfses are only supposed to turn on enough hardware so
that mount can find the root volume, I really hope there aren't
*background services* running here.
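
For illustration, a minimal sketch of a mount/unmount cycle that refuses
to continue when the unmount fails, which is the failure mode raised
above.  The function name replay_log and the MOUNT/UMOUNT override
variables are hypothetical and not part of the patch; they exist only so
the logic can be exercised with stubs.

```shell
#!/bin/sh
# Sketch only: replay_log, MOUNT and UMOUNT are hypothetical names.
MOUNT=${MOUNT:-mount}
UMOUNT=${UMOUNT:-umount}

# replay_log DEV MNT
# Mount DEV on MNT so the kernel replays the log, then unmount again.
# Returns 0 on success, 1 if mkdir or mount failed, 2 if umount failed.
replay_log() {
	dev=$1
	mnt=$2
	mkdir -p "$mnt" || return 1
	"$MOUNT" "$dev" "$mnt" || return 1
	if ! "$UMOUNT" "$mnt"; then
		# The fs is still mounted; running xfs_repair now would be unsafe.
		echo "cannot unmount $mnt, skipping xfs_repair" >&2
		return 2
	fi
	# Best-effort cleanup of the temporary mount point.
	rmdir "$mnt" 2>/dev/null
	return 0
}
```

A caller would rerun xfs_repair only when replay_log returns 0, and
treat a return of 2 as a reason to drop to the rescue shell.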

> Alternatively, could we consider another approach, such as exporting
> the log recovery functionality through an ioctl() so that log recovery
> can run without actually mounting the fs? Would that be feasible?

I guess you could create a 'recoveryonly' mount option that would abort
the mount after recovering the log.  I'm not really a fan of that
approach.

--D

> Thanks,
> Gao Xiang
> 
> > +               if [ "$(basename "$DEV")" = "$(basename "$ROOT")" ]; then
> > +                       mount $DEV /tmp/repair_mnt $ROOTFLAGS || exit 1
> > +               else
> > +                       mount $DEV /tmp/repair_mnt || exit 1
> > +               fi
> > +               umount /tmp/repair_mnt
> > +               xfs_repair -e $DEV
> > +               error=$?
> > +               rm -d /tmp/repair_mnt
> > +       fi
> > +       repair2fsck_code $error
> >         exit $?
> >  fi
> >  
> > -- 
> > 1.8.3.1
> >
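
The option handling described in the commit message can be sketched as a
small standalone function.  This is an illustration under assumptions,
not the patch itself: plan_action and its output strings are
hypothetical, and "skip" assumes that nothing is repaired without -f,
which the quoted hunk does not show for the rest of the script.

```shell
#!/bin/sh
# Sketch only: plan_action and its output strings are hypothetical.
# Prints "skip", "repair" or "repair+replay" for a set of fsck options.
plan_action() {
	force=false
	repair=false
	OPTIND=1	# reset so the function can be called more than once
	while getopts ":aApyf" c; do
		case $c in
		y) repair=true ;;	# fsck.repair=yes
		f) force=true ;;	# fsck.mode=force
		esac			# -a, -A, -p (auto) do not enable log replay
	done
	if ! $force; then
		echo "skip"
	elif $repair; then
		# xfs_repair -e, then replay the log and rerun if it exits with 2
		echo "repair+replay"
	else
		# xfs_repair -e only; a dirty log ends in the rescue shell
		echo "repair"
	fi
}
```

With this mapping, -fy is the only combination that allows the log to be
replayed, while -fa and -f run xfs_repair -e once and report its result.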