Re: XFS disaster recovery

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 8 Feb 2022 09:33:52 +1100

On Mon, Feb 07, 2022 at 05:03:03PM -0500, Sean Caron wrote:
> Hi Dave,
> 
> OK! With your patch and help on that other thread pertaining to
> xfs_metadump I was able to get a full metadata dump of this
> filesystem.
> 
> I used xfs_mdrestore to set up a sparse image for this volume using my
> dumped metadata:
> 
> xfs_mdrestore /exports/home/work/md4.metadump /exports/home/work/md4.img
> 
> Then set up a loopback device for it and tried to mount.
> 
> losetup --show --find /exports/home/work/md4.img
> mount /dev/loop0 /mnt
> 
> When I do this, I get a "Structure needs cleaning" error and the
> following in dmesg:
> 
> [523615.874581] XFS (loop0): Corruption warning: Metadata has LSN (7095:2330880) ahead of current LSN (7095:2328512). Please unmount and run xfs_repair (>= v4.3) to resolve.
> [523615.874637] XFS (loop0): Metadata corruption detected at xfs_agi_verify+0xef/0x180 [xfs], xfs_agi block 0x10
> [523615.874666] XFS (loop0): Unmount and run xfs_repair
> [523615.874679] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> [523615.874695] 00000000: 58 41 47 49 00 00 00 01 00 00 00 00 0f ff ff f8  XAGI............
> [523615.874713] 00000010: 00 03 ba 40 00 04 ef 7e 00 00 00 02 00 00 00 34  ...@...~.......4
> [523615.874732] 00000020: 00 30 09 40 ff ff ff ff ff ff ff ff ff ff ff ff  .0.@............
> [523615.874750] 00000030: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> [523615.874768] 00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> [523615.874787] 00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> [523615.874806] 00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> [523615.874824] 00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> [523615.874914] XFS (loop0): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x10 len 8 error 117
> [523615.874998] XFS (loop0): xfs_imap_lookup: xfs_ialloc_read_agi() returned error -117, agno 0
> [523615.876866] XFS (loop0): Failed to read root inode 0x80, error 117

Hmmm - I think this is after log recovery. The nature of the error
(metadata LSN a few blocks larger than the current recovered LSN)
implies that part of the log was lost during device failure/recovery
and hence not recovered when mounting the filesystem.
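
As a rough illustration of the comparison behind that warning, the two
LSNs from the dmesg output can be split into their cycle:block parts and
subtracted. This is only a sketch: the helper names (lsn_cycle,
lsn_block, lsn_gap) are made up for this example, and it only handles
the simple case where both LSNs are in the same cycle.

```shell
#!/bin/sh
# Hypothetical helpers: split an LSN of the form "cycle:block", as
# printed in the XFS corruption warning, into its two fields.
lsn_cycle() { echo "${1%%:*}"; }
lsn_block() { echo "${1##*:}"; }

# Print how far the metadata LSN ($1) is ahead of the current LSN ($2).
lsn_gap() {
    mc=$(lsn_cycle "$1"); mb=$(lsn_block "$1")
    cc=$(lsn_cycle "$2"); cb=$(lsn_block "$2")
    if [ "$mc" -ne "$cc" ]; then
        # Across a cycle boundary the distance depends on the log size,
        # which this sketch does not know.
        echo "different cycles ($mc vs $cc): gap depends on log size"
        return 0
    fi
    echo "same cycle $mc, metadata ahead by $((mb - cb)) log blocks"
}

# Values taken from the warning quoted above.
lsn_gap 7095:2330880 7095:2328512
```

Both LSNs share cycle 7095, so the metadata was stamped in the same
pass through the log, just past the point the recovered log reaches.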

> Seems like the next step is to just run xfs_repair (with or without
> log zeroing?) on this image and see what shakes out?

Yup.

You may be able to run it on the image file without log zeroing
after the failed mount if there were no pending intents that needed
replay.  But it doesn't matter if you do zero the log at this point,
as log recovery has already replayed everything it can replay back
into the filesystem and it will be as consistent as it's going to get.
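
A minimal sketch of that sequence against the restored image, assuming
the image path used earlier in the thread. The run and repair_image
wrappers are hypothetical helpers for this example; by default they only
print the commands, and setting RUN=1 executes them. The xfs_repair
options used are -n (no-modify check), -f (target is a regular file) and
-L (zero the log).

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless RUN=1 is set.
# IMG defaults to the image path used earlier in the thread.
IMG=${IMG:-/exports/home/work/md4.img}

# Hypothetical wrapper: execute the command if RUN=1, else just print it.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

repair_image() {
    # 1. No-modify pass to preview what xfs_repair would change.
    run xfs_repair -n -f "$1"
    # 2. Real repair without zeroing the log; if that fails (for example
    #    because xfs_repair refuses to proceed with a dirty log), retry
    #    with -L to zero the log first.
    run xfs_repair -f "$1" || run xfs_repair -L -f "$1"
}

repair_image "$IMG"
```

In dry-run mode only the first two commands are printed, since the -L
retry is reached only when the plain repair actually fails. Because this
is a sparse image restored from a metadump, a bad outcome can be thrown
away and the image recreated with xfs_mdrestore.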

Regardless, you are still likely to get a bunch of "unlinked but not
freed" inode warnings and inconsistent free space because the mount
failed between the initial recovery phase and the final recovery
phase that runs intent replay and processes unlinked inodes.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx