Re: XFS Disk Repair failing with err 117 (Help Recovering Data)

Eric Sandeen <sandeen@xxxxxxxxxxx> · Sat, 19 Sep 2020 09:21:47 -0500

On 9/19/20 8:40 AM, nitsuga5124 wrote:
> First of all, i want to say that i think this is not a hardware issue, the hard drive sounds fine and it didn't show any signs of slowness, it's also not very old, got it the 10th of January.
> 
> The entire disk is behind luks encryption:
> ```
> #lsblk
> sdc         8:32   0   3.6T  0 disk
> └─storage 254:1    0   3.6T  0 crypt
> ```
> 
> Recently i've been having some issues where the drive suddenly becomes unreadable (seems to be caused by kited) but all of the times it happened, using `xfs_repair` with the following syntax `sudo xfs_repair /dev/mapper/storage -v`; made the drive readable again, but this time i didn't have this luck, and the drive was unable to be repaired, leading to this error at the end:
> 
> ```
> Phase 6 - check inode connectivity...
>         - resetting contents of realtime bitmap and summary inodes
>         - traversing filesystem ...
>         - agno = 0
> bad hash table for directory inode 128 (no data entry): rebuilding
> rebuilding directory inode 128
> Invalid inode number 0x0
> Metadata corruption detected at 0x56135ef283e8, inode 0x84 data fork
> 
> fatal error -- couldn't map inode 132, err = 117

We need more info:

What kernel version
What xfsprogs version

What were the prior kernel messages
What were the prior xfs_repair messages

> ```
> 
> I ran `xfs_db` to see what was happening on that inode:
> `sudo xfs_db -x -c 'blockget inode 132' /dev/mapper/storage > xfs_db.log`

FWIW, "blockget inode 132" is not a valid command in xfs_db.

Perhaps you mean "blockget -i 132?"

In any case, "blockget" scans the entire disk, and apparently finds lots of
corruption along the way.

Did you unmount and open/unlock the LUKS volume before you ran repair and/or
xfs_db?

> and this are the outputs:
> 
> - stderr:
> ```
> Metadata corruption detected at 0x557ecac7bbfd, xfs_bnobt block 0x1b275ab8/0x1000
> Metadata corruption detected at 0x557ecac7bbfd, xfs_bnobt block 0x1b2d32e0/0x1000

<snip>

> ```
> 
> - stdout:
> It's a 14GB file, i will not send it.
> 
> Help trying to recover the data (move it to another drive) would be great.
> ideally i would fix the error, and i would move the data to another drive, shrink the partition, make a new partition with EXT4 instead of XFS, and move the data to the new partition, shrink, expand, and keep doing that until everything is moved to EXT4 or some other file system, or XFS again without whole drive encryption; i think this issues are occurring because using luks on XFS is not a common thing, so it's probably untested and unstable.

It really should be fine.

> 
> http://xfs.9218.n7.nabble.com/Assert-in-xfs-repair-Phase-7-and-other-xfs-restore-problems-td33368.html

(email above is from 8 years ago)

> I have found a possible solution for this problem in this mailing list.
> The solution looks to be to clear the corrupted inode (132)
> but i'm unable to find a way to do this while the drive is unmounted. Is this possible? if so, how?
> 
> https://forums.unraid.net/topic/66749-xfs-file-system-corruption-safest-way-to-fix/
> 
> I have also found this post on the unraid forums about the same error, (where i also found this email address), and the solution was the use a way older version of xfs_repair, but sadly that solution didn't work (probably because it's way too old?); Can you also confirm that the issue is fixed on xfs_repair xfs_repair version 5.8.0?

It's not clear what the issue is at this point.

XFS doesn't generally see self-induced fs-wide corruption, it is almost always the
result of something gone wrong in the layers below.

While that may sound like deflection, it's my assessment after very many years of
looking into these sorts of issues.

-Eric