Re: Bug in xfs_repair 5.4.0 / Unable to repair metadata corruption

Eric Sandeen <sandeen@xxxxxxxxxxx> · Sun, 9 Feb 2020 21:47:18 -0600

On 2/9/20 12:19 AM, John Jore wrote:
> Hi all,
> 
> Not sure if this is the appropriate forum to report xfs_repair bugs? If not, please point me in the right direction.

This is the place.

> I have a corrupted XFS volume which mounts fine, but xfs_repair is unable to repair it, and the volume eventually shuts down due to metadata corruption if writes are performed.

What does dmesg say when it shuts down?

> 
> Originally I used xfs_repair from CentOS 8.1.1911, but I then cloned the latest xfs_repair from git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git (today, Feb 9th; it reports as version 5.4.0)
> 
> 
> Phase 3 - for each AG...
>         - scan and clear agi unlinked lists...
>         - 16:08:04: scanning agi unlinked lists - 64 of 64 allocation groups done
>         - process known inodes and perform inode discovery...
>         - agno = 45
>         - agno = 15
>         - agno = 0
>         - agno = 30
>         - agno = 60
>         - agno = 46
>         - agno = 16
> Metadata corruption detected at 0x4330e3, xfs_inode block 0x17312a3f0/0x2000
>         - agno = 61
>         - agno = 31
>         - agno = 47
>         - agno = 62
>         - agno = 48
>         - agno = 49
>         - agno = 32
>         - agno = 33
>         - agno = 17
>         - agno = 1
> bad magic number 0x0 on inode 18253615584
> bad version number 0x0 on inode 18253615584
> bad magic number 0x0 on inode 18253615585
> bad version number 0x0 on inode 18253615585
> bad magic number 0x0 on inode 18253615586 
> .....
> bad magic number 0x0 on inode 18253615584, resetting magic number
> bad version number 0x0 on inode 18253615584, resetting version number
> bad magic number 0x0 on inode 18253615585, resetting magic number
> bad version number 0x0 on inode 18253615585, resetting version number
> bad magic number 0x0 on inode 18253615586, resetting magic number
> bad version number 0x0 on inode 18253615586, resetting version number

Looks like a whole chunk of inodes with zeroed magic and version numbers, at least.

> ....
>         - agno = 16
>         - agno = 17
> Metadata corruption detected at 0x4330e3, xfs_inode block 0x17312a3f0/0x2000
>         - agno = 18
>         - agno = 19
> ...   
> Phase 7 - verify and correct link counts...
>         - 16:10:41: verify and correct link counts - 64 of 64 allocation groups done
> Metadata corruption detected at 0x433385, xfs_inode block 0x17312a3f0/0x2000
> libxfs_writebufr: write verifier failed on xfs_inode bno 0x17312a3f0/0x2000

This bit seems problematic, I guess it's unable to write the updated inode buffer,
due to some corruption, which presumably is why you keep tripping over the same
corruption each time.
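
To see where that buffer sits on disk, the address in the message can be decoded with plain shell arithmetic. This is only a rough sketch: it assumes the first number in "block 0x17312a3f0/0x2000" is a disk address in 512-byte sectors and the second is the buffer length in bytes, and sectors_to_bytes is a made-up helper name.

```shell
# Assumption: disk address is in 512-byte sectors, length is in bytes.
sectors_to_bytes() {
    # Shell arithmetic accepts 0x-prefixed hex constants.
    echo $(( $1 * 512 ))
}

sectors_to_bytes 0x17312a3f0   # byte offset of the failing inode buffer
echo $(( 0x2000 ))             # buffer length in bytes

# A read-only look at one of the reported inodes might then be
# (device name is a placeholder, as elsewhere in this mail):
# xfs_db -r -c "inode 18253615584" -c "print" /dev/$WHATEVER
```

The xfs_db line is left commented out because it needs the real device; -r opens it read-only.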

> releasing dirty buffer (bulk) to free list!
> 
>  
> 
> No matter how many times I re-run xfs_repair (I've lost count), with or without -d,

-d is for repairing a filesystem while it is mounted read-only.  I hope you are not doing that, are you?

> it never does repair the volume.
> Volume is a ~12TB LV built using 4x 4TB disks in RAID 5 using a 3Ware 9690SA controller.

Just to double check, are there any storage errors reported in dmesg?
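
Something along these lines would narrow dmesg down to the likely-relevant lines. It is a rough sketch only: the patterns are guesses rather than an exhaustive list, "3w-" assumes the 3ware driver logs under a name starting with that prefix, and filter_storage_errors is a made-up helper name.

```shell
# Keep only lines that look like storage or XFS trouble (case-insensitive).
# The pattern list is an assumption; extend it for your hardware.
filter_storage_errors() {
    grep -iE 'i/o error|medium error|3w-|xfs'
}

# Usage:
#   dmesg | filter_storage_errors
```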

> Any suggestions or additional data I can provide?

If you are willing to provide an xfs_metadump to me (off-list) I will see if I can
reproduce it from the metadump. 

# xfs_metadump /dev/$WHATEVER metadump.img
# bzip2 metadump.img

-Eric

> 
> John
>