Re: xfs_repair hangs at "process newly discovered inodes..."

On Sat, Nov 19, 2022 at 12:24:18PM -0500, iamdooser wrote:
> Thank you for responding.
> 
> Yes that found errors, although I'm not accustomed to interpreting the
> output.
> 
> xfs_repair version 5.18.0
> 
> The output of xfs_repair -nv was quite large, as was the xfs_metadump...not
> sure that's indicative of something, but I've uploaded them here:
> https://drive.google.com/drive/folders/1OyQOZNsTS1w1Utx1ZfQEH-bS_Cyj8-F2?usp=sharing

Ok....

According to the "-nv" output, you have a clean log and widespread
per-AG btree corruptions and inconsistencies: free inodes not found
in the finobt, free space only found in one free space btree, records
in btrees out of order, multiply-claimed blocks (cross-linked files
and cross-linked free space!), etc.

Every AG shows the same corruption pattern - I've never seen a
filesystem with a clean log in this state before. This sort of
widespread lack of consistency in btree structures isn't a result of
an isolated storage media or I/O error - something major has
happened here.

The first thing I have to ask: did you zero the log with xfs_repair
because you couldn't repair it, and then take these repair output
dumps? This *smells* like the log being zeroed with xfs_repair and
all the metadata in the log being thrown away after a bunch of files
were removed and the system crashed immediately afterwards. Had log
recovery been allowed to run in that case, it would have left the
btrees and inode states mostly consistent...
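
To be clear, by "zeroing the log" I mean a forced repair along these
lines (the device path is just a placeholder for whatever you ran it
on):

  # -L discards all the metadata updates held in the dirty log
  xfs_repair -L /dev/sdX1

Throwing away a dirty log like that leaves everything the log would
have replayed permanently unapplied, which is consistent with the
sort of btree damage seen here.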

Can you please explain how the filesystem got into this state in the
first place? What storage you have, what kernel you are running,
what distro/appliance this filesystem is hosted on, what operations
were being performed when it all went wrong, etc? We really need to
know how the fs got into this state so that we can determine if
other users are at risk of this sort of thing...
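
Something along these lines would cover the basics (the device and
mount point are placeholders, substitute your own):

  uname -a                        # kernel version
  cat /etc/os-release             # distro/appliance
  lsblk -o NAME,SIZE,TYPE,MODEL   # storage layout
  xfs_info /mnt/point             # fs geometry, if it still mounts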

> There doesn't seem to be much activity once it hangs at "process newly
> discovered inodes..." so it doesn't seem like just a slow repair. Despite
> there being no sign of activity, I've let it run for 24+ hours and saw no
> changes...

Use "-t 300" for xfs_repair to output a progress report every 5
minutes. Likely the operation is slow because it is IO bound moving
one inode at a time to lost+found...
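
That is, something like (again, device path is a placeholder):

  # report progress every 300 seconds
  xfs_repair -t 300 /dev/sdX1

If the reports show the counts still ticking over, the repair is
slow rather than hung.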

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


