Re: xfs_repair crashing (versions 3.1.4 and 3.1.5)

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 19 Apr 2011 18:27:05 +1000

On Mon, Apr 18, 2011 at 09:24:22PM +0200, Anisse Astier wrote:
> Hi,
> 
> (first of all, I'm not subscribed to the list, Please cc-me on all replies)
> 
> On an ARM NAS, using kernel 2.6.36.2 I managed to crash my root xfs partition.
> 
> xfs_repair cannot then repair this partition and is crashing itself.
> 
> # xfs_info  /dev/sda2
> meta-data=/dev/sda2              isize=256    agcount=32, agsize=7615249 blks
>          =                       sectsz=512   attr=1
> data     =                       bsize=4096   blocks=243687968, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=32768, version=1
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=65536  blocks=0, rtextents=0
> 
> 
> 
> I did a SMART test to ensure the disk didn't have any bad block:
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%      8327         -
> 
> The dmesg log from when I tried to mount the filesystem (on another
> recovery system with kernel 2.6.36-rc2):
> [ 1003.257446] XFS mounting filesystem sda2
> [ 1003.301519] Starting XFS recovery on filesystem: sda2 (logdev: internal)
> [ 1003.303068] XFS: bad number of regions (28024) in inode log format
> [ 1003.303142] XFS: log mount/recovery failed: error 5
> [ 1003.303419] XFS: log mount failed

Something has corrupted the log....

> I then had no other choice than suppressing the log with xfs_repair -L.

Yup.

> xfs_repair crashed, but I was able to mount the filesystem(ro), but
> once I tried accessing the corrupt files, xfs would go mad:
> [13717.138896] UDF-fs: No partition found (1)
> [13717.202112] XFS mounting filesystem sda2
> [13717.274885] Ending clean XFS mount for filesystem: sda2
> [43969.970648] sshd (1039): /proc/1039/oom_adj is deprecated, please use /proc/1039/oom_score_adj instead.
> [107180.252602] Filesystem "sda2": corrupt dinode 805341224, (btree extents).  Unmount and run xfs_repair.

Quite likely. Zeroing the log throws away whatever metadata changes
were still sitting in it, which effectively corrupts the filesystem.

.....
> directory flags set on non-directory inode 2283178100, would fix bad flags.
> bad key in bmbt root (is 73434, would reset to 74194) in inode 2283178100 data fork
> bad fwd (right) sibling pointer (saw 145202888 should be NULLDFSBNO)
> Segmentation fault

Hmmm. The very next line doesn't appear before the segfault, making
me think that it's the printf that is causing it to crash.

        if (check_dups == 0 &&
                cursor.level[0].right_fsbno != NULLDFSBNO)  {
                do_warn(
        _("bad fwd (right) sibling pointer (saw %llu should be NULLDFSBNO)\n"),
                        cursor.level[0].right_fsbno);

We get this line of output.

                do_warn(
        _("\tin inode %u (%s fork) bmap btree block %llu\n"),
                        XFS_AGINO_TO_INO(mp, agno, ino), forkname,
                        cursor.level[0].fsbno);

But not this one. I wonder if passing a 64-bit number to a %u
conversion (it should be %llu) causes problems on ARM? All the variables
are valid, as they are printed or accessed elsewhere in the function,
so that's the only thing I can think of without a stack trace to
tell me otherwise....
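
A minimal sketch of what a call with matching conversions might look
like, assuming the inode number really is a 64-bit type. The names
demo_ino_t and format_bmbt_warning are made up for illustration, and
snprintf stands in for do_warn so the result can be inspected; this is
not the actual xfs_repair code or a tested patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the 64-bit inode number type. */
typedef uint64_t demo_ino_t;

/*
 * Format the "in inode ..." warning with conversions that match the
 * width of each argument. Both 64-bit values are cast to
 * unsigned long long so that %llu is correct on every platform,
 * whatever the underlying typedef is.
 *
 * Returns whatever snprintf returns: the number of characters that
 * the full message needs, excluding the terminating NUL.
 */
static int format_bmbt_warning(char *buf, size_t len, demo_ino_t ino,
                               const char *forkname, uint64_t fsbno)
{
    assert(buf != NULL && len > 0);
    assert(forkname != NULL);

    return snprintf(buf, len,
                    "\tin inode %llu (%s fork) bmap btree block %llu\n",
                    (unsigned long long)ino, forkname,
                    (unsigned long long)fsbno);
}
```

With a mismatched %u, the variadic call may read the wrong words for
the arguments that follow, so the %s could end up dereferencing a bogus
pointer; that would fit a segfault on this line, but it remains a guess
until there is a stack trace. GCC's -Wformat reports this kind of
mismatch for the printf family, and for wrapper functions that are
declared with the format(printf, ...) attribute.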

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
