Re: Read corruption on ARM

Eric Sandeen <sandeen@xxxxxxxxxxx> · Tue, 26 Feb 2013 16:33:58 -0600

On 2/26/13 3:58 PM, Jason Detring wrote:
> Hello list,
> 
> I'm seeing filesystem read corruption on my NAS box.
> 
> My machine is an ARMv5 unit; this guy here:
>    <http://buffalo.nas-central.org/wiki/Category:LSPro>
> The hard disk is a Seagate 2TB ST32000644NS enterprise drive on the
> SoC's SATA controller.
> The unit is on a UPS and almost never sees unclean stops.
> 
> # xfs_info /dev/sda4
> meta-data=/dev/sda4              isize=256    agcount=4, agsize=121469473 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=485877892, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=237245, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> This is a "from zero" clean installation since the original HDD was lost,
> so the original factory firmware is gone.  It runs Slackware ARM (-current) now.
> The majority of the disk, 1.9T, is an unmanaged XFS mass storage partition.
> The file system was created mid-2010 by then-current tools and kernels.
> The remainder is boot, OS, /home, and scratch on ext3.
> Mass storage is always mounted ro,noatime on system startup,
> then remounted rw,noatime when I am ready to start performing operations.
> Write caching is disabled on the HDD as part of OS startup,
> usually after ro mount but before rw.
> 
> I am currently running an unpatched, vanilla 3.7.9 kernel, though this
> corruption has been going on for over a year across many quarterly
> kernel releases.
> I had been working around it, but it's just now become irritating enough for
> me to look into it.  The other unresolved ARM report from about a month ago
> was enough to prod me into action. :-)
> 
> 
> The error seems to be triggered on some directory or file lookups, but not all.
> So, some files and directories can be opened in regular userspace or via NFS,
> but others are inaccessible.  This is not one or two files; it is often
> 1/4 to 1/3 of the entire file system.
> Each misread item triggers a backtrace in the kernel log similar to this:
> 
> [  465.441259] c6a59000: 58 46 53 42 00 00 10 00 00 00 00 00 1c f5 e8 84  XFSB............
> [  465.449461] XFS (sda4): Internal error xfs_da_do_buf(2) at line 2192 of file fs/xfs/xfs_da_btree.c.  Caller 0xbf05de4c
> [  465.449461]
> [  465.461982] [<c001f0f4>] (unwind_backtrace+0x0/0x12c) from
> [<bf029ff0>] (xfs_corruption_error+0x58/0x74 [xfs])
> [  465.462606] [<bf029ff0>] (xfs_corruption_error+0x58/0x74 [xfs])
> from [<bf0588fc>] (xfs_da_read_buf+0x134/0x1b0 [xfs])
> [  465.463384] [<bf0588fc>] (xfs_da_read_buf+0x134/0x1b0 [xfs]) from
> [<bf05de4c>] (xfs_dir2_leaf_readbuf+0x3a4/0x5f4 [xfs])
> [  465.464230] [<bf05de4c>] (xfs_dir2_leaf_readbuf+0x3a4/0x5f4 [xfs])
> from [<bf05e574>] (xfs_dir2_leaf_getdents+0xfc/0x3cc [xfs])
> [  465.465016] [<bf05e574>] (xfs_dir2_leaf_getdents+0xfc/0x3cc [xfs])
> from [<bf05aaec>] (xfs_readdir+0xc4/0xd0 [xfs])
> [  465.465641] [<bf05aaec>] (xfs_readdir+0xc4/0xd0 [xfs]) from
> [<bf02ac08>] (xfs_file_readdir+0x44/0x54 [xfs])
> [  465.465919] [<bf02ac08>] (xfs_file_readdir+0x44/0x54 [xfs]) from
> [<c00c9644>] (vfs_readdir+0x7c/0xac)
> [  465.465979] [<c00c9644>] (vfs_readdir+0x7c/0xac) from [<c00c9810>]
> (sys_getdents64+0x64/0xcc)
> [  465.466035] [<c00c9810>] (sys_getdents64+0x64/0xcc) from
> [<c0019080>] (ret_fast_syscall+0x0/0x2c)
> [  465.466066] XFS (sda4): Corruption detected. Unmount and run xfs_repair
> 
> I've run xfs_repair offline on the hardware itself, but the tool never
> finds problems.
> Removing the disk from the NAS and mounting it in a desktop always
> shows a clean, readable filesystem.
> 
> 
> This also seems to impact the Raspberry Pi.  Below shows a 256 MB test
> case filesystem.
> The filesystem was created on an x86-64 box by mkfs.xfs 3.1.8 and
> populated by kernel 3.6.9.
> This failure report is Linux 3.6.11-g89caf39 built by GCC 4.7.2 from
>    <https://github.com/raspberrypi/linux/commits/rpi-3.6.y>
> The problem appears to be tied to the filesystem, not the media,
> since both an external USB reader and a loopback-mounted image on the
> unit's main SD media show the same backtrace.  The loopback image was
> captured on other hardware, then copied onto the RPi via network.
> 
> # xfs_info /dev/sdb1
> meta-data=/dev/sdb1              isize=256    agcount=4, agsize=15413 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=61651, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=1200, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> [   90.638514] XFS (sdb1): Mounting Filesystem
> [   92.154824] XFS (sdb1): Ending clean mount
> [   99.010151] db027000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 f0 d3  XFSB............
> [   99.018213] XFS (sdb1): Internal error xfs_da_do_buf(2) at line 2192 of file fs/xfs/xfs_da_btree.c.  Caller 0xbf1448e4

So this came out of xfs_da_read_buf(), and it thought it was reading
metadata but got something it didn't recognize.

The hex dump above shows that it got what looks like the XFS superblock
magic (58 46 53 42, "XFSB").

> [   99.018213]
> [   99.030528] Backtrace:
> [   99.030605] [<c001c1f8>] (dump_backtrace+0x0/0x10c) from
> [<c0381244>] (dump_stack+0x18/0x1c)
> [   99.030653]  r6:bf171e38 r5:bf171e38 r4:bf171dd4 r3:dce6ac40
> [   99.030998] [<c038122c>] (dump_stack+0x0/0x1c) from [<bf1105f0>]
> (xfs_error_report+0x5c/0x68 [xfs])
> [   99.031329] [<bf110594>] (xfs_error_report+0x0/0x68 [xfs]) from
> [<bf110658>] (xfs_corruption_error+0x5c/0x78 [xfs])
> [   99.031346]  r5:00000001 r4:c1abf800
> [   99.031784] [<bf1105fc>] (xfs_corruption_error+0x0/0x78 [xfs]) from
> [<bf13fa58>] (xfs_da_read_buf+0x160/0x194 [xfs])
> [   99.031800]  r6:58465342 r5:dcdd9d80 r4:00000075
> [   99.032311] [<bf13f8f8>] (xfs_da_read_buf+0x0/0x194 [xfs]) from
> [<bf1448e4>] (xfs_dir2_leaf_readbuf+0x22c/0x628 [xfs])
> [   99.032822] [<bf1446b8>] (xfs_dir2_leaf_readbuf+0x0/0x628 [xfs])

when reading a leaf format directory

> from [<bf1451ac>] (xfs_dir2_leaf_getdents+0x134/0x3d4 [xfs])
> [   99.033326] [<bf145078>] (xfs_dir2_leaf_getdents+0x0/0x3d4 [xfs])
> from [<bf141a44>] (xfs_readdir+0xdc/0xe4 [xfs])
> [   99.033742] [<bf141968>] (xfs_readdir+0x0/0xe4 [xfs]) from
> [<bf111398>] (xfs_file_readdir+0x4c/0x5c [xfs])
> [   99.033939] [<bf11134c>] (xfs_file_readdir+0x0/0x5c [xfs]) from
> [<c00f1874>] (vfs_readdir+0xa0/0xc4)
> [   99.033954]  r7:dcdd9f78 r6:c00f158c r5:00000000 r4:dcf8aee0
> [   99.034004] [<c00f17d4>] (vfs_readdir+0x0/0xc4) from [<c00f1a50>]
> (sys_getdents64+0x68/0xd8)
> [   99.034052] [<c00f19e8>] (sys_getdents64+0x0/0xd8) from
> [<c0018900>] (ret_fast_syscall+0x0/0x30)
> [   99.034066]  r7:000000d9 r6:0068ff58 r5:006882a8 r4:00000000
> [   99.034101] XFS (sdb1): Corruption detected. Unmount and run xfs_repair
> 
> # xfs_info loop/
> meta-data=/dev/loop0             isize=256    agcount=4, agsize=15413 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=61651, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=1200, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> [ 1347.630983] XFS (loop0): Mounting Filesystem
> [ 1347.745898] XFS (loop0): Ending clean mount
> [ 1351.743284] db273000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 f0 d3  XFSB............
> [ 1351.751716] XFS (loop0): Internal error xfs_da_do_buf(2) at line 2192 of file fs/xfs/xfs_da_btree.c.  Caller 0xbf1448e4
> [ 1351.751716]
> [ 1351.764072] Backtrace:
> [ 1351.764148] [<c001c1f8>] (dump_backtrace+0x0/0x10c) from
> [<c0381244>] (dump_stack+0x18/0x1c)
> [ 1351.764204]  r6:bf171e38 r5:bf171e38 r4:bf171dd4 r3:c189ac40
> [ 1351.764552] [<c038122c>] (dump_stack+0x0/0x1c) from [<bf1105f0>]
> (xfs_error_report+0x5c/0x68 [xfs])
> [ 1351.764924] [<bf110594>] (xfs_error_report+0x0/0x68 [xfs]) from
> [<bf110658>] (xfs_corruption_error+0x5c/0x78 [xfs])
> [ 1351.764945]  r5:00000001 r4:c1968000
> [ 1351.765386] [<bf1105fc>] (xfs_corruption_error+0x0/0x78 [xfs]) from
> [<bf13fa58>] (xfs_da_read_buf+0x160/0x194 [xfs])
> [ 1351.765403]  r6:58465342 r5:dce25d80 r4:00000075
> [ 1351.765920] [<bf13f8f8>] (xfs_da_read_buf+0x0/0x194 [xfs]) from
> [<bf1448e4>] (xfs_dir2_leaf_readbuf+0x22c/0x628 [xfs])
> [ 1351.766432] [<bf1446b8>] (xfs_dir2_leaf_readbuf+0x0/0x628 [xfs])
> from [<bf1451ac>] (xfs_dir2_leaf_getdents+0x134/0x3d4 [xfs])
> [ 1351.766942] [<bf145078>] (xfs_dir2_leaf_getdents+0x0/0x3d4 [xfs])
> from [<bf141a44>] (xfs_readdir+0xdc/0xe4 [xfs])
> [ 1351.767363] [<bf141968>] (xfs_readdir+0x0/0xe4 [xfs]) from
> [<bf111398>] (xfs_file_readdir+0x4c/0x5c [xfs])
> [ 1351.767557] [<bf11134c>] (xfs_file_readdir+0x0/0x5c [xfs]) from
> [<c00f1874>] (vfs_readdir+0xa0/0xc4)
> [ 1351.767574]  r7:dce25f78 r6:c00f158c r5:00000000 r4:c18e57e0
> [ 1351.767622] [<c00f17d4>] (vfs_readdir+0x0/0xc4) from [<c00f1a50>]
> (sys_getdents64+0x68/0xd8)
> [ 1351.767670] [<c00f19e8>] (sys_getdents64+0x0/0xd8) from
> [<c0018900>] (ret_fast_syscall+0x0/0x30)
> [ 1351.767683]  r7:000000d9 r6:00642f58 r5:0063b2a8 r4:00000000
> [ 1351.767719] XFS (loop0): Corruption detected. Unmount and run xfs_repair
> 
> 
> 
> Here's the kicker:  All this seems to happen only if xfs.ko is
> crosscompiled with GCC 4.6 or 4.7.

urk!  That is a kicker.

> A module (just the module, the rest of kernel can be built with
> anything) compiled with
> cross-GCC 4.4.1, 4.5.4, or curiously 4.8 (20130224) has no issue at all.
> I've kept an old 2009 Sourcery G++ (4.4.1) Lite toolchain around just
> for building kernels.
> I'd really like to retire it, but I'm a little afraid this is going to
> recur in newer compilers.

Maybe you can provide an xfs.ko built with each (for the same kernel)
with debug info, and we can compare the disassembly?
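
One possible way to do that comparison, as a rough sketch only: dump each
module to text with the cross toolchain's objdump (for example
"objdump -d xfs.ko > build_a.txt"; the toolchain prefix and the file
names are assumptions), then strip the addresses and raw encodings so
that only real instruction differences show up in a diff.  The helper
names below are made up for illustration.

```python
import difflib
import re

# Matches objdump's per-function header, e.g. "00001234 <xfs_da_read_buf>:"
FUNC_HEADER = re.compile(r"^[0-9a-f]+ <(?P<name>[^>]+)>:$")


def normalize(line):
    """Keep only mnemonic and operands from one 'objdump -d' row.

    Rows are tab separated: address, raw encoding, mnemonic, operands.
    Lines that do not fit that shape are returned stripped but unchanged.
    """
    fields = line.split("\t")
    if len(fields) >= 3:
        return " ".join(f.strip() for f in fields[2:])
    return line.strip()


def extract_function(listing, name):
    """Return the normalized instruction lines of one function."""
    out, inside = [], False
    for line in listing.splitlines():
        header = FUNC_HEADER.match(line)
        if header:
            inside = header.group("name") == name
            continue
        if inside and line.strip():
            out.append(normalize(line))
    return out


def diff_function(listing_a, listing_b, name):
    """Unified diff of one function between two disassembly listings."""
    a = extract_function(listing_a, name)
    b = extract_function(listing_b, name)
    return list(difflib.unified_diff(a, b, "build_a", "build_b", lineterm=""))
```

Calling diff_function() with the text of the two listings and
"xfs_da_read_buf" (or "xfs_dir2_leaf_readbuf") would be a reasonable
place to start, since both appear in the backtrace.  Branch targets still
carry absolute offsets, so some noise from shifted code is expected.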

> Is there something in the path lookup routine that is disagreeable to
> GCCs targeting ARM?

at one point there were some alignment issues, but that was for
the old ABI, etc.  I'm not aware of anything right now.

> Any other ideas on what could be happening?

Since you got xfs superblock magic, I wonder if you read block 0
rather than the intended block, due to $SOMETHING going wrong...
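
As a quick sanity check on that guess, the first 16 bytes of each dumped
buffer can be decoded against the start of the on-disk superblock, which
is big-endian: sb_magicnum (32 bits), sb_blocksize (32 bits), sb_dblocks
(64 bits).  A minimal sketch, with the byte strings copied from the log
lines quoted above and a helper name that is only illustrative:

```python
import struct

XFS_SB_MAGIC = 0x58465342  # ASCII "XFSB"

# First 16 bytes of each dumped buffer, as printed in the kernel log.
DUMPS = {
    "sda4": "58 46 53 42 00 00 10 00 00 00 00 00 1c f5 e8 84",
    "sdb1": "58 46 53 42 00 00 10 00 00 00 00 00 00 00 f0 d3",
}


def decode_sb_prefix(hexdump):
    """Decode magic, block size and data block count from a hex string."""
    raw = bytes.fromhex(hexdump)
    magic, blocksize, dblocks = struct.unpack(">IIQ", raw[:16])
    return {"magic": magic, "blocksize": blocksize, "dblocks": dblocks}


for dev, dump in DUMPS.items():
    fields = decode_sb_prefix(dump)
    print(dev, hex(fields["magic"]), fields["blocksize"], fields["dblocks"])
```

The decoded block counts can be compared with the "blocks=" values in the
data section of the xfs_info output for each filesystem.  A match would
be consistent with the buffer really holding a superblock of that
filesystem, though it would not say which one, since every AG starts with
a superblock copy.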

Enabling the trace_xfs_da_btree_corrupt tracepoint might yield more
info, can you do that?

I think it's:

# trace-cmd record -e xfs_da_btree_corrupt &
# <do your dir read>
# fg
# ^C (ctrl-c trace-cmd)
# trace-cmd report

We might get more info about the buffer in question that way.
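
If trace-cmd isn't installed on the box, the same tracepoint can probably
be switched on by hand through the ftrace event files.  The sketch below
assumes debugfs is mounted at /sys/kernel/debug, that the kernel was
built with event tracing, and that it runs as root; the function names
are made up for illustration.

```python
import os

# Assumed default location of the ftrace control files (debugfs mount).
DEFAULT_TRACING_ROOT = "/sys/kernel/debug/tracing"
EVENT = "xfs/xfs_da_btree_corrupt"


def set_tracepoint(enabled, tracing_root=DEFAULT_TRACING_ROOT, event=EVENT):
    """Write 1 or 0 to the event's 'enable' file and return its path."""
    path = os.path.join(tracing_root, "events", event, "enable")
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
    return path


def read_trace(tracing_root=DEFAULT_TRACING_ROOT):
    """Return the current contents of the ftrace ring buffer as text."""
    with open(os.path.join(tracing_root, "trace")) as f:
        return f.read()
```

The idea would be to call set_tracepoint(True), redo the failing
directory read, then print read_trace() and call set_tracepoint(False).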

-Eric

> Thanks,
> Jason
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
> 
